
Could you provide the code for visualizing attention in Figure 2, or help us identify if there are any issues with our approach? #90

@yuhkalhic


Thank you for your excellent work. We have a question regarding the attention visualization in Figure 2 of your paper.

We attempted to reproduce the visualization using the following approach:

  • Taking the last transformer layer
  • Summing across all attention heads

An example of our method is shown below. However, our results differ significantly from yours: even when I restrict the code to the first head of the first transformer layer, I cannot reproduce attention scores as high, or a distribution pattern as pronounced, as the ones in your figure.
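For reference, here is a minimal sketch of the two steps above. It assumes a HuggingFace-style transformers model loaded with `output_attentions=True`; the model name and input sentence are placeholders, not necessarily the ones used in your paper:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

# Placeholder model; substitute whichever checkpoint the paper used.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "an example sentence"  # placeholder input
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq)
# tensor per layer.
last_layer = outputs.attentions[-1]        # take the last transformer layer
summed = last_layer.sum(dim=1).squeeze(0)  # sum across all attention heads

# Plot the resulting (seq_len, seq_len) attention map as a heatmap.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(summed.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar()
plt.show()
```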

Could you help us understand whether there are any specific preprocessing or normalization steps we might be missing?

To help us better understand and reproduce your results, would it be possible to share the visualization code you used? This would be incredibly helpful for our research.

Thank you for your time and assistance.

[Screenshot: 2024-12-11 145016]
