Attention loss #37

@20IM30007

Description

Ground Truth Cross-Attention

1) We define the cross-attention ground truth for tokens as the L2-normalized vector, where:
       a) A value of 1 indicates that the word is active according to the word-level ground truth timestamp.
       b) A value of 0 indicates that no attention should be paid.
2) To account for small inaccuracies in the ground truth timestamps, we apply a linear interpolation of 4 steps (8 milliseconds) on both sides of the ground truth vector, transitioning smoothly from 0 to 1.
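If it helps to make the description concrete, the two rules plus the ramp can be sketched roughly as below. The function name and the distance-based ramp construction are my own; the description doesn't say how the interpolation is actually implemented:

```python
import numpy as np

def attention_target(active, ramp_steps=4):
    """Sketch of the cross-attention ground truth for one token.

    active: binary 0/1 array over encoder frames (1 = word is active
    per the word-level ground-truth timestamp).
    Returns an L2-normalized vector with a linear 0 -> 1 ramp of
    `ramp_steps` frames on each side of every active region.
    """
    active = np.asarray(active, dtype=float)
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return active  # no attention anywhere for this token
    frames = np.arange(len(active))
    # Distance (in frames) from each frame to the nearest active frame.
    dist = np.abs(frames[:, None] - idx[None, :]).min(axis=1)
    # Linear ramp: active frames get 1, frames more than `ramp_steps`
    # away get 0, with a straight-line transition in between.
    target = np.clip(1.0 - dist / (ramp_steps + 1), 0.0, 1.0)
    return target / np.linalg.norm(target)
```

With `ramp_steps=4`, frames at distance 1..4 from an active region get values 0.8, 0.6, 0.4, 0.2 before normalization, matching the "transitioning smoothly from 0 to 1" wording.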

Here, 4 steps corresponds to 80 milliseconds, right? And one encoder frame corresponds to 20 milliseconds, right?

30 * 1000 / 1500 = 20 ms per frame?
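For reference, the arithmetic behind the question, assuming the standard Whisper setup of 1500 encoder output frames per 30-second window:

```python
# Assuming 30 s of audio maps to 1500 encoder frames (standard Whisper).
frame_ms = 30 * 1000 / 1500   # duration of one encoder frame, in ms
ramp_ms = 4 * frame_ms        # width of the 4-step linear interpolation

# frame_ms == 20.0, ramp_ms == 80.0
```

Under that assumption the 4-step ramp spans 80 ms, not the 8 ms stated in the description.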
