Ground Truth Cross-Attention
1) We define the cross-attention ground truth for tokens as the L2-normalized vector, where:
a) A value of 1 indicates that the word is active according to the word-level ground truth timestamp.
b) A value of 0 indicates that no attention should be paid.
2) To account for small inaccuracies in the ground truth timestamps, we apply a linear interpolation of 4 steps (8 milliseconds) on both sides of the ground truth vector, transitioning smoothly from 0 to 1.
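For reference, here is how I read the two steps above, as a minimal sketch rather than the repository's actual code. The function name `ground_truth_attention`, the frame-index arguments, and the exact ramp values (0.2, 0.4, 0.6, 0.8 for 4 steps) are my own assumptions.

```python
import math


def ground_truth_attention(n_frames, start, end, ramp=4):
    """Hypothetical target vector for one token.

    Frames in [start, end) are active (1.0). The `ramp` frames on each
    side are linearly interpolated between 0 and 1. The result is
    L2-normalized. The exact ramp shape is an assumption.
    """
    target = [0.0] * n_frames
    for i in range(n_frames):
        if start <= i < end:
            # Word is active according to the word-level timestamp.
            target[i] = 1.0
        elif start - ramp <= i < start:
            # Rising edge before the word starts.
            target[i] = (i - (start - ramp) + 1) / (ramp + 1)
        elif end <= i < end + ramp:
            # Falling edge after the word ends.
            target[i] = (end + ramp - i) / (ramp + 1)
    norm = math.sqrt(sum(v * v for v in target))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in target] if norm > 0 else target


# A word active on frames 8..11 of a 20-frame toy sequence.
vec = ground_truth_attention(n_frames=20, start=8, end=12)
```

With `ramp=4`, the nonzero region extends 4 frames past each word boundary, which is the span whose duration I am asking about.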
Shouldn't 4 steps correspond to 80 milliseconds here, not 8? As I understand it, one encoder frame covers 20 milliseconds:
30 * 1000 / 1500 = 20
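The same arithmetic as a quick check, assuming a 30-second input window mapped to 1500 encoder frames (the numbers I used above; the constant names are just for illustration):

```python
WINDOW_MS = 30 * 1000      # assumed input window: 30 seconds, in milliseconds
ENCODER_FRAMES = 1500      # assumed number of encoder output frames
RAMP_STEPS = 4             # interpolation steps on each side

ms_per_frame = WINDOW_MS / ENCODER_FRAMES
ramp_ms = RAMP_STEPS * ms_per_frame

print(ms_per_frame, ramp_ms)  # 20.0 80.0
```

If those assumptions hold, the interpolation window is 80 milliseconds per side, so "8 milliseconds" looks like a typo.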