Hi @wenbowen123, thanks for the great work!
I have a question about the Disparity Transformer. In FlashMultiheadAttention.forward, after projecting Q/K/V:
`Q = Q.view(Q.size(0), Q.size(1), self.num_heads, self.head_dim)`
As far as I can tell, this leaves Q with shape `(batch, seq_len, num_heads, head_dim)`. Since `F.scaled_dot_product_attention` expects `(..., num_heads, seq_len, head_dim)`, it seems like attention is being computed with the heads (4) as the sequence dimension, rather than over the disparity candidates.
Am I missing something here?
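To make the concern concrete, here is a minimal sketch of what I think happens. It assumes there is no transpose between the `view` and the SDPA call, which I have not verified beyond the line quoted above. The sizes and the helper names `sdpa_heads_last` / `sdpa_heads_first` are hypothetical and not taken from the repo. The check perturbs V at a single sequence position and tests whether the output at a different position changes.

```python
import torch
import torch.nn.functional as F


def sdpa_heads_last(q, k, v):
    """Call SDPA directly on (batch, seq_len, num_heads, head_dim) tensors.

    SDPA attends over the second-to-last axis, which here is num_heads.
    """
    return F.scaled_dot_product_attention(q, k, v)


def sdpa_heads_first(q, k, v):
    """Move heads in front of the sequence axis before SDPA, then move them back."""
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, dim)
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2)  # back to (batch, seq, heads, dim)


torch.manual_seed(0)
batch, seq_len, num_heads, head_dim = 2, 6, 4, 8  # hypothetical sizes

q = torch.randn(batch, seq_len, num_heads, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Perturb V at sequence position 3 only.
v_perturbed = v.clone()
v_perturbed[:, 3] += 1.0

# If attention mixes information across the sequence, the output at
# position 0 must change when V at position 3 changes.
mixes_heads_last = not torch.allclose(
    sdpa_heads_last(q, k, v)[:, 0], sdpa_heads_last(q, k, v_perturbed)[:, 0]
)
mixes_heads_first = not torch.allclose(
    sdpa_heads_first(q, k, v)[:, 0], sdpa_heads_first(q, k, v_perturbed)[:, 0]
)

print("heads-last layout mixes across sequence:", mixes_heads_last)
print("heads-first layout mixes across sequence:", mixes_heads_first)
```

With the heads-last layout, each sequence position is treated as an independent batch entry, so nothing is exchanged between disparity candidates. If that is intended, please ignore this.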
Reference: `FoundationStereo/core/submodule.py`, line 216 at commit 6e88068.