Hi, thanks for your code! I have some questions about the model.
When we construct the prototype matrix (N_l x N_p x D), each 1 x D vector in it is derived from a whole image or sentence.
However, the subsequent Cross-modal Prototype Querying and Cross-modal Prototype Responding steps look for the most suitable vector in the prototype matrix for each patch or word. Isn't this a mismatch in granularity? The prototypes are image-level and sentence-level, while the queries are patch-level and word-level.
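
To make the question concrete, here is a minimal NumPy sketch of how I understand the querying step. The sizes, the variable names (`prototypes`, `patch_feats`, `l2_normalize`), the random data, and the use of cosine similarity with a single argmax are all my own assumptions for illustration, not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: labels, prototypes per label, feature dimension.
N_l, N_p, D = 3, 4, 8
N_patch = 5  # hypothetical number of patches in one image

# Prototype matrix: each 1 x D vector stands for a whole image/sentence.
prototypes = rng.standard_normal((N_l, N_p, D))

# Local features: one D-dimensional vector per patch (or per word).
patch_feats = rng.standard_normal((N_patch, D))


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


# Assumed querying rule: each patch picks its single most similar prototype.
flat_protos = prototypes.reshape(N_l * N_p, D)
sims = l2_normalize(patch_feats) @ l2_normalize(flat_protos).T  # (N_patch, N_l * N_p)
best = sims.argmax(axis=1)                                      # (N_patch,)
queried = flat_protos[best]                                     # (N_patch, D)
```

Here every row of `patch_feats` is a local feature, while every row of `flat_protos` summarizes a global image or sentence, which is exactly the pairing I am asking about.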