@srkpmail srkpmail commented Aug 21, 2025

What does this PR do?

We have observed that more than one of the initial tokens can mismatch between tokens and answer_tokens because the tokenizer adds an extra prefix space.
This causes additional warnings about labels not being computed and makes data preprocessing inconsistent.
This PR addresses the problem by turning off the prefix space.

For instance, consider the example below.
tokens decoded:
USER: What is in the photo? ....... ASSISTANT: cute denim shift dress in blue acid wash

answer_tokens decoded:
cute denim shift dress in blue acid wash

Here tokens contains [..........., '_cute', '_den', 'im', '_shift', '_dress',........]
but answer_tokens contains [ 'c', 'ute', '_den', 'im', '_shift', '_dress',......]

The two sequences disagree on the first tokens of the answer, so the answer cannot be located inside tokens and the label computation is left incomplete.
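
A minimal sketch of why this mismatch matters, assuming labels are derived by locating answer_tokens as a contiguous run inside tokens. The helper `find_sublist` and the `<prompt>` placeholder (standing in for the elided prompt tokens) are hypothetical and only illustrate the idea; this is not the actual preprocessing code.

```python
def find_sublist(haystack, needle):
    """Return the start index of `needle` as a contiguous run in `haystack`, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


# Tail of the full-conversation tokenization from the example above.
# "<prompt>" is a hypothetical placeholder for the elided prompt tokens.
tokens = ["<prompt>", "_cute", "_den", "im", "_shift", "_dress"]

# Answer tokenized on its own, as reported in the example: the first word is split differently.
mismatched_answer = ["c", "ute", "_den", "im", "_shift", "_dress"]

# Answer tokens as they would need to look to line up with `tokens`.
aligned_answer = ["_cute", "_den", "im", "_shift", "_dress"]

mismatch_start = find_sublist(tokens, mismatched_answer)
aligned_start = find_sublist(tokens, aligned_answer)

print(mismatch_start)  # -1: the answer span is not found, so no labels can be assigned
print(aligned_start)   # 1: the answer span starts right after the placeholder
```

When the search fails, there is no span to mark as the training target, which is consistent with the "labels not computed" warnings described above.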

Collection: [multimodal]

More than one of the initial tokens can mismatch between `tokens` and `answer_tokens` because the tokenizer adds an extra prefix space.
This causes additional warnings about labels not being computed and makes the training data inconsistent.
This is addressed by turning off the prefix space.

Signed-off-by: Shiva Rama Krishna Parvatham <[email protected]>