Update sample_encoder.py #14547

srkpmail · 2025-08-21T13:36:10Z

What does this PR do?

We have observed that more than one of the initial tokens can be mismatched between tokens and answer_tokens because of the tokenizer's additional prefix space.
This leads to additional warnings that labels were not computed, and to inconsistency in data preprocessing.
This PR addresses the issue by turning off the prefix space.

For instance, consider the following example.
tokens decoded:
USER: What is in the photo? ....... ASSISTANT: cute denim shift dress in blue acid wash

answer_tokens decoded:
cute denim shift dress in blue acid wash

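Here tokens contains [..........., '_cute', '_den', 'im', '_shift', '_dress',........]
but answer_tokens contains [ 'c', 'ute', '_den', 'im', '_shift', '_dress',......]

The first answer word is split differently in the two sequences, so answer_tokens cannot be matched inside tokens and the labels are computed incompletely.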
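To make the failure mode concrete, here is a minimal sketch that assumes labels are derived by locating the answer tokens as a contiguous run inside the full token sequence. This is an illustration only, not the code in sample_encoder.py: the helper name `find_subsequence` and the `<prompt>` placeholder (standing in for the elided prompt tokens) are hypothetical, and the token fragments are copied from the example above.

```python
from typing import List, Optional


def find_subsequence(haystack: List[str], needle: List[str]) -> Optional[int]:
    """Return the start index of the first contiguous match of needle, or None."""
    if not needle:
        return None
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return None


# "<prompt>" is a hypothetical placeholder for the elided prompt tokens.
tokens = ["<prompt>", "_cute", "_den", "im", "_shift", "_dress"]

# Answer tokenized inconsistently with the full sequence (as in the report).
answer_mismatched = ["c", "ute", "_den", "im", "_shift", "_dress"]

# Answer tokenized consistently with the full sequence.
answer_consistent = ["_cute", "_den", "im", "_shift", "_dress"]

mismatched_start = find_subsequence(tokens, answer_mismatched)
consistent_start = find_subsequence(tokens, answer_consistent)

print("mismatched:", mismatched_start)  # no match, so no label span is found
print("consistent:", consistent_start)  # match starts right after the prompt
```

With the mismatched tokenization the search returns no position, which corresponds to the "labels not computed" warning described above. Once both sequences are tokenized the same way, the answer span is found directly after the prompt.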

Collection: [multimodal]

More than one of the initial tokens is mismatched between `tokens` and `answer_tokens` due to the tokenizer's additional prefix space. This leads to additional warnings about labels not being computed and to inconsistency in data training. This is addressed by turning off the prefix space. Signed-off-by: Shiva Rama Krishna Parvatham <[email protected]>

Signed-off-by: srkpmail <[email protected]>

github-actions bot added the Multi Modal label Aug 21, 2025

srkpmail and others added 3 commits August 21, 2025 13:36

Apply isort and black reformatting

01d4a13

Signed-off-by: srkpmail <[email protected]>

Merge branch 'main' into patch-1

81921bb

Merge branch 'main' into patch-1

0f16be1
