Conversation

@ebezzam (Contributor) commented Nov 26, 2025

What does this PR do?

This PR standardizes and cleans up the image-text-to-text pipeline to use the chat template logic from the base pipeline class.

TODO:

@ebezzam ebezzam marked this pull request as draft November 26, 2025 18:09
@github-actions:

[For maintainers] Suggested jobs to run (before merge)

run-slow: bark

@ebezzam (Contributor, Author) left a comment

@zucchini-nlp I started a PR on how image-text-to-text can be further cleaned up! What do you think about altogether removing the images input to __call__? That could significantly clean up the logic there

Don't mind the changes in the other files; I was waiting for #42326 to be merged (previously blocked by an unrelated failing test) — it's merged now!


```python
# encourage the user to use the chat format if supported
if getattr(self.processor, "chat_template", None) is not None:
    logger.warning_once(
```

@ebezzam commented Nov 26, 2025:
Should we be stricter and force chat template usage?

Comment on lines -293 to -287
```python
# We have one or more prompts in list-of-dicts format, so this is chat mode
if isinstance(text[0], dict):
    return super().__call__(Chat(text), **kwargs)
else:
    chats = [Chat(chat) for chat in text]  # 🐈 🐈 🐈
    return super().__call__(chats, **kwargs)
```
@ebezzam (Contributor, Author):
Can let base class handle casting to Chat
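For illustration, the duplicated detect-and-wrap pattern being hoisted into the base class looks roughly like this in isolation (a minimal sketch; `Chat` stands in for the pipeline's chat wrapper and `to_chats` is an illustrative name, not the actual API):

```python
# Minimal sketch of the detect-and-wrap logic the base class can own.
# `Chat` is a stand-in for the pipeline's chat wrapper class.
class Chat:
    def __init__(self, messages):
        self.messages = messages

def to_chats(text):
    """Wrap a single chat (list of message dicts) or a batch of chats."""
    if isinstance(text[0], dict):
        # One conversation: a list of {"role": ..., "content": ...} dicts
        return Chat(text)
    # A batch: a list of conversations
    return [Chat(chat) for chat in text]

single = [{"role": "user", "content": "hi"}]
batch = [single, single]
```

With this in the base `__call__`, subclasses no longer need their own `isinstance(text[0], dict)` branches.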

Comment on lines -303 to -297
```python
# We have one or more prompts in list-of-dicts format, so this is chat mode
if isinstance(images[0], dict):
    return super().__call__(Chat(images), **kwargs)
else:
    chats = [Chat(image) for image in images]  # 🐈 🐈 🐈
    return super().__call__(chats, **kwargs)
```
@ebezzam (Contributor, Author):
Can let base class handle casting to Chat

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ebezzam ebezzam changed the title Clean up image-text-to-text pipeline 🚨 Clean up image-text-to-text pipeline Nov 26, 2025
@ebezzam (Contributor, Author) left a comment
@zucchini-nlp I merged with main after #42326 was merged, and did some more cleanup/fixes 🙈 including reintroducing the OpenAI chat format conversion. By the way, the latest OpenAI format seems to differ from what was in the tests (explanations/links in my comments).

I also fixed some tests 🙂

```python
@slow
def test_small_model_pt_token_text_only(self):
    pipe = pipeline("any-to-any", model="google/gemma-3n-E4B-it")
    text = "What is the capital of France? Assistant:"
```
@ebezzam (Contributor, Author):
switched to chat template usage

Member:

we still need a test to check if the pipe can work in text-only mode imo. Users might want to pass only text sometimes and continue a multi-turn conversation

@ebezzam (Contributor, Author):
was there already a test for this? or did I remove it accidentally? 🙈

```python
@require_torch
def test_small_model_pt_token_text_only(self):
    pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
    text = "What is the capital of France? Assistant:"
```
@ebezzam (Contributor, Author):
switched to chat template usage

Comment on lines -144 to -145
```python
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"
```
@ebezzam (Contributor, Author):
switched to chat template usage

Comment on lines +354 to +357
"type": "input_image",
"image_url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
},
{"type": "text", "text": "Describe this image in one sentence."},
{"type": "input_text", "text": "Describe this image in one sentence."},
Member:
Hmm, right! We never got bug reports on this, and I just realized we don't promote the OpenAI format in the docs. Makes me wonder: should we keep supporting it?

@ebezzam (Contributor, Author):
I would lean towards no, for less maintenance overhead (especially if OpenAI does change their structure).

If people didn't complain about it not working, I can only guess that it wasn't used much.

@Rocketknight1 may have thoughts on this?

Comment on lines +587 to +596
```python
# Convert OpenAI fields to Transformers fields
for content in message["content"]:
    if isinstance(content, dict):
        content_type = content.get("type")
        # (27 Nov 2025) Image/vision fields: https://platform.openai.com/docs/guides/images-vision
        if content_type == "input_image":
            content["type"] = "image"
            content["image"] = content.pop("image_url")
        if content_type == "input_text":
            content["type"] = "text"
```
@ebezzam commented Nov 27, 2025:
Re-introduced the conversion (deleted in #42359 (review)) according to the structure of the input dict on this page: https://platform.openai.com/docs/guides/images-vision?api-mode=responses#giving-a-model-images-as-input

@ebezzam (Contributor, Author):
Linked to this thread if we keep OpenAI conversion: https://github.com/huggingface/transformers/pull/42430/files#r2568843132
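For reference, the conversion logic in the snippet above can be exercised in isolation. A minimal sketch (the helper name and URL are illustrative, not the pipeline's actual API):

```python
def convert_openai_message(message):
    """Map OpenAI-style content fields ("input_image"/"input_text") to the
    Transformers chat format ("image"/"text"), mirroring the diff above.
    The function name is illustrative only."""
    for content in message["content"]:
        if isinstance(content, dict):
            content_type = content.get("type")
            if content_type == "input_image":
                content["type"] = "image"
                content["image"] = content.pop("image_url")
            if content_type == "input_text":
                content["type"] = "text"
    return message

msg = {
    "role": "user",
    "content": [
        {"type": "input_image", "image_url": "https://example.com/cat.png"},
        {"type": "input_text", "text": "Describe this image."},
    ],
}
converted = convert_openai_message(msg)
```

Note the conversion mutates in place, so callers who need the original OpenAI-shaped dict would have to copy it first.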

@ebezzam commented Nov 27, 2025

The last commit is more breaking: it enforces chat template usage if the model supports it.

@ebezzam ebezzam marked this pull request as ready for review November 27, 2025 15:10
```python
self,
image: Union[str, "Image.Image"] | None = None,
text: str | None = None,
images: Union[str, "Image.Image"] | None = None,
```
Member:
Can be a bit breaking if someone used to pass positional args, but I see the reason behind it. Probably can be squeezed into the v5 release.

Also, we'd need to update the docs; AFAIR there are examples with positional args for the pipe.

@ebezzam (Contributor, Author):
Definitely breaking!

It was an idea for phasing out the images argument and nudging chat template usage (linked to this thread), since I thought a chat makes more sense passed as the text argument rather than the images argument, enabling usage like:

```python
outputs = pipe(chat_template)
```

rather than:

```python
outputs = pipe(text=chat_template)
```

But we could also keep the original order and only support chat templates passed to images (not to text), namely adjusting this logic to raise an error if images is not a chat and text is not None.

Which would be similar to this code path in the original:

```python
elif text is None and _is_chat(images):
```

Comment on lines +270 to +274
```
if getattr(self.processor, "chat_template", None) is not None:
    if images is not None:
        raise ValueError(
            "Invalid input: you passed `chat` and `images` as separate input arguments. "
            "Images must be placed inside the chat message's `content`. For example, "
            "The model supports chat templates and you passed an `images` argument. A chat template must be "
            "passed with images placed inside the chat message's `content`. For example, "
```
Member:
As discussed internally, I'm not super sure about this. Pipes are designed for beginners, so I wouldn't expect them to pass custom-formatted prompts.

It'd be nice to get more opinions here, maybe @molbap @Rocketknight1 ?

@ebezzam (Contributor, Author):
I understand, and yeah, it's a balance between convenience (for users) and maintenance (for us).

For TTS, we decided (for now) to not have audio inputs (e.g. for voice cloning) so that the __call__ method is much leaner:

```python
def __call__(self, text_inputs, **forward_params):
```

and let the model's chat template do its work to properly extract the audio/text before calling the processor.

But we also never had audio as input so it wasn't a breaking change 🙏

If the chat template usage is well-documented, I guess it should still be fine for beginners? But I leave for you to decide, as you are way more knowledgeable about the image-text-to-text models and how the community prefers using them!
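For context, the chat format being nudged here (images inside the message's content rather than a separate images argument) looks roughly like this; the URL and wording are illustrative, not from the PR:

```python
# Illustrative chat for an image-text-to-text pipeline: the image lives
# inside the message's `content`, not in a separate `images` argument.
chat = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With this structure, a call like `pipe(chat)` needs no `images` kwarg.
content_types = [part["type"] for part in chat[0]["content"]]
```

If this shape is well documented, beginners can copy it directly without ever touching the images argument.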


