🚨 Clean up image-text-to-text pipeline
#42430
base: main
Conversation
Co-authored-by: Anton Vlasjuk <[email protected]>
[For maintainers] Suggested jobs to run (before merge): run-slow: bark
@zucchini-nlp I started a PR on how `image-text-to-text` can be further cleaned up! What do you think about altogether removing the `images` input to `__call__`? That could significantly clean up the logic there.
Don't mind the changes on the other files: ~~I'm waiting for #42326 to be merged, but currently blocked because of unrelated failing tests~~ it's merged!
```python
# encourage the user to use the chat format if supported
if getattr(self.processor, "chat_template", None) is not None:
    logger.warning_once(
```
Should we be stricter and force chat template usage?
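For context, the chat format the template path expects is a list of role/content message dicts. A minimal sketch (the field names follow the transformers chat convention; the image path is the COCO fixture used elsewhere in these tests):

```python
# Sketch of the chat-format input the warning nudges users toward.
# The image path is the COCO fixture used in the tests below.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./tests/fixtures/tests_samples/COCO/000000039769.png"},
            {"type": "text", "text": "What is this?"},
        ],
    }
]
# Passing this list directly (instead of separate `images`/`text` args)
# routes the call through the chat-template code path.
content_types = [entry["type"] for entry in messages[0]["content"]]
```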
```python
# We have one or more prompts in list-of-dicts format, so this is chat mode
if isinstance(text[0], dict):
    return super().__call__(Chat(text), **kwargs)
else:
    chats = [Chat(chat) for chat in text]  # 🐈 🐈 🐈
    return super().__call__(chats, **kwargs)
```
Can let base class handle casting to Chat
```python
# We have one or more prompts in list-of-dicts format, so this is chat mode
if isinstance(images[0], dict):
    return super().__call__(Chat(images), **kwargs)
else:
    chats = [Chat(image) for image in images]  # 🐈 🐈 🐈
    return super().__call__(chats, **kwargs)
```
Can let base class handle casting to Chat
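A minimal sketch of what "let the base class handle the casting" could look like (the `Chat` wrapper and function names here are hypothetical; the real base class may differ):

```python
# Hypothetical sketch: the base class normalizes chat-style inputs to Chat
# objects once, so subclasses can drop the duplicated branching above.
class Chat:
    def __init__(self, messages):
        self.messages = messages

def normalize_to_chat(inputs):
    """Wrap a single conversation (list of dicts) or a batch of
    conversations (list of lists of dicts) in Chat objects."""
    if isinstance(inputs, list) and inputs:
        if isinstance(inputs[0], dict):
            return Chat(inputs)               # one conversation
        if isinstance(inputs[0], list):
            return [Chat(c) for c in inputs]  # a batch of conversations
    return inputs  # not chat-shaped; leave untouched

single = normalize_to_chat([{"role": "user", "content": "hi"}])
batch = normalize_to_chat([[{"role": "user", "content": "hi"}]])
```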
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@zucchini-nlp I merged with main after #42326 was merged, and did some more cleanup/fixes 🙈
including reintroducing the OpenAI chat format conversion. Btw, the latest version seems to be different from what was currently in the tests? (explanations/links in my comments)
I also fixed some tests 🙂
```python
@slow
def test_small_model_pt_token_text_only(self):
    pipe = pipeline("any-to-any", model="google/gemma-3n-E4B-it")
    text = "What is the capital of France? Assistant:"
```
switched to chat template usage
we still need a test to check if the pipe can work in text-only mode imo. Users might want to pass only text sometimes and continue a multi-turn conversation
was there already a test for this? or did I remove it accidentally? 🙈
```python
@require_torch
def test_small_model_pt_token_text_only(self):
    pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
    text = "What is the capital of France? Assistant:"
```
switched to chat template usage
```python
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"
```
switched to chat template usage
| "type": "input_image", | ||
| "image_url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg", | ||
| }, | ||
| {"type": "text", "text": "Describe this image in one sentence."}, | ||
| {"type": "input_text", "text": "Describe this image in one sentence."}, |
Perhaps OpenAI structure changed since the test was created? https://platform.openai.com/docs/guides/images-vision?api-mode=responses#giving-a-model-images-as-input
hmm, right! We never got bug reports on this, and I just realized we don't promote the OpenAI format in the docs. Makes me wonder, should we keep supporting it?
I would lean towards no for less maintenance overhead? (esp if indeed OpenAI changes their structure)
If people didn't complain about it not working, I can only guess that it wasn't used much.
@Rocketknight1 may have thoughts on this?
```python
# Convert OpenAI fields to Transformers fields
for content in message["content"]:
    if isinstance(content, dict):
        content_type = content.get("type")
        # (27 Nov 2025) Image/vision fields: https://platform.openai.com/docs/guides/images-vision
        if content_type == "input_image":
            content["type"] = "image"
            content["image"] = content.pop("image_url")
        if content_type == "input_text":
            content["type"] = "text"
```
re-introduced the conversion (deleted in #42359 (review)) according to the structure of the input dict on this page: https://platform.openai.com/docs/guides/images-vision?api-mode=responses#giving-a-model-images-as-input
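Applied to a message in the OpenAI shape from that page, the conversion behaves like this (the helper name here is illustrative; in the PR the loop lives inline in the pipeline):

```python
# Illustrative wrapper around the OpenAI -> Transformers field conversion:
# "input_image"/"image_url" become "image", "input_text" becomes "text".
def convert_openai_content(message):
    for content in message["content"]:
        if isinstance(content, dict):
            content_type = content.get("type")
            if content_type == "input_image":
                content["type"] = "image"
                content["image"] = content.pop("image_url")
            if content_type == "input_text":
                content["type"] = "text"
    return message

msg = {
    "role": "user",
    "content": [
        {
            "type": "input_image",
            "image_url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
        },
        {"type": "input_text", "text": "Describe this image in one sentence."},
    ],
}
converted = convert_openai_content(msg)
```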
Linked to this thread if we keep OpenAI conversion: https://github.com/huggingface/transformers/pull/42430/files#r2568843132
The last commit is more breaking: it enforces chat template usage if the model supports it.
```python
self,
image: Union[str, "Image.Image"] | None = None,
text: str | None = None,
images: Union[str, "Image.Image"] | None = None,
```
can be a bit breaking if someone used to pass positional args, but I see the reason behind it. Prob can be squeezed into the v5 release.
Also, we'd need to update the docs; afair there are examples with positional args for the pipe.
Definitely breaking!
It was an idea for phasing out the `images` argument and nudging chat template usage (linked to this thread). Because I thought it makes more sense as the `text` argument rather than the `images` argument, for usage like:

```python
outputs = pipe(chat_template)
```

rather than:

```python
outputs = pipe(text=chat_template)
```

But we could also keep the original order and only support chat templates passed to `images` and not to `text`, namely adjusting this logic to raise an error if `images` is not a chat and `text` is not None.
Which would be similar to this code path in the original:

```python
elif text is None and _is_chat(images):
```
```diff
 if getattr(self.processor, "chat_template", None) is not None:
     if images is not None:
         raise ValueError(
-            "Invalid input: you passed `chat` and `images` as separate input arguments. "
-            "Images must be placed inside the chat message's `content`. For example, "
+            "The model supports chat templates and you passed an `images` argument. A chat template must be "
+            "passed with images placed inside the chat message's `content`. For example, "
```
As discussed internally, not super sure about this. Pipes are designed for beginners so I wouldn't expect them to pass custom formatted prompts
It'd be nice to get more opinions here, maybe @molbap @Rocketknight1 ?
I understand, and yeah, it's a balance between convenience (for users) and maintenance (for us).
For TTS, we decided (for now) not to have audio inputs (e.g. for voice cloning) so that the `__call__` method is much leaner:

```python
def __call__(self, text_inputs, **forward_params):
```

and let the model's chat template do its work to properly extract the audio/text before calling the processor.
But we also never had audio as input, so it wasn't a breaking change 🙏
If the chat template usage is well-documented, I guess it should still be fine for beginners? But I leave it for you to decide, as you are way more knowledgeable about the image-text-to-text models and how the community prefers using them!
What does this PR do?
This PR standardizes / cleans up the `image-text-to-text` pipeline to use the chat template logic from the base pipeline class.

TODO:
- `add_images_to_messages` (see here)