
Conversation

@ebezzam (Contributor) commented Nov 21, 2025

What does this PR do?

Processor usage within the TTS pipeline is faulty.

Moreover, it does not support chat-template inputs with/without audio, which is of interest for the following models for generating conversations:

Example usage for CSM, and inputs like the following for VibeVoice:

conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello everyone, and welcome to the VibeVoice podcast. I'm your host, Linda, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Thomas here to talk about it with me."},
        {"type": "audio", "path": https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav}
    ]},
    {"role": "1", "content": [
        {"type": "text", "text": "Thanks so much for having me, Linda. You're absolutely right—this question always brings out some seriously strong feelings."},
        {"type": "audio", "path": https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Frank_man.wav}
    ]},
    {"role": "0", "content": [
        {"type": "text", "text": "Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the Finals, six championships. That kind of perfection is just incredible."},
    ]},
    {"role": "1", "content": [
        {"type": "text", "text": "Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it"},
    ]},
]

This PR is related to #39796, but this one is different/simpler (and I'd say more urgent) in its objective 👉 fixing processor usage and enabling chat_template inputs like above.

So I think it's worth a separate PR as #39796 requires more testing/review.

Context for current error

This line fails when trying to use the processor because it isn’t loaded (so it is None instead)

Minimal failing example:

import torch
from transformers import pipeline

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-speech", model=model_id, device=device, no_processor=False)

# prepare the inputs
text = "[0]Hello from Sesame." # `[0]` for speaker id 0

# apply pipeline
output = pipe(text)

"""
Traceback (most recent call last):
  File "/home/eric_bezzam/transformers/src/transformers/pipelines/test_pipeline_chat_template.py", line 34, in <module>
    output = pipe(text)
             ^^^^^^^^^^
  File "/home/eric_bezzam/transformers/src/transformers/pipelines/text_to_audio.py", line 218, in __call__
    return super().__call__(text_inputs, **forward_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric_bezzam/transformers/src/transformers/pipelines/base.py", line 1261, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric_bezzam/transformers/src/transformers/pipelines/base.py", line 1267, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric_bezzam/transformers/src/transformers/pipelines/text_to_audio.py", line 153, in preprocess
    output = preprocessor(text, **kwargs, return_tensors="pt")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
"""

After changes

No need to explicitly ask for processor usage (auto-detected internally).

TTS example

import torch
from transformers import pipeline
import soundfile as sf

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-speech", model=model_id, device=device)

# prepare the inputs
text = "[0]Hello from Sesame." # `[0]` for speaker id 0

# apply pipeline
output = pipe(text, generate_kwargs={"output_audio": True})

# save the audio to a file
audio = output["audio"][0].squeeze()
fn = "csm_pipeline_output.wav"
sf.write(fn, audio, output["sampling_rate"])
print(f"Audio saved to {fn}")

Conversation example with a chat template, which was not possible before! Chat inputs are again auto-detected, like in the text-generation pipeline. The example below mimics this usage from CSM:

import torch
from transformers import pipeline
import soundfile as sf
from datasets import load_dataset, Audio

model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-to-speech", model=model_id, device=device)

# prepare the inputs like here: https://huggingface.co/sesame/csm-1b#csm-sounds-best-when-provided-with-context
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

# apply pipeline
output = pipe(conversation, generate_kwargs={"output_audio": True})

# save the audio to a file
audio = output["audio"][0].squeeze()
fn = "csm_pipeline_output_chat.wav"
sf.write(fn, audio, output["sampling_rate"])
print(f"Audio saved to {fn}")

@Rocketknight1 since it is pipeline-related you can take a look if you want! But will definitely ask @vasqu and @eustlb for audio-specific feedback 🙂

@ebezzam ebezzam added the Audio label Nov 21, 2025
Contributor Author

@ebezzam ebezzam left a comment

@vasqu Self-review with pointers that hopefully help!

Comment on lines 33 to 49
# Copied from transformers.pipelines.text_generation
ChatType = list[dict[str, str]]


# Copied from transformers.pipelines.text_generation
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages



Comment on lines 254 to 274
if isinstance(
    text_inputs,
    (list, tuple, types.GeneratorType)
    if is_torch_available()
    else (list, tuple, types.GeneratorType),
):
    if isinstance(text_inputs, types.GeneratorType):
        text_inputs, _ = itertools.tee(text_inputs)
        text_inputs, first_item = (x for x in text_inputs), next(_)
    else:
        first_item = text_inputs[0]
    if isinstance(first_item, (list, tuple, dict)):
        # We have one or more prompts in list-of-dicts format, so this is chat mode
        if isinstance(first_item, dict):
            return super().__call__(Chat(text_inputs), **forward_params)
        else:
            chats = (Chat(chat) for chat in text_inputs)
            if isinstance(text_inputs, types.GeneratorType):
                return super().__call__(chats, **forward_params)
            else:
                return super().__call__(list(chats), **forward_params)
Contributor Author

Similarly copied from text generation.

elif isinstance(audio, tuple):
    waveform = audio[0]
else:
    waveform = self.processor.decode(audio)
Contributor Author

This was breaking for CSM.

Contributor

Iirc it was for Dia but it has been broken way too often.

Contributor

Reopening because I want to verify that Dia works with this current version. I'm pretty sure we need the processor to decode for Dia, which is why I wrote the initial long message on how we plan to standardize:

  • Everything handled by the model, with the audio tokenizer already inside it
  • Separate model / tokenizer, where the processor handles encoding/decoding into codebooks/waveform

Contributor Author

@ebezzam ebezzam Nov 25, 2025

Good point, Dia does need the processor for decoding and I'll also add a unit test for Dia so we don't miss this in the future.

However, I feel like a blanket self.processor.decode might be too broad. For example, CSM and VibeVoice don't require the processor to decode.

Since there is no standard approach (yet), how about something like below (which works):

  if isinstance(audio, dict):
      waveform = audio[waveform_key]
  elif isinstance(audio, tuple):
      waveform = audio[0]
  elif self.model.config.model_type in ["dia"]:
      # models that require decoding, e.g. with codec
      waveform = self.processor.decode(audio)
  else:
      waveform = audio

Contributor Author

@ebezzam ebezzam Nov 25, 2025

Example usage:

from transformers import pipeline
import soundfile as sf

model_checkpoint = "nari-labs/Dia-1.6B-0626"  # assumed Dia checkpoint, for illustration

dia_pipeline = pipeline(
    "text-to-audio", model=model_checkpoint,
)
outputs = dia_pipeline(
    "[S1] Dia is an open weights text to dialogue model.",
    generate_kwargs={"max_new_tokens": 512},
)
assert outputs["sampling_rate"] == 44100

audio = outputs["audio"].squeeze()
fn = "dia_pipeline_output.wav"
sf.write(fn, audio, outputs["sampling_rate"])
print(f"Audio saved to {fn}")

I'm reluctant to allow voice cloning through the pipeline, as this would require passing an audios input to the pipeline (since Dia doesn't support chat templates).

Moreover, allowing inputs like audios is exactly what they are trying to phase out with image-text-to-text in #42359 (to only support chat template usage).

Contributor

That's a good point on voice cloning, we should maybe update Dia with a chat template in the future. I did not have that in mind at that point, that's on me.

Re: standards. Yea, we have no choice atm - it's more of a question on how we handle this in the future

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ebezzam ebezzam requested a review from vasqu November 21, 2025 15:03
@ebezzam (Contributor Author) commented Nov 21, 2025

Failed tests are unrelated; concatenated below:

# https://app.circleci.com/pipelines/github/huggingface/transformers/154274/workflows/48600cd6-5d2a-455a-a2e0-96f3b9bae8ae/jobs/2028466
FAILED tests/models/efficientloftr/test_image_processing_efficientloftr.py::EfficientLoFTRImageProcessingTest::test_post_processing_keypoint_matching_with_padded_match_indices - AssertionError: 2 != 1
===== 1 failed, 549 passed, 362 skipped, 24 warnings in 115.22s (0:01:55) ======

# https://app.circleci.com/pipelines/github/huggingface/transformers/154274/workflows/48600cd6-5d2a-455a-a2e0-96f3b9bae8ae/jobs/2028474
FAILED tests/models/resnet/test_modeling_resnet.py::ResNetModelTest::test_can_load_ignoring_mismatched_shapes - AssertionError: 0.14109472930431366 not less than or equal to 0.1 : Issue with classifier.1.bias
==== 1 failed, 3959 passed, 6611 skipped, 93 warnings in 187.89s (0:03:07) =====

# https://app.circleci.com/pipelines/github/huggingface/transformers/154274/workflows/48600cd6-5d2a-455a-a2e0-96f3b9bae8ae/jobs/2028467
FAILED tests/models/olmo/test_modeling_olmo.py::OlmoModelTest::test_generate_with_static_cache - AssertionError: False is not true
===== 1 failed, 625 passed, 218 skipped, 16 warnings in 118.43s (0:01:58) ======

Contributor

@vasqu vasqu left a comment

Left some comments 🤗 I think it's overall fine, mostly nits to have more alignment with other pipelines.

My only gripe is that we still don't have a standard way for how models generate audio. For example, CSM directly generates the audio waveform, but that's because it uses the audio tokenizer directly within the model itself. Dia does not do that and depends on the processor to decode into a waveform. This is something we have to properly enforce at some point before we get too many exceptions. cc @eustlb

Comment on lines 36 to 46
# Copied from transformers.pipelines.text_generation.Chat
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages
Contributor

I feel like chat templates with any modality are increasingly important; it might be better to move this somewhere more general (and let it be imported). We're gonna have audio, image, and text already at this point.

Contributor

And I would like to avoid copied from tbh

Contributor Author

@ebezzam ebezzam Nov 21, 2025

Currently two other pipelines are using this (text-generation and image-text-to-text).

We could combine them into a single Chat object in base.py, like so?

class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(
        self, messages: dict, images: Union[str, list[str], "Image.Image", list["Image.Image"]] | None = None
    ):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        if images is not None:
            messages = add_images_to_messages(messages, images)

        self.messages = messages

For audio models with @eustlb, our chat templates already allow audio in the message, and the Jinja template (like this and this) handles extracting audio when calling apply_chat_template (which then calls the processor internally, see here).
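For illustration, a rough sketch of that flow with CSM (mirroring the model-card usage referenced above; the exact kwargs here are assumptions, so treat this as illustrative rather than as this PR's code):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("sesame/csm-1b")

conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello from Sesame."},
        # audio entries are pulled out by the chat template and fed to the audio processor
        {"type": "audio", "path": "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav"},
    ]},
]

# render + tokenize in one call: the jinja template extracts the audio and the
# processor returns token ids alongside the audio features
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
print(inputs.keys())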

Contributor

SGTM, @Rocketknight1 @zucchini-nlp wdyt about this? Not sure who has a better overview on (image) pipelines.

Member

Yes, you can move it to base.py or utils/chat_template_utils.py, whichever is a cleaner import. Not a huge issue either way, though!

Contributor Author

Thanks @Rocketknight1! I've put it in utils/chat_template_utils.py for now.
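For illustration, a minimal sketch of what the shared import could look like (the module path and class name just follow the comment above and may differ in the final diff):

# hypothetical import location, per the comment above; check the merged PR for the final path
from transformers.utils.chat_template_utils import Chat

messages = [{"role": "0", "content": [{"type": "text", "text": "Hello from Sesame."}]}]
chat = Chat(messages)  # raises ValueError if a message is missing "role" or "content"
print(chat.messages)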

Contributor Author

For image-text-to-text's current Chat object (here), it could be nice to do something like below in the general purpose Chat to support the images input:

class Chat:
    """This class is intended to just be used internally for pipelines and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict, images: Union[str, list[str], "Image.Image", list["Image.Image"]] | None = None):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        
        if images is not None:
            messages = add_images_to_messages(messages, images)
        self.messages = messages

But actually this fails for a current usage 👉 when there is an image URL in the chat template (this code path).

FYI, I found out about this edge case because this test would fail when I tried a modified Chat object like the one above (the image wouldn't be properly loaded).

@zucchini-nlp do you have an idea of how image-text-to-text could also use the general-purpose Chat object without having to call add_images_to_messages, so that the current double for-loop is avoided when there is indeed no image input?

Contributor Author

@vasqu, @Rocketknight1 for reference, @zucchini-nlp started a PR to handle image-text-to-text, so I won't touch it in this PR.

Comment on lines 257 to 271
if isinstance(text_inputs, types.GeneratorType):
    text_inputs, _ = itertools.tee(text_inputs)
    text_inputs, first_item = (x for x in text_inputs), next(_)
else:
    first_item = text_inputs[0]
if isinstance(first_item, (list, tuple, dict)):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(first_item, dict):
        return super().__call__(Chat(text_inputs), **forward_params)
    else:
        chats = (Chat(chat) for chat in text_inputs)
        if isinstance(text_inputs, types.GeneratorType):
            return super().__call__(chats, **forward_params)
        else:
            return super().__call__(list(chats), **forward_params)
Contributor

I honestly like this, it's short and sweet, but I feel like we should maybe align with the image pipeline, e.g.

def _is_chat(arg):
    return isinstance(arg, (list, tuple, KeyDataset)) and isinstance(arg[0], (list, tuple, dict))


if _is_chat(text):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(text[0], dict):
        return super().__call__(Chat(text, images), **kwargs)
    else:
        if images is None:
            images = [None] * len(text)
        chats = [Chat(chat, image) for chat, image in zip(text, images)]  # 🐈 🐈 🐈
        return super().__call__(chats, **kwargs)
# Same as above, but the `images` argument contains the chat. This can happen e.g. if the user only passes a
# chat as a positional argument.
elif text is None and _is_chat(images):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(images[0], dict):
        return super().__call__(Chat(images), **kwargs)
    else:
        chats = [Chat(image) for image in images]  # 🐈 🐈 🐈
        return super().__call__(chats, **kwargs)

We're cooking all our own soup 😢

Only thing I'd change would be to avoid using a wildcard _ and give it an explicit name instead

Contributor Author

As discussed on Slack, the __call__ logic of image-text-to-text may be more complicated because they allow users to pass images as separate arguments rather than keeping everything within the chat template?

Contributor

I thought we were breaking it tho for v5? Or did I misunderstand something?

Contributor Author

@ebezzam ebezzam Nov 25, 2025

What they are planning to break for v5 in image-text-to-text is removing inputs like images for the pipeline, so that users stick to the chat template. See #42359

But perhaps the __call__ logic could still be simplified and shifted to base.py. Let me see...


@vasqu (Contributor) commented Nov 21, 2025

Failing tests are flaky (the loftr one is handled elsewhere)

@ebezzam ebezzam requested a review from vasqu November 24, 2025 14:14
Contributor

@vasqu vasqu left a comment

Added some smaller comments. Imo we should still check how we standardize (not necessarily in this PR, but we should keep this in mind / discuss before we lose sight of it)

I think there are 2 somewhat bigger things:

  • seamless should be a different PR?
  • check if Dia works

Comment on lines 243 to 244
text_inputs, _ = itertools.tee(text_inputs)
text_inputs, first_item = (x for x in text_inputs), next(_)
Contributor

Let's exchange the wildcard tho, i.e. _ - trying to be explicit here



Contributor Author

@ebezzam ebezzam left a comment

@vasqu thanks for your comments!

  • I tried going directly for an approach that puts common chat template logic into the base pipeline object. Let me know if we should rather focus just on text-to-audio and do such standardization in a separate PR.
  • Double-checked Dia. You're right self.processor.decode was needed for it, but I've adapted the (previous) logic so it doesn't assume such a call is needed for all models that have a processor (e.g. CSM and VibeVoice don't need to call this)

Comment on lines 1208 to 1229
# Detect if inputs is a chat-style input and cast as `Chat` or list of `Chat`
if isinstance(
    inputs,
    (list, tuple, types.GeneratorType, KeyDataset)
    if is_torch_available()
    else (list, tuple, types.GeneratorType),
):
    if isinstance(inputs, types.GeneratorType):
        gen_copy1, gen_copy2 = itertools.tee(inputs)
        inputs = (x for x in gen_copy1)
        first_item = next(gen_copy2)
    else:
        first_item = inputs[0]
    if isinstance(first_item, (list, tuple, dict)):
        if isinstance(first_item, dict):
            inputs = Chat(inputs)
        else:
            chats = (Chat(chat) for chat in inputs)
            if isinstance(inputs, types.GeneratorType):
                inputs = chats
            else:
                inputs = list(chats)
Contributor Author

@vasqu an idea for standardizing chat template usage in the base class.

Essentially, this was in text-generation and is what I had copied into text-to-audio (in what you last saw), and it could potentially be used by image-text-to-text once it drops support for the images input (@zucchini-nlp)?

The following tests run as before:

RUN_SLOW=1 pytest tests/pipelines/test_pipelines_text_to_audio.py
RUN_SLOW=1 pytest tests/pipelines/test_pipelines_text_generation.py
RUN_SLOW=1 pytest tests/pipelines/test_pipelines_image_text_to_text.py

I could also run the below if you think we should check to be safe? And do you have anything else in mind that I should double-check?

RUN_SLOW=1 pytest tests/pipelines

Contributor

I think our pipeline tests should all be non-slow tests, but it doesn't hurt to check with the env. I'm pro this!

Waiting for @zucchini-nlp in case she has anything to add; if I see it correctly it depends on #42359

Comment on lines 257 to 258
# ensure audio and not codes
self.assertEqual(len(audio.shape), 1)
Contributor Author

Added such checks to make sure audio decoding is actually working

Contributor

@vasqu vasqu left a comment

I think it's already looking pretty good. We should sync with the other pipelines PR so as not to have some weird conflicting behavior

Other than that, we really should work on enforcing good standards so that we do not have to add so many exceptions (especially now with CSM/Dia). With v5, we IMO have the opportunity to break things and make it unified. cc @eustlb wdyt?



Contributor Author

@ebezzam ebezzam left a comment

@vasqu thanks for previous comments, it's getting better!

No more exceptions (for newer audio models) in this iteration. Going forward we don't need to add special cases for newer audio models if:

  • they properly use return_dict_in_generate
  • they write the audio into the generation output dict, and otherwise write codec codes that need decoding into sequences (see the sketch below)
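To make that convention concrete, here is a rough sketch of the kind of postprocessing branch it enables (the key names and helper below are illustrative assumptions, not the exact code in this diff):

def extract_waveform(generate_output, processor=None):
    """Illustrative convention: `generate` returns a dict-like output where either
    the waveform is ready under "audio", or codec codes live under "sequences"."""
    if isinstance(generate_output, dict) and "audio" in generate_output:
        # models with the audio tokenizer inside (e.g. CSM): waveform is already decoded
        return generate_output["audio"]
    # models that emit codec codes (e.g. Dia): decode them into a waveform
    codes = generate_output["sequences"] if isinstance(generate_output, dict) else generate_output
    if processor is not None:
        return processor.decode(codes)
    return codes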

Comment on lines 1241 to 1252
if isinstance(first_item, dict):
    if is_valid_chat(inputs):
        inputs = Chat(inputs)
elif isinstance(first_item, (list, tuple)):
    # materialize generator if needed
    items = list(inputs) if isinstance(inputs, types.GeneratorType) else inputs
    if all(is_valid_chat(chat) for chat in items):
        chats = (Chat(chat) for chat in items)
        if isinstance(inputs, types.GeneratorType):
            inputs = chats
        else:
            inputs = list(chats)
Contributor Author

Changed previous logic for backward compatibility: some pipelines pass a list of objects which are not necessarily a chat template (e.g. key point matching). So I only convert if the object is a valid chat.

cc @Rocketknight1 who has more experience with pipelines and may spot an edge case 🙃
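For reference, one plausible shape of such a validity check (the actual is_valid_chat helper in the diff may differ):

def is_valid_chat(messages) -> bool:
    """Return True only if `messages` looks like a chat: a non-empty sequence of
    dicts that each carry a "role" and a "content" key."""
    if not isinstance(messages, (list, tuple)) or len(messages) == 0:
        return False
    return all(
        isinstance(message, dict) and "role" in message and "content" in message
        for message in messages
    )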

Contributor Author

FYI no new failures when I run RUN_SLOW=1 pytest tests/pipelines

Member

I don't see anything it's missing! You could maybe simplify it by just always calling list(inputs) and removing the extra conditionals for generators, though? Python lists only store pointers to elements so it should be basically free in terms of speed/memory, even if the chats are big.
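A sketch of what that simplification could look like (assuming we are fine materializing generators up front and reusing the Chat / is_valid_chat helpers from the hunk above; not the exact code that landed):

import types

# materialize any generator once; then the chat detection needs no special cases
if isinstance(inputs, (list, tuple, types.GeneratorType)):
    inputs = list(inputs)
    if inputs and isinstance(inputs[0], dict) and is_valid_chat(inputs):
        inputs = Chat(inputs)
    elif inputs and isinstance(inputs[0], (list, tuple)) and all(is_valid_chat(chat) for chat in inputs):
        inputs = [Chat(chat) for chat in inputs]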

Contributor Author

@Rocketknight1 yes, I can cast as a list from the start! FYI I'll have to remove this check, which errors

Member

Yep, makes sense to me! Once we're materializing the entire generator output there's not much point in pretending we're streaming anymore.

Contributor

@vasqu vasqu left a comment

LGTM overall, just make sure to sync with main and that nothing is broken based on that

The remaining comments are nits

Contributor Author

@ebezzam ebezzam left a comment

@vasqu I removed the exceptions for bark/musicgen in _forward. I needed to fix the handling of return_dict_in_generate in modeling_bark.py

Comment on lines +654 to +657
if kwargs.get("return_dict_in_generate", False):
    semantic_output = semantic_output.sequences[:, max_input_semantic_length + 1 :]
else:
    semantic_output = semantic_output[:, max_input_semantic_length + 1 :]
Contributor Author

This and the change below are so that Bark doesn't error out when return_dict_in_generate=True is passed

Comment on lines +196 to +197
# ensure dict output to facilitate postprocessing
forward_params.update({"return_dict_in_generate": True})
Contributor Author

No more exception 🙂

Comment on lines +288 to +289
if needs_decoding and self.processor is not None:
    audio = self.processor.decode(audio)
Contributor Author

Just had to add self.processor is not None for MusicGen to pass. Makes sense since I set the processor to None 🤦


outputs = music_generator("This is a test")
self.assertEqual({"audio": ANY(np.ndarray), "sampling_rate": 32000}, outputs)
self.assertEqual(len(outputs["audio"].shape), n_ch)
Contributor Author

Added checks for MusicGen and Bark to make sure they output audio, with the cleaned-up postprocess and for the future

@ebezzam ebezzam changed the title Fix processor usage and add chat_template support to TTS pipeline. Fix processor usage, add chat_template support to TTS pipeline, and shift common chat template logic to base class. Nov 26, 2025
@ebezzam ebezzam changed the title Fix processor usage, add chat_template support to TTS pipeline, and shift common chat template logic to base class. Fix processor usage + add chat_template support to TTS pipeline, and shift common chat template logic to base class. Nov 26, 2025
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: bark

Member

@Cyrilvallez Cyrilvallez left a comment

Thanks!

@Cyrilvallez Cyrilvallez merged commit 5458d81 into huggingface:main Nov 27, 2025
21 of 23 checks passed