Fix processor usage + add chat_template support to TTS pipeline, and shift common chat template logic to base class. #42326
Conversation
ebezzam
left a comment
@vasqu Self-review with pointers that hopefully help!
# Copied from transformers.pipelines.text_generation
ChatType = list[dict[str, str]]


# Copied from transformers.pipelines.text_generation
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages
Essentially coming from text generation: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_generation.py#L17
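For reference, this is the kind of input the wrapper expects (purely illustrative, not part of the diff):

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]
chat = Chat(messages)  # raises ValueError if any dict is missing a "role" or "content" key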
if isinstance(
    text_inputs,
    (list, tuple, types.GeneratorType)
    if is_torch_available()
    else (list, tuple, types.GeneratorType),
):
    if isinstance(text_inputs, types.GeneratorType):
        text_inputs, _ = itertools.tee(text_inputs)
        text_inputs, first_item = (x for x in text_inputs), next(_)
    else:
        first_item = text_inputs[0]
    if isinstance(first_item, (list, tuple, dict)):
        # We have one or more prompts in list-of-dicts format, so this is chat mode
        if isinstance(first_item, dict):
            return super().__call__(Chat(text_inputs), **forward_params)
        else:
            chats = (Chat(chat) for chat in text_inputs)
            if isinstance(text_inputs, types.GeneratorType):
                return super().__call__(chats, **forward_params)
            else:
                return super().__call__(list(chats), **forward_params)
Similarly from text generation:
if isinstance(
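For context, a sketch of the call shapes this branch distinguishes (illustrative only; `tts` is assumed to be a text-to-audio pipeline):

tts("Read this sentence aloud.")                           # plain text mode
tts([{"role": "user", "content": "Hi there!"}])            # a single chat (list of dicts)
tts([
    [{"role": "user", "content": "First conversation."}],  # a batch of chats
    [{"role": "user", "content": "Second conversation."}],
])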
elif isinstance(audio, tuple):
    waveform = audio[0]
else:
    waveform = self.processor.decode(audio)
Was breaking for CSM
Iirc it was for Dia but it has been broken way too often.
Reopening because I want to verify that Dia works with this current version - I'm pretty sure we need the processor to decode for Dia which is why I wrote the initial long message on how we plan to standardize
- Everything handled by the model, audio tokenizer within it already
- Separate model / tokenizer, processor handles encoding/decoding into codebooks/waveform
Good point, Dia does need the processor for decoding and I'll also add a unit test for Dia so we don't miss this in the future.
However, I feel like a blanket self.processor.decode might be too broad. For example, CSM and VibeVoice don't require the processor to decode.
Since there is no standard approach (yet), how about something like below (which is working):
if isinstance(audio, dict):
    waveform = audio[waveform_key]
elif isinstance(audio, tuple):
    waveform = audio[0]
elif self.model.config.model_type in ["dia"]:
    # models that require decoding, e.g. with codec
    waveform = self.processor.decode(audio)
else:
    waveform = audio
Example usage:
from transformers import pipeline
import soundfile as sf

dia_pipeline = pipeline(
    "text-to-audio", model=model_checkpoint,
)

outputs = dia_pipeline(
    "[S1] Dia is an open weights text to dialogue model.",
    generate_kwargs={"max_new_tokens": 512},
)
assert outputs["sampling_rate"] == 44100

audio = outputs["audio"].squeeze()
fn = "dia_pipeline_output.wav"
sf.write(fn, audio, outputs["sampling_rate"])
print(f"Audio saved to {fn}")

I'm reluctant to allow voice cloning through pipeline, as this would require passing an `audios` input to pipeline (since Dia doesn't support chat templates).
Moreover, allowing inputs like `audios` is exactly what they are trying to phase out with image-text-to-text in #42359 (to only support chat template usage).
That's a good point on voice cloning, we should maybe update Dia with a chat template in the future. I did not have that in mind at that point, that's on me.
Re: standards. Yea, we have no choice atm - it's more of a question on how we handle this in the future
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Failed tests are unrelated, concatenating below:
vasqu
left a comment
Left some comments 🤗 I think it's overall fine, mostly nits to get more alignment with other pipelines.
My only gripe is that we still don't have a standard way for models to generate audio. For example, CSM directly generates the audio waveform, but that's because it uses the audio tokenizer directly within the model itself. Dia does not do that and depends on the processor to decode into a waveform. This is something we have to properly enforce at some point before we get too many exceptions cc @eustlb
# Copied from transformers.pipelines.text_generation.Chat
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages
I feel like chat templates with any modality are increasingly important, so it might be better to move this somewhere more general (and let it be imported). We're gonna have audio, image, and text already at this point.
And I would like to avoid copied from tbh
Currently two other pipelines are using this:
- text generation
- image-text-to-text, which additionally calls `add_images_to_messages`
We could combine into a single Chat object in base.py like so?
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(
        self, messages: dict, images: Union[str, list[str], "Image.Image", list["Image.Image"]] | None = None
    ):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        if images is not None:
            messages = add_images_to_messages(messages, images)
        self.messages = messages

For audio models with @eustlb, our chat templates already allow audio in the message and the jinja template (like this and this) handles extracting audio when calling `apply_chat_template` (which then calls the processor internally, see here)
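For illustration, an audio-bearing chat message could look roughly like this (the exact content fields vary per model, so treat the names as assumptions):

conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Continue speaking in this voice."},
            {"type": "audio", "path": "reference_voice.wav"},
        ],
    },
]
# apply_chat_template would then extract and featurize the audio internally.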
SGTM, @Rocketknight1 @zucchini-nlp wdyt about this? Not sure who has a better overview on (image) pipelines.
Yes, you can move it to base.py or utils/chat_template_utils.py, whichever is a cleaner import. Not a huge issue either way, though!
thanks @Rocketknight1! I've put it in utils/chat_template_utils.py for now
For image-text-to-text's current Chat object (here), it could be nice to do something like below in the general purpose Chat to support the images input:
class Chat:
    """This class is intended to just be used internally for pipelines and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict, images: Union[str, list[str], "Image.Image", list["Image.Image"]] | None = None):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        if images is not None:
            messages = add_images_to_messages(messages, images)
        self.messages = messages

But actually this fails for a current usage 👉 when there is an image URL in the chat template (this code path).
FYI, I found out about this edge case because this test would fail when I tried a modified Chat object like above (the image wouldn't be properly loaded).
@zucchini-nlp do you have an idea of how image-text-to-text could also use the general purpose Chat object without having to call add_images_to_messages to avoid the current double for-loop if there is indeed no image input?
@vasqu, @Rocketknight1 for reference @zucchini-nlp started a PR to handle image-text-to-text so I won't touch it in this PR
if isinstance(text_inputs, types.GeneratorType):
    text_inputs, _ = itertools.tee(text_inputs)
    text_inputs, first_item = (x for x in text_inputs), next(_)
else:
    first_item = text_inputs[0]
if isinstance(first_item, (list, tuple, dict)):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(first_item, dict):
        return super().__call__(Chat(text_inputs), **forward_params)
    else:
        chats = (Chat(chat) for chat in text_inputs)
        if isinstance(text_inputs, types.GeneratorType):
            return super().__call__(chats, **forward_params)
        else:
            return super().__call__(list(chats), **forward_params)
I honestly like this, it's short and sweet, but I feel like we should maybe align with the image pipeline, e.g.
transformers/src/transformers/pipelines/image_text_to_text.py
Lines 331 to 352 in ce7a5e0
def _is_chat(arg):
    return isinstance(arg, (list, tuple, KeyDataset)) and isinstance(arg[0], (list, tuple, dict))

if _is_chat(text):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(text[0], dict):
        return super().__call__(Chat(text, images), **kwargs)
    else:
        if images is None:
            images = [None] * len(text)
        chats = [Chat(chat, image) for chat, image in zip(text, images)]  # 🐈 🐈 🐈
        return super().__call__(chats, **kwargs)
# Same as above, but the `images` argument contains the chat. This can happen e.g. is the user only passes a
# chat as a positional argument.
elif text is None and _is_chat(images):
    # We have one or more prompts in list-of-dicts format, so this is chat mode
    if isinstance(images[0], dict):
        return super().__call__(Chat(images), **kwargs)
    else:
        chats = [Chat(image) for image in images]  # 🐈 🐈 🐈
        return super().__call__(chats, **kwargs)
We're all cooking our own soup 😢
Only thing I'd change would be to avoid using a wildcard _ and give it an explicit name instead
As discussed on Slack, the __call__ logic of image-text-to-text may be more complicated because they allow users to pass images as separate arguments rather than keeping everything within the chat template?
I thought we were breaking it tho for v5? Or did I misunderstand something?
What they are planning to break for v5 in image-text-to-text is removing inputs like `images` for the pipeline, so that users stick to the chat template. See #42359
But perhaps the __call__ logic could still be simplified and shifted to base.py. Let me see...
Failing tests are either flaky (loftr one is handled elsewhere)
vasqu
left a comment
Added some smaller comments, imo we should still check how we standardize (not necessarily in this PR but we should keep this in mind / discuss before we lose sight of it)
I think there are 2 somewhat bigger things:
- seamless should be a different PR?
- check if dia works
text_inputs, _ = itertools.tee(text_inputs)
text_inputs, first_item = (x for x in text_inputs), next(_)
Let's exchange the wildcard tho, i.e. _ - trying to be explicit here
ebezzam
left a comment
@vasqu thanks for your comments!
- I tried going directly for an approach that puts common chat template logic into the base pipeline object. Let me know if we should rather focus just on `text-to-audio` and do such standardization in a separate PR.
- Double-checked Dia. You're right, `self.processor.decode` was needed for it, but I've adapted the (previous) logic so it doesn't assume such a call is needed for all models that have a processor (e.g. CSM and VibeVoice don't need to call this)
src/transformers/pipelines/base.py (outdated)
# Detect if inputs is a chat-style input and cast as `Chat` or list of `Chat`
if isinstance(
    inputs,
    (list, tuple, types.GeneratorType, KeyDataset)
    if is_torch_available()
    else (list, tuple, types.GeneratorType),
):
    if isinstance(inputs, types.GeneratorType):
        gen_copy1, gen_copy2 = itertools.tee(inputs)
        inputs = (x for x in gen_copy1)
        first_item = next(gen_copy2)
    else:
        first_item = inputs[0]
    if isinstance(first_item, (list, tuple, dict)):
        if isinstance(first_item, dict):
            inputs = Chat(inputs)
        else:
            chats = (Chat(chat) for chat in inputs)
            if isinstance(inputs, types.GeneratorType):
                inputs = chats
            else:
                inputs = list(chats)
@vasqu an idea for standardizing chat template usage in the base class.
Essentially this was in text-generation and what I had copied in text-to-audio (in what you last saw), and could potentially be used by image-text-to-text once it drops support for the images input (@zucchini-nlp)?
The following tests run as before:
RUN_SLOW=1 pytest tests/pipelines/test_pipelines_text_to_audio.py
RUN_SLOW=1 pytest tests/pipelines/test_pipelines_text_generation.py
RUN_SLOW=1 pytest tests/pipelines/test_pipelines_image_text_to_text.py
I could also run the below if you think we should check to be safe? And do you have anything else in mind that I should double-check?
RUN_SLOW=1 pytest tests/pipelines
I think our pipelines tests should be all not slow tests, but doesn't hurt to check with the env. I'm pro this!
Waiting for @zucchini-nlp if she has anything to add, if I see it correctly it's depending on #42359
# ensure audio and not codes
self.assertEqual(len(audio.shape), 1)
Added such checks to make sure audio decoding is actually working.
vasqu
left a comment
I think it's already looking pretty good. We should sync with the other pipelines PR so we don't end up with conflicting behavior.
Other than that, we really should work on enforcing good standards so that we do not have to add so many exceptions (especially now with CSM/Dia). With v5, we IMO have the opportunity to break things and make it unified cc @eustlb wdyt?
ebezzam
left a comment
@vasqu thanks for previous comments, it's getting better!
No more exceptions in this iteration, so going forward we don't need to add exceptions for newer audio models if they:
- properly use `return_dict_in_generate`
- write `audio` into the generation output dict, and otherwise write codes that need decoding into `sequences`

(rough sketch of the convention below)
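To make that convention concrete, here is a rough sketch of how postprocessing could pick the waveform under those assumptions (my own illustration, not code from this PR):

# Sketch only: "audio" holds a ready waveform written by the model's generate,
# "sequences" holds codes that still need the processor/codec to decode.
def pick_waveform(generate_output: dict, processor=None):
    if "audio" in generate_output:
        return generate_output["audio"]
    codes = generate_output["sequences"]
    if processor is not None:
        return processor.decode(codes)  # e.g. decode codec codes into a waveform
    return codes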
src/transformers/pipelines/base.py (outdated)

if isinstance(first_item, dict):
    if is_valid_chat(inputs):
        inputs = Chat(inputs)
elif isinstance(first_item, (list, tuple)):
    # materialize generator is needed
    items = list(inputs) if isinstance(inputs, types.GeneratorType) else inputs
    if all(is_valid_chat(chat) for chat in items):
        chats = (Chat(chat) for chat in items)
        if isinstance(inputs, types.GeneratorType):
            inputs = chats
        else:
            inputs = list(chats)
Changed previous logic for backward compatibility: some pipelines pass a list of objects which are not necessarily a chat template (e.g. key point matching). So I only convert if the object is a valid chat.
cc @Rocketknight1 who has more experience with pipelines and may spot an edge case 🙃
FYI no new failures when I run RUN_SLOW=1 pytest tests/pipelines
I don't see anything it's missing! You could maybe simplify it by just always calling list(inputs) and removing the extra conditionals for generators, though? Python lists only store pointers to elements so it should be basically free in terms of speed/memory, even if the chats are big.
@Rocketknight1 yes I can cast as list from the start! fyi I'll have to remove this check which errors
Yep, makes sense to me! Once we're materializing the entire generator output there's not much point in pretending we're streaming anymore.
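A rough sketch of what that simplification could look like (my illustration; `Chat` is passed in explicitly since its final import location is still being discussed):

import types

def _cast_to_chats(inputs, Chat):
    if isinstance(inputs, types.GeneratorType):
        inputs = list(inputs)  # cheap: the list only stores references to the chats
    first_item = inputs[0]
    if isinstance(first_item, dict):
        return Chat(inputs)                     # a single conversation
    if isinstance(first_item, (list, tuple)):
        return [Chat(chat) for chat in inputs]  # a batch of conversations
    return inputs                               # not chat-shaped; leave untouched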
vasqu
left a comment
LGTM overall, just make sure to sync with main and that nothing is broken because of that
The remaining comments are nits
ebezzam
left a comment
@vasqu I removed the exceptions for bark/musicgen in _forward. I needed to fix the handling of return_dict_in_generate in modeling_bark.py
if kwargs.get("return_dict_in_generate", False):
    semantic_output = semantic_output.sequences[:, max_input_semantic_length + 1 :]
else:
    semantic_output = semantic_output[:, max_input_semantic_length + 1 :]
This and below so that Bark doesn't error out when return_dict_in_generate=True is passed
# ensure dict output to facilitate postprocessing
forward_params.update({"return_dict_in_generate": True})
No more exception 🙂
if needs_decoding and self.processor is not None:
    audio = self.processor.decode(audio)
Just had to add `self.processor is not None` for MusicGen to pass. Makes sense since I set the processor to None 🤦
outputs = music_generator("This is a test")
self.assertEqual({"audio": ANY(np.ndarray), "sampling_rate": 32000}, outputs)
self.assertEqual(len(outputs["audio"].shape), n_ch)
Added checks for MusicGen and Bark to make sure the outputs are actually audio with the cleaned-up postprocessing, and for the future.
[For maintainers] Suggested jobs to run (before merge)
run-slow: bark
Cyrilvallez
left a comment
Thanks!
What does this PR do?
Processor usage within TTS pipelines is faulty.
Moreover, it does not support chat template inputs with/without audio, which is of interest for the following models for generating conversations:
Example usage for CSM and similar inputs for VibeVoice:
This PR is related to #39796, but this one is different/simpler (and I'd say more urgent) in its objective 👉 fixing processor usage and enabling chat_template inputs like above.
So I think it's worth a separate PR as #39796 requires more testing/review.
Context for current error
This line fails when trying to use the processor because it isn't loaded (so it is `None` instead). Minimal failing example:
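(A hypothetical repro along these lines; the checkpoint name is an assumption, not taken from the PR.)

from transformers import pipeline

pipe = pipeline("text-to-audio", model="sesame/csm-1b")  # assumed checkpoint
pipe("Hello world")  # previously raised, since preprocessing tried to use self.processor, which is None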
After changes
No need to explicitly ask for processor usage (auto-detected internally).
TTS example
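(Sketch of the plain TTS path; the checkpoint is a placeholder assumption.)

from transformers import pipeline

pipe = pipeline("text-to-audio", model="suno/bark-small")
out = pipe("Hello, my dog is cute")
# out["audio"] is the waveform array, out["sampling_rate"] its sampling rate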
Conversation example with chat template, which was not possible before! Again auto-detected like in text-generation pipeline. Below example mimics this usage from CSM:
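(Sketch of the conversation path; roles follow CSM's speaker-id convention, but the checkpoint and exact message format are assumptions.)

from transformers import pipeline

pipe = pipeline("text-to-audio", model="sesame/csm-1b")
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Hey, how are you doing today?"}]},
    {"role": "1", "content": [{"type": "text", "text": "Pretty good, thanks for asking!"}]},
]
out = pipe(conversation)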
@Rocketknight1 since it is pipeline related you can take a look if you want! but will definitely ask @vasqu and @eustlb for audio-specific feedback 🙂