
Conversation

paarthneekhara
Collaborator

This PR introduces two main updates.

  1. EOS and Finished/unfinished sentences:
  • Added an option to disable finished/unfinished sentence tracking when the prior is applied. I sometimes notice that forcing or disallowing EOS prediction introduces artifacts towards the end, unless we carefully tune the unfinished and finished sentence constants. I'm finding we don't really need to handhold whether the sentence is finished or not and can trust the model's decision, even when we apply the prior. So I have added an argument to neither force nor disallow EOS prediction, enabled by setting ignore_finished_sentence_tracking=True.

  • EOS detection - Previously, the logic was that if any codebook in multinomial or argmax sampling predicted an EOS token, we predicted the end. VERY RARELY, I notice this leads to predicting EOS abruptly. At some point, our logic was that if the argmax of codebook 0 is EOS, we predict the end. I have added a few options: predict EOS if any, all, or the zeroth codebook is EOS, and whether to only look at argmax sampling, or at either of argmax or multinomial sampling. I am keeping this customizable because I suspect the parallel-prediction EOS logic might differ from the MaskGIT EOS logic.
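A minimal sketch of the configurable check described above (the function and argument names, and the plain-list frame representation, are illustrative assumptions, not the actual PR code, which operates on tensors):

```python
# Hedged sketch of the configurable EOS detection described above.
def detect_eos_frame(codes_argmax, codes_multinomial, eos_id,
                     codebook_rule="any", sampling_rule="argmax_only"):
    """codes_*: per-codebook token ids for the current frame."""
    def frame_is_eos(frame):
        hits = [tok == eos_id for tok in frame]
        if codebook_rule == "any":      # EOS in any codebook ends generation
            return any(hits)
        if codebook_rule == "all":      # all codebooks must agree
            return all(hits)
        if codebook_rule == "zeroth":   # only codebook 0 decides
            return hits[0]
        raise ValueError(f"unknown codebook_rule: {codebook_rule}")

    if sampling_rule == "argmax_only":
        return frame_is_eos(codes_argmax)
    # "argmax_or_multinomial": EOS if either sampling method predicts it
    return frame_is_eos(codes_argmax) or frame_is_eos(codes_multinomial)
```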

  2. Added an option to map text contexts to a common text context during training (with a probability). We can supply a JSON file containing a dictionary that states which text contexts should be mapped to some other text context - for example, Lindy_calm and Lindy_angry mapped to Lindy_all. During training, we then apply this mapping with the supplied probability. The main objective is that, for a speaker, we can learn all emotions and expect the model to infer the emotion from the transcript during inference.
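The remapping step could be sketched as follows, assuming a JSON dictionary like the Lindy example above (the helper and its names are hypothetical, not the PR's implementation):

```python
import random

# Hypothetical sketch of the probabilistic context remapping; the mapping
# would be loaded from the supplied JSON file, e.g.:
#   {"Lindy_calm": "Lindy_all", "Lindy_angry": "Lindy_all"}
def maybe_remap_context(context_name, context_map, remap_prob, rng=random):
    """With probability `remap_prob`, map `context_name` via `context_map`."""
    if context_name in context_map and rng.random() < remap_prob:
        return context_map[context_name]
    return context_name
```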

@github-actions github-actions bot added the TTS label Aug 25, 2025
Signed-off-by: Paarth Neekhara <[email protected]>
@paarthneekhara paarthneekhara changed the base branch from magpietts_2503 to magpietts_2508 August 25, 2025 15:31
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Aug 25, 2025
@rfejgin
Collaborator

rfejgin commented Aug 25, 2025

Regarding EOS detection: it's interesting that detecting from all codebooks can lead to early termination -- good find!

About the option of detecting from codebook 0 only: there is the issue of what happens if an EOS appears only in a codebook other than 0 and we ignore it (for EOS detection purposes). Then we'd end up replacing the token with token ID 0 before decoding with the codec, but we know 0 doesn't necessarily correspond to codec silence.

But maybe we need to consider these two mechanisms separately:

  1. EOS detection: use whatever method works best (e.g. from the set you implemented here)
  2. "cleaning" of tokens before returning to user/codec: replace EOS tokens with "silent" tokens. We'd need to know, per codec release, which tokens it maps to silence. The complication is that we may have a frame of mixed silence and non-silence tokens; but it's probably still better to use a "quiet" token for the codebook that contained the EOS than to arbitrarily choose 0. Anyway, for this mechanism to work we need to either query the codec at init for silence tokens (e.g. by encoding a bit of digital silence), or keep track of that per codec release. I prefer the latter so that we don't need to include a codec encoder in our releases. Maybe we should require a get_silence_tokens() method to be added to the codec API. This would only have to be computed once per codec release and should be pretty easy for the codec maintainer to do.

If (2) works better that'll give (1) more flexibility.
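Mechanism (2) could look roughly like this, assuming per-codebook silence tokens are available (e.g. from the proposed get_silence_tokens(); the function name and values here are assumptions):

```python
# Sketch of mechanism (2): replace an EOS token with the corresponding
# codebook's silence token instead of token ID 0. `silence_tokens` would
# come from the proposed get_silence_tokens() codec API (hypothetical).
def clean_eos_tokens(frame, eos_id, silence_tokens):
    """frame: per-codebook token ids; silence_tokens: one id per codebook."""
    return [silence_tokens[cb] if tok == eos_id else tok
            for cb, tok in enumerate(frame)]
```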

@rfejgin
Collaborator

rfejgin commented Aug 25, 2025

Cool addition of the text remapping logic, by the way!

@paarthneekhara
Collaborator Author

> Regarding EOS detection: it's interesting that detecting from all codebooks can lead to early termination -- good find!
>
> About the option of detecting from codebook 0 only: there is the issue of what happens if an EOS appears only in a codebook other than 0 and we ignore it (for EOS detection purposes). Then we'd end up replacing the token with token ID 0 before decoding with the codec, but we know 0 doesn't necessarily correspond to codec silence.
>
> But maybe we need to consider these two mechanisms separately:
>
>   1. EOS detection: use whatever method works best (e.g. from the set you implemented here)
>   2. "cleaning" of tokens before returning to user/codec: replace EOS tokens with "silent" tokens. We'd need to know, per codec release, which tokens it maps to silence. Maybe we should require a get_silence_tokens() method to be added to the codec API.
>
> If (2) works better that'll give (1) more flexibility.

That's a good point. We should figure out a better way to clean up the codes. Are you suggesting that if we want to clamp a particular codebook to be within range, we should make the whole frame silent, or just that codebook?

@rfejgin
Collaborator

rfejgin commented Aug 26, 2025

> That's a good point. We should figure out a better way to clean up the codes. Are you suggesting that if we want to clamp a particular codebook to be within range, we should make the whole frame silent, or just that codebook?

I think it probably requires experimentation, e.g. listening tests to find out which of the options we're considering sounds best:

  1. Replace the particular codebook's token with token 0 (what we have now)
  2. Replace all codebooks with their corresponding silence tokens
  3. Replace only the particular codebook with its silence token

One way to choose would be to load the codes of a real speech signal, randomly choose a few positions (frame + codebook combinations) to corrupt with the 3 methods above, then listen to all 3 and choose the one that sounds best.
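The listening test could be set up roughly like this (a sketch; the helper, the method names, and the silence-token values are all assumptions):

```python
def corrupt_codes(codes, positions, method, silence_tokens):
    """codes: list of frames, each a list of per-codebook token ids.
    positions: (frame, codebook) pairs to corrupt with the chosen method."""
    out = [frame[:] for frame in codes]  # don't mutate the originals
    for f, cb in positions:
        if method == "zero_token":            # option 1: current behavior
            out[f][cb] = 0
        elif method == "all_silence":         # option 2: whole frame silent
            out[f] = list(silence_tokens)
        elif method == "codebook_silence":    # option 3: only that codebook
            out[f][cb] = silence_tokens[cb]
        else:
            raise ValueError(f"unknown method: {method}")
    return out
```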

@rfejgin
Collaborator

rfejgin commented Aug 27, 2025

> That's a good point. We should figure out a better way to clean up the codes. Are you suggesting that if we want to clamp a particular codebook to be within range, we should make the whole frame silent, or just that codebook?
>
> I think it probably requires experimentation, e.g. listening tests to find out which of the options we're considering sounds best:
>
>   1. Replace the particular codebook's token with token 0 (what we have now)
>   2. Replace all codebooks with their corresponding silence tokens
>   3. Replace only the particular codebook with its silence token
>
> One way to choose would be to load the codes of a real speech signal, randomly choose a few positions (frame + codebook combinations) to corrupt with the 3 methods above, then listen to all 3 and choose the one that sounds best.

An alternative that came to mind: if we detect an EOS that we want to ignore, e.g. because it's not in codebook 0, and we need to replace it with something: resample. It wouldn't require another forward pass and should give something plausible (possibly better than silence). But we'd have to make sure not to get stuck in an infinite resampling loop (we could limit how many times we resample, then accept EOS if it insists...).

Edit: On second thought, the resampling option only works for parallel prediction. If we have a local transformer, we'd have to do something else: either run the LT again (which is higher complexity than resampling), replace with silence, or maybe just resample the particular codebook from the parallel head.
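The capped resampling could look roughly like this (a sketch; `sample_fn` stands in for whatever draws a token for the codebook in question):

```python
def sample_non_eos(sample_fn, eos_id, max_resamples=5):
    """Redraw while we get EOS, up to `max_resamples` times; if EOS still
    insists after that, accept it."""
    token = sample_fn()
    for _ in range(max_resamples):
        if token != eos_id:
            break
        token = sample_fn()  # redraw from the same distribution
    return token
```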

Collaborator

@blisc left a comment


Some minor style changes

```diff
@@ -1997,7 +2013,7 @@ def get_inference_attention_plots(

         return cross_attention_maps, headwise_cross_attention_maps

-    def find_eos_frame_index(self, codes) -> Optional[int]:
+    def find_eos_frame_index(self, codes, eos_detection_method) -> Optional[int]:
```
Collaborator


Since you changed the default return of this function, can you update the typehint and docstring to more accurately match the updated function?

```diff
-        return None
+        return float('inf')
```
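The requested update might look like this (a simplified, hypothetical sketch: the real method also takes self and eos_detection_method; shown standalone with a codebook-0 rule for brevity):

```python
from typing import Union

def find_eos_frame_index(codes, eos_id) -> Union[int, float]:
    """Return the index of the first EOS frame in `codes`, or float('inf')
    if no EOS frame is found (the new sentinel replacing None)."""
    for i, frame in enumerate(codes):
        if frame[0] == eos_id:  # simplified: codebook-0 rule only
            return i
    return float('inf')
```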

```python
def detect_eos(self, audio_codes_multinomial, audio_codes_argmax, eos_detection_method):
```
Collaborator


Can you add typehints and a docstring to this function?

):

Collaborator


We should add a docstring to infer_batch, since there are now many arguments in this function.

@shehzeen
Collaborator

We should update the infer_batch calls in magpietts_preference_optimization.py to use the right arguments for the end-detection logic during GRPO/DPO data generation.
