>>> naurto
It is not an easy problem. You can get some insights from this paper: [Archived Post]
>>> joshua.eisenberg
[May 7, 2019, 8:26pm]
Hey all!
Thankfully I have been able to get the pre-trained model up and running,
and producing great synthesized speech.
Some context: I want to animate a face / mouth to speak while the
synthesized audio is playing. In order to do this I need the start and
stop time of each phoneme in the synthesized speech.
I am wondering if it is possible to use the attention map to extract the
timings of the synthesized words? Once I have this, I would like to
extract the timings of each phoneme...
I would like to analyze the attention map to do this. I know I could use
an acoustic model to calculate these timings instead, but that seems like
overkill, and I thought it would be better to find a solution that's
already in the TTS library.
I originally posted on the GitHub repo, and erogol suggested looking at
the attention maps. I'm also just wondering if there is a way to get the
image / data structure that contains the attention map of a synthesized
phrase, and analyze it to get the proper timings.
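One way the attention map could be turned into rough timings (a sketch, not something from the TTS library itself): most Tacotron-style synthesizers can return the alignment as a 2-D matrix of decoder frames over input symbols, so taking the argmax per frame and converting frame indices to seconds gives approximate start/end times per symbol. The `hop_length` and `sample_rate` values below are assumptions; substitute your model's actual audio config.

```python
import numpy as np

def timings_from_attention(alignment, hop_length=256, sample_rate=22050):
    """Rough (start_sec, end_sec) per input symbol (phoneme/character).

    alignment: array of shape (decoder_steps, encoder_steps).
    hop_length and sample_rate are illustrative defaults, not the
    library's guaranteed values.
    """
    # For each decoder (audio) frame, find the input symbol it attends to most.
    best_symbol = alignment.argmax(axis=1)        # shape: (decoder_steps,)
    frame_dur = hop_length / sample_rate          # seconds per decoder frame
    timings = []
    for sym in range(alignment.shape[1]):
        frames = np.flatnonzero(best_symbol == sym)
        if frames.size == 0:                      # symbol never dominant
            timings.append(None)
            continue
        timings.append((frames[0] * frame_dur, (frames[-1] + 1) * frame_dur))
    return timings

# Toy alignment: 6 frames over 3 symbols, roughly diagonal.
toy = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.0],
                [0.1, 0.8, 0.1],
                [0.0, 0.7, 0.3],
                [0.0, 0.2, 0.8],
                [0.0, 0.1, 0.9]])
print(timings_from_attention(toy))
```

Word timings would then just be the span from the first to the last phoneme of each word. Note that attention alignments can be non-monotonic or diffuse, so this is only an approximation compared to a forced aligner.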
Thanks for any help!
😄
[This is an archived TTS discussion thread from discourse.mozilla.org/t/extract-timing-of-phonemes-and-words-from-attention-map]