>>> naurto
It is not an easy problem. You can get some insights from this paper: [Archived Post]
>>> joshua.eisenberg
[May 7, 2019, 8:26pm]
Hey all!
Thankfully I have been able to get the pre-trained model up and running,
and producing great synthesized speech.
Some context: I want to animate a face / mouth to speak while the
synthesized audio is playing. In order to do this I need the start and
stop time of each phoneme in the synthesized speech.
I am wondering if it is possible to use the attention map to extract the
timings of the synthesized words? Once I have this, I would like to
extract the timings of each phoneme...
I would like to analyze the attention map to do this. I know I could use
an acoustic model to calculate these timings instead, but that seems like
overkill, and I thought it would be better to find a solution that's
already in the TTS library.
I originally posted on the GitHub repo, and erogol suggested looking at
the attention maps. I'm also just wondering if there is a way to get the
image / data structure that contains the attention map of a synthesized
phrase, and analyze it to get the proper timings.
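One way the attention map could be turned into rough timings (a sketch, not something from the TTS library itself): most Tacotron-style synthesizers can return the alignment as a 2-D matrix of decoder frames over input symbols, so taking the argmax per frame and converting frame indices to seconds gives approximate start/end times per symbol. The `hop_length` and `sample_rate` values below are assumptions; substitute your model's actual audio config.

```python
import numpy as np

def timings_from_attention(alignment, hop_length=256, sample_rate=22050):
    """Rough (start_sec, end_sec) per input symbol (phoneme/character).

    alignment: array of shape (decoder_steps, encoder_steps).
    hop_length and sample_rate are illustrative defaults, not the
    library's guaranteed values.
    """
    # For each decoder (audio) frame, find the input symbol it attends to most.
    best_symbol = alignment.argmax(axis=1)        # shape: (decoder_steps,)
    frame_dur = hop_length / sample_rate          # seconds per decoder frame
    timings = []
    for sym in range(alignment.shape[1]):
        frames = np.flatnonzero(best_symbol == sym)
        if frames.size == 0:                      # symbol never dominant
            timings.append(None)
            continue
        timings.append((frames[0] * frame_dur, (frames[-1] + 1) * frame_dur))
    return timings

# Toy alignment: 6 frames over 3 symbols, roughly diagonal.
toy = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.2, 0.0],
                [0.1, 0.8, 0.1],
                [0.0, 0.7, 0.3],
                [0.0, 0.2, 0.8],
                [0.0, 0.1, 0.9]])
print(timings_from_attention(toy))
```

Word timings would then just be the span from the first to the last phoneme of each word. Note that attention alignments can be non-monotonic or diffuse, so this is only an approximation compared to a forced aligner.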
Thanks for any help!
😄
[This is an archived TTS discussion thread from discourse.mozilla.org/t/extract-timing-of-phonemes-and-words-from-attention-map]