First of all, thanks for this great model, it works really well! I now want to test something more advanced: hosting the model on a server. The Triton server code you provided already looks pretty much like what I need. Do I understand correctly that it supports batch inference? For instance, if 8 clients each send a request, can the model process those 8 requests as a single batch and generate all 8 outputs simultaneously?
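To make the batching question concrete, this is roughly the client-side load I have in mind (just a sketch; the model name `tts` and the tensor names `TEXT`/`AUDIO` are placeholders I made up, not taken from the repo):

```python
# Rough sketch: 8 clients hitting the Triton server at the same time.
# "tts", "TEXT" and "AUDIO" are placeholder names, not the real ones from the repo.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient


def synthesize(client_id: int, text: str) -> np.ndarray:
    client = httpclient.InferenceServerClient(url="localhost:8000")
    text_input = httpclient.InferInput("TEXT", [1], "BYTES")
    text_input.set_data_from_numpy(np.array([text.encode("utf-8")], dtype=object))
    result = client.infer(model_name="tts", inputs=[text_input])
    audio = result.as_numpy("AUDIO")
    print(f"client {client_id}: received audio with shape {audio.shape}")
    return audio


# All 8 requests are fired concurrently; ideally Triton's dynamic batcher
# groups them into a single forward pass on the GPU.
texts = [f"This is the text from client number {i}." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    audios = list(pool.map(synthesize, range(8), texts))
```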
If that's the case: I saw in the socket_server.py example that streaming is available. This "streaming" is basically just sending the generated audio in chunks and concatenating it on the client side, right? So the latency to the first audio is the generation time of the first chunk of text. That would mean the maximum usable batch size is the one where the generation time of a batch stays just below the duration of the audio it produces (so playback stays fluid), while the latency to the first finished chunk still meets a real-time expectation (e.g. under 2 seconds). Does the Triton server already handle this kind of concurrency? For example, if 8 clients each send a 100-second-long text, will the requests be batched so that all 8 clients are served concurrently, with each client able to start listening as soon as the first batch of 8 chunks is finished and the remaining chunks concatenated as they arrive?
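Just to be explicit about what I mean by chunked streaming, here is a rough sketch of the client side I have in mind (the port and the raw-audio-over-TCP framing are my assumptions, not the actual protocol of socket_server.py):

```python
# Rough sketch of "streaming" as I understand it: the client reads audio
# chunks off a socket, notes when the first chunk arrives (time to first
# audio), and concatenates the rest. The port, read size and framing are
# assumptions, not the real protocol of socket_server.py.
import socket
import time

HOST, PORT = "localhost", 9999          # placeholder address
READ_BYTES = 4096                       # client read size, not the server's chunk size

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall("A 100 second long text ...".encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)       # signal that the text is complete

    start = time.time()
    first_chunk_at = None
    audio = bytearray()

    while True:
        chunk = sock.recv(READ_BYTES)
        if not chunk:                   # server closed the stream: synthesis done
            break
        if first_chunk_at is None:
            first_chunk_at = time.time() - start  # latency to first audio
        audio.extend(chunk)             # "concatenating" the streamed chunks

print(f"first audio after {first_chunk_at:.2f}s, {len(audio)} bytes total")
```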
Thanks in advance for any answer!