First of all, thanks for this great model, it works really well! I now want to test something more advanced: hosting the model on a server. The Triton server code you provided already looks pretty much like what I need. Do I understand correctly that it supports batch inference? For instance, if 8 clients each send a request, can the model process those 8 requests as a single batch and generate all 8 outputs simultaneously?
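To make the batching question concrete, this is roughly the client-side load I have in mind (just a sketch; the model name `tts` and the tensor names `TEXT`/`AUDIO` are placeholders I made up, not taken from the repo):

```python
# Rough sketch: 8 clients hitting the Triton server at the same time.
# "tts", "TEXT" and "AUDIO" are placeholder names, not the real ones from the repo.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient


def synthesize(client_id: int, text: str) -> np.ndarray:
    client = httpclient.InferenceServerClient(url="localhost:8000")
    text_input = httpclient.InferInput("TEXT", [1], "BYTES")
    text_input.set_data_from_numpy(np.array([text.encode("utf-8")], dtype=object))
    result = client.infer(model_name="tts", inputs=[text_input])
    audio = result.as_numpy("AUDIO")
    print(f"client {client_id}: received audio with shape {audio.shape}")
    return audio


# All 8 requests are fired concurrently; ideally Triton's dynamic batcher
# groups them into a single forward pass on the GPU.
texts = [f"This is the text from client number {i}." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    audios = list(pool.map(synthesize, range(8), texts))
```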
If that's the case: I saw in the socket_server.py example that streaming is available. This "streaming" is basically just sending the generated audio in chunks and concatenating it on the client side, right? So the latency to the first audio is the generation time of the first chunk of text. That would mean the maximum usable batch size is the one where the generation time of a batch stays just below the duration of the audio it produces (so playback stays fluid), while the latency to the first finished chunk still meets a real-time expectation (e.g. under 2 seconds). Does the Triton server already handle this kind of concurrency? For example, if 8 clients each send a 100-second-long text, will the requests be batched so that all 8 clients are served concurrently, with each client able to start listening as soon as the first batch of 8 chunks is finished and the remaining chunks concatenated as they arrive?
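Just to be explicit about what I mean by chunked streaming, here is a rough sketch of the client side I have in mind (the port and the raw-audio-over-TCP framing are my assumptions, not the actual protocol of socket_server.py):

```python
# Rough sketch of "streaming" as I understand it: the client reads audio
# chunks off a socket, notes when the first chunk arrives (time to first
# audio), and concatenates the rest. The port, read size and framing are
# assumptions, not the real protocol of socket_server.py.
import socket
import time

HOST, PORT = "localhost", 9999          # placeholder address
READ_BYTES = 4096                       # client read size, not the server's chunk size

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall("A 100 second long text ...".encode("utf-8"))
    sock.shutdown(socket.SHUT_WR)       # signal that the text is complete

    start = time.time()
    first_chunk_at = None
    audio = bytearray()

    while True:
        chunk = sock.recv(READ_BYTES)
        if not chunk:                   # server closed the stream: synthesis done
            break
        if first_chunk_at is None:
            first_chunk_at = time.time() - start  # latency to first audio
        audio.extend(chunk)             # "concatenating" the streamed chunks

print(f"first audio after {first_chunk_at:.2f}s, {len(audio)} bytes total")
```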
Thanks in advance for any answer!