Various optimizations and reliability improvements #658

Open
docteurZ wants to merge 31 commits into main from optim_kyutai

docteurZ (Collaborator) commented Feb 7, 2026

No description provided.

docteurZ and others added 20 commits January 20, 2026 15:43
…torch. Didn't manage to do it in requirement.txt without pulling 2 GB of CUDA
- WebRTC VAD (default): Fast, lightweight, good for most cases
- Silero VAD: More accurate, especially for noisy environments
…ering

- Pluggable VAD: WebRTC (default) or Silero via VAD_PROVIDER env var
- Trim trailing silence from utterances before saving
- Filter low-quality utterances: min 15% speech ratio, min 200ms duration
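The filtering rules listed above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the VAD has already produced one speech/non-speech flag per 10 ms frame, and the function names, the frame size, and the choice to measure the speech ratio after trimming are all assumptions. Only `VAD_PROVIDER` and the 15% / 200 ms thresholds come from the commit message.

```python
import os

MIN_SPEECH_RATIO = 0.15  # from the commit message: min 15% speech ratio
MIN_DURATION_MS = 200    # from the commit message: min 200 ms duration
FRAME_MS = 10            # assumed frame length; not stated in the commit


def get_vad_provider() -> str:
    """Read the VAD choice; the accepted values here are assumptions."""
    return os.environ.get("VAD_PROVIDER", "webrtc").lower()


def trim_trailing_silence(flags: list[bool]) -> list[bool]:
    """Drop non-speech frames from the end of an utterance."""
    end = len(flags)
    while end > 0 and not flags[end - 1]:
        end -= 1
    return flags[:end]


def keep_utterance(flags: list[bool], frame_ms: int = FRAME_MS) -> bool:
    """Return True if the trimmed utterance is long and speech-dense enough."""
    trimmed = trim_trailing_silence(flags)
    if not trimmed:
        return False
    duration_ms = len(trimmed) * frame_ms
    speech_ratio = sum(trimmed) / len(trimmed)
    return duration_ms >= MIN_DURATION_MS and speech_ratio >= MIN_SPEECH_RATIO
```

Under these assumptions, a 100 ms burst of speech is dropped for being too short, and a 310 ms utterance with only three speech frames is dropped for falling under the 15% ratio.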
…cing false cutoffs and utterance fragmentation.
…ic VAD on the server side). Add a basic RMS energy filter to avoid sending pure silence, though.
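A basic RMS energy gate of the kind mentioned here might look like the sketch below. It assumes 16-bit little-endian PCM; the threshold value and the function names are hypothetical, since the commit message does not state them.

```python
import math
import struct

RMS_SILENCE_THRESHOLD = 100.0  # hypothetical value; tune against real audio


def rms_int16(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian PCM samples."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def should_send(pcm: bytes, threshold: float = RMS_SILENCE_THRESHOLD) -> bool:
    """Skip chunks whose energy is below the silence threshold."""
    return rms_int16(pcm) >= threshold
```

The gate only suppresses pure silence; deciding where speech starts and ends is still left to the provider's own VAD.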
…ion. Streaming providers (Deepgram, Kyutai) have their own VAD in the streaming manager. Async transcription uses audio_chunk_buffer_manager inside the streaming manager.
…tually spoke), not when Kyutai returned the transcription 500ms later.
The health check only detected initial connection failures (no words ever
received). Once any word was received, _ever_received_word became True
and the health check was effectively disabled for the session.
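One way to keep a health check active for the whole session is to track unanswered speech rather than a one-time "ever received" flag. The sketch below is an assumption about how that could work, not the fix in this PR: the class, its method names, and the timeout are hypothetical, and it presumes the caller can tell when speech-bearing audio was sent.

```python
from typing import Optional


class StreamHealth:
    """Flags a stall when sent speech gets no words back within timeout_s."""

    def __init__(self, timeout_s: float = 10.0) -> None:
        self.timeout_s = timeout_s
        # Time at which the oldest still-unanswered speech audio was sent.
        self.first_unanswered_speech_at: Optional[float] = None

    def on_speech_audio_sent(self, now: float) -> None:
        if self.first_unanswered_speech_at is None:
            self.first_unanswered_speech_at = now

    def on_word_received(self, now: float) -> None:
        # Any word resets the timer, but the check stays armed afterwards.
        self.first_unanswered_speech_at = None

    def is_healthy(self, now: float) -> bool:
        if self.first_unanswered_speech_at is None:
            return True
        return now - self.first_unanswered_speech_at < self.timeout_s
```

Because nothing latches permanently after the first word, a connection that goes quiet mid-session is reported the same way as one that never produced a word.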
docteurZ requested a review from a team as a code owner February 7, 2026 17:19
… threshold

The new VAD filtering in PerParticipantNonStreamingAudioInputManager
discards utterances shorter than 200ms. MockPCMAudioFrame was only
generating 10ms of audio, causing all zoom bot tests that depend on
utterance creation to fail.
… is disabled

Add use_streaming_transcription() to should_capture_audio_chunks() and
bump test audio duration to meet MIN_DURATION_MS threshold.
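The arithmetic behind the test failure is easy to check. The sketch below assumes 16 kHz mono 16-bit PCM, which is an illustrative choice and not taken from the test suite; the helper names are hypothetical, and only the 10 ms and 200 ms figures come from the commit messages.

```python
MIN_DURATION_MS = 200  # threshold named in the commit message


def pcm_num_bytes(duration_ms: int, sample_rate: int = 16000,
                  channels: int = 1, sample_width: int = 2) -> int:
    """Bytes of PCM needed to hold duration_ms of audio."""
    return sample_rate * duration_ms // 1000 * channels * sample_width


def pcm_duration_ms(num_bytes: int, sample_rate: int = 16000,
                    channels: int = 1, sample_width: int = 2) -> int:
    """Duration in milliseconds represented by num_bytes of PCM."""
    frames = num_bytes // (channels * sample_width)
    return frames * 1000 // sample_rate


def make_silent_pcm(duration_ms: int) -> bytes:
    """Zero-filled mock audio of the requested duration."""
    return bytes(pcm_num_bytes(duration_ms))
```

Under these assumptions a 10 ms mock frame is 320 bytes, while a frame that clears the threshold needs at least 6400 bytes.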
Zoom SDK delivers audio/video on separate C++ threads that compete for
Python's GIL. When the video thread holds the GIL during scale_i420()
or blocks on push-buffer while x264 catches up, the audio callback
can't run and the SDK silently drops 10ms audio frames. This produces
thousands of micro-gaps that worsen as participant count increases.

Three changes:

1. Decouple SDK audio callback via deque + drain thread
   (zoom_bot_adapter.py)
   The callback now just copies bytes and returns in ~20μs.
   A separate thread pushes to GStreamer at its own pace.
   Timestamps captured at SDK delivery time preserve A/V sync.

2. Set block=False on video and audio appsrcs
   (gstreamer_pipeline.py)
   push-buffer returns immediately regardless of pipeline backpressure.

3. Make video queues q1/q2/q3 leaky=downstream
   (gstreamer_pipeline.py)
   x264 backpressure drops old video frames instead of propagating
   to the muxer and starving the audio path.

Google Meet/Teams are unaffected — they use separate pipelines or none.
Zoom RTMS uses its own RTMSGstreamerPipeline class, also unaffected.

Tradeoff: if x264 can't keep up, video frames drop (logged by queue
monitor) instead of invisible audio gaps. Video drops are far less
perceptible than audio micro-gaps.
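The first change, decoupling the SDK callback through a deque and a drain thread, can be sketched as below. This is an illustration of the pattern only: the class name, the `push` callable standing in for the GStreamer push, the queue bound, and the use of `time.monotonic_ns()` for timestamps are all assumptions rather than the code in zoom_bot_adapter.py.

```python
import collections
import threading
import time
from typing import Callable


class AudioDrain:
    """Hands audio from a fast callback to a slower consumer thread."""

    def __init__(self, push: Callable[[bytes, int], None], maxlen: int = 1000):
        # Bounded deque: if the consumer stalls, the oldest frames are dropped.
        self._queue = collections.deque(maxlen=maxlen)
        self._push = push
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def on_audio(self, data: bytes) -> None:
        """Callback side: copy the bytes, stamp the arrival time, return."""
        self._queue.append((bytes(data), time.monotonic_ns()))
        self._wake.set()

    def _run(self) -> None:
        while True:
            self._wake.wait(timeout=0.1)
            self._wake.clear()
            while True:
                try:
                    data, ts = self._queue.popleft()
                except IndexError:
                    break
                self._push(data, ts)  # may block without stalling on_audio
            if self._stop.is_set() and not self._queue:
                return

    def stop(self) -> None:
        """Drain whatever is left, then join the thread."""
        self._stop.set()
        self._wake.set()
        self._thread.join()
```

`deque.append` and `deque.popleft` are thread-safe in CPython, so the callback needs no lock. The timestamp is taken at delivery time, so a slow consumer delays the push without shifting the recorded arrival time.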
- Decouple audio SDK callback from GStreamer via deque + drain thread
- Decouple video SDK callback from scale_i420 via deque + drain thread
- Set block=False on appsrcs, leaky downstream video queues
- Remove audiorate element (amplified jitter into silence insertions)
- Pass SDK timestamps through to GStreamer for accurate A/V sync
2 participants