Silero VAD #608
Open
docteurZ wants to merge 12 commits into
…torch. Didn't manage to do it in requirement.txt without pulling 2 GB of CUDA
- WebRTC VAD (default): Fast, lightweight, good for most cases
- Silero VAD: More accurate, especially for noisy environments

…ering
- Pluggable VAD: WebRTC (default) or Silero via VAD_PROVIDER env var
- Trim trailing silence from utterances before saving
- Filter low-quality utterances: min 15% speech ratio, min 200ms duration
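The trimming and filtering rules in the commit message above can be sketched roughly as follows. This is only an illustration, not the code from this PR: the function names, the per-frame boolean representation, the 30 ms frame length, and the choice to compute the speech ratio after trimming are all assumptions. Only the 15% and 200 ms thresholds come from the commit message.

```python
from typing import List

FRAME_MS = 30            # assumed frame duration; not stated in the PR
MIN_SPEECH_RATIO = 0.15  # from the commit message: min 15% speech ratio
MIN_DURATION_MS = 200    # from the commit message: min 200ms duration


def trim_trailing_silence(flags: List[bool]) -> List[bool]:
    """Drop non-speech frames from the end of an utterance.

    `flags` holds one VAD decision per frame (True means speech).
    """
    end = len(flags)
    while end > 0 and not flags[end - 1]:
        end -= 1
    return flags[:end]


def keep_utterance(flags: List[bool], frame_ms: int = FRAME_MS) -> bool:
    """Return True if the trimmed utterance passes both quality checks."""
    trimmed = trim_trailing_silence(flags)
    if not trimmed:
        return False  # nothing but silence
    duration_ms = len(trimmed) * frame_ms
    speech_ratio = sum(trimmed) / len(trimmed)
    return duration_ms >= MIN_DURATION_MS and speech_ratio >= MIN_SPEECH_RATIO
```

With these assumptions, a three-frame burst followed by silence is rejected as too short, and a long utterance with only a couple of speech frames is rejected on ratio.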
…cing false cutoffs and utterance fragmentation.
…ic VAD on the server side). Add a basic RMS energy filter to avoid sending pure silence, though.
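A basic RMS energy gate of the kind this commit message mentions might look like the sketch below. It assumes 16-bit little-endian mono PCM chunks; the threshold value and the names `rms_int16` and `should_send` are hypothetical and not taken from the PR.

```python
import math
import struct

RMS_THRESHOLD = 100.0  # assumed value; tune against real recordings


def rms_int16(chunk: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM chunk."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    # Ignore a dangling odd byte, if any.
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def should_send(chunk: bytes, threshold: float = RMS_THRESHOLD) -> bool:
    """Skip chunks that are effectively silent before streaming them."""
    return rms_int16(chunk) >= threshold
```

A gate like this only suppresses pure silence; deciding where speech starts and ends is still left to the provider's own VAD.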
…ion. Streaming providers (Deepgram, Kyutai) have their own VAD in the streaming manager. Async transcription uses audio_chunk_buffer_manager inside the streaming manager.
This PR improves audio processing and VAD reliability for transcription.
It refactors utterance buffering to explicitly track silence, trims trailing silence, and filters low-quality utterances, reducing noise and false positives. VAD is now pluggable (WebRTC or Silero via VAD_PROVIDER). Silero is more CPU-intensive but generally more robust in noisy conditions, and it does not require high RMS levels to perform well. Docker was updated to support CPU-only Silero.
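One way to make the VAD backend selectable through `VAD_PROVIDER` is a small registry keyed by provider name, sketched below. This is an assumption about the shape of such a mechanism, not the PR's actual code: the accepted value `"webrtc"`, the `register_vad` and `create_vad` helpers, and the aggressiveness level 2 are illustrative. The `webrtcvad` import is lazy, so the package is only needed when that provider is selected.

```python
import os
from typing import Callable, Dict, Protocol


class VoiceActivityDetector(Protocol):
    """Minimal interface every backend is assumed to expose."""

    def is_speech(self, frame: bytes, sample_rate: int) -> bool: ...


_REGISTRY: Dict[str, Callable[[], VoiceActivityDetector]] = {}


def register_vad(name: str, factory: Callable[[], VoiceActivityDetector]) -> None:
    """Register a factory under a case-insensitive provider name."""
    _REGISTRY[name.lower()] = factory


def create_vad(default: str = "webrtc") -> VoiceActivityDetector:
    """Build the detector named by the VAD_PROVIDER environment variable."""
    name = os.environ.get("VAD_PROVIDER", default).strip().lower()
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown VAD_PROVIDER {name!r}; expected one of {sorted(_REGISTRY)}"
        ) from None
    return factory()


def _make_webrtc() -> VoiceActivityDetector:
    import webrtcvad  # lazy import: only required for this provider

    return webrtcvad.Vad(2)  # aggressiveness 0-3; 2 is an assumed default


register_vad("webrtc", _make_webrtc)
# A Silero backend would be registered the same way, behind a wrapper that
# adapts the model to the is_speech() interface (wrapper not shown here).
```

Keeping the heavy imports inside the factories means a WebRTC-only deployment would not need torch installed.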
Streaming audio can optionally buffer chunks for async transcription, with cleaner flushing and tuned silence thresholds. Overall, this improves accuracy, configurability, and async transcription support.
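The buffering and flushing behaviour described here could be sketched as a small chunk buffer that tracks silence explicitly. The class below is hypothetical and is not the `audio_chunk_buffer_manager` from this PR; the 20-chunk silence threshold and the method names are assumptions chosen for illustration.

```python
from typing import List, Optional


class ChunkBuffer:
    """Accumulate audio chunks and emit one utterance after enough silence."""

    def __init__(self, silence_chunks_to_flush: int = 20) -> None:
        self.silence_chunks_to_flush = silence_chunks_to_flush
        self._chunks: List[bytes] = []
        self._silence_run = 0      # consecutive silent chunks at the tail
        self._has_speech = False   # whether any speech has been buffered

    def add(self, chunk: bytes, is_speech: bool) -> Optional[bytes]:
        """Add a chunk; return the finished utterance if this chunk closes it."""
        if is_speech:
            self._has_speech = True
            self._silence_run = 0
        elif not self._has_speech:
            return None  # drop leading silence
        else:
            self._silence_run += 1
        self._chunks.append(chunk)
        if self._silence_run >= self.silence_chunks_to_flush:
            return self.flush()
        return None

    def flush(self) -> Optional[bytes]:
        """Return buffered audio without trailing silence, then reset."""
        keep = len(self._chunks) - self._silence_run
        data = b"".join(self._chunks[:keep]) if self._has_speech else b""
        self._chunks = []
        self._silence_run = 0
        self._has_speech = False
        return data or None
```

A caller would hand each returned utterance to the async transcription path and call `flush()` once more when the stream ends.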