Commit 3d2635a

VAD issues
1 parent ba1d832 commit 3d2635a

18 files changed: +2344 -1609 lines changed

CHANGELOG.md

Lines changed: 35 additions & 0 deletions
@@ -2,6 +2,41 @@
 
 All significant changes to this project are documented here.
 
+## [Unreleased]
+
+### Fixed
+- 🐛 **TextRefiner topic detection now respects backend setting**
+  - Fixed hardcoded Ollama call in `_detect_topic()` method
+  - Topic detection now correctly uses OpenAI API when `--refine-backend openai_api` is specified
+  - Resolves 404 error when using OpenAI backend without Ollama running
+
+### Added
+- **OpenAI support for Whisper prompt generation**
+  - Enhanced `create_whisper_prompt_with_llm()` to support both Ollama and OpenAI backends
+  - Automatically uses the same backend as refinement (`--refine-backend`) for prompt generation
+  - Improves consistency when using OpenAI API throughout the pipeline
+- 🎯 **Audio preprocessing for speaker diarization**
+  - Automatic conversion to mono 16kHz
+  - RMS volume normalization to -20 dBFS
+  - Clipping prevention
+  - Optional noise reduction support (via noisereduce library)
+  - Helps reduce false speaker clusters from volume variations and background noise
+  - New function: `_preprocess_audio_for_diarization()`
+
+### Documentation
+- 📝 **Added FAQ entries for speaker diarization warnings**
+  - Documented torchcodec FFmpeg version warning (safe to ignore)
+  - Documented pyannote std() warning (safe to ignore)
+  - Explained fallback audio loading mechanism
+  - Added quick reference in README troubleshooting section
+- 📝 **Added speaker diarization accuracy information**
+  - New FAQ section: "How accurate is speaker diarization?"
+  - Documented over-segmentation limitation (one speaker → multiple labels)
+  - Added accuracy guidelines based on audio quality
+  - Recommendation to verify speaker labels manually for critical applications
+  - Added warning note in README highlights
+  - Documented automatic audio preprocessing features
+
 ## [1.3.0] - 2025-10-10
 
 ### Added

CHEATSHEET.md

Lines changed: 3 additions & 3 deletions
@@ -90,7 +90,7 @@ python -m src.main --url "..." --transcribe whisper_base --prompt prompt.txt
 ## 📁 Project layout
 
 ```
-youtube-transcriber/
+yt-transcriber/
 ├── src/ # Source code
 ├── tests/ # Automated tests
 ├── output/ # Results ← start here
@@ -192,9 +192,9 @@ ffmpeg -version # verify FFmpeg
 ## 🐳 Docker essentials
 
 ```bash
-docker build -t youtube-transcriber .
+docker build -t yt-transcriber .
 
-docker run -v $(pwd)/output:/app/output youtube-transcriber --url "YOUTUBE_URL" --transcribe whisper_base
+docker run -v $(pwd)/output:/app/output yt-transcriber --url "YOUTUBE_URL" --transcribe whisper_base
 
 # docker compose
 docker-compose up # foreground

FAQ.md

Lines changed: 101 additions & 0 deletions
@@ -132,6 +132,45 @@ Always proofread technical translations.
 
 ---
 
+### How accurate is speaker diarization?
+
+Speaker diarization (enabled with `--speakers`) identifies different speakers in audio and labels them as SPEAKER_00, SPEAKER_01, etc.
+
+**Accuracy depends on:**
+- **Audio quality:** single-channel recordings may reduce accuracy
+- **Recording setup:** studio mics (high quality) vs phone recordings (lower quality)
+- **Speaker overlap:** people talking over each other causes confusion
+- **Voice similarity:** similar-sounding speakers are harder to distinguish
+- **Microphone distance changes:** one speaker moving closer/farther may be split into multiple labels
+
+**Typical results:**
+- ✅ Clean studio recordings with distinct voices: 85–95% accuracy
+- ⚠️ Phone/video calls: 70–85% accuracy
+- ⚠️ Noisy environments or overlapping speech: 50–70% accuracy
+
+**Known limitations:**
+- ⚠️ **Over-segmentation:** One speaker may be assigned multiple labels (e.g., SPEAKER_00 and SPEAKER_01 for the same person)
+  - Common when a speaker changes tone or distance from the mic, or after long pauses
+  - Manual review recommended for critical applications
+- ⚠️ **Under-segmentation:** Multiple speakers may be assigned the same label
+  - Less common, happens with very similar voices
+
+**Recommendation:** Use speaker labels as a guide, but verify important segments manually.
+
+**Built-in audio preprocessing:**
+The system automatically preprocesses audio before diarization to improve accuracy:
+- ✅ Conversion to mono 16kHz (standard for speech models)
+- ✅ RMS volume normalization to -20 dBFS (prevents quiet sections from being misclassified)
+- ✅ Clipping prevention (avoids distortion)
+- 🔄 Optional noise reduction (requires the `noisereduce` library, installed separately)
+
+This preprocessing helps reduce false speaker clusters caused by:
+- Volume variations (one speaker at different distances from mic)
+- Background noise (can be classified as separate "speaker")
+- Audio quality inconsistencies
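
Review note: the normalization and clipping steps in this FAQ entry can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's `_preprocess_audio_for_diarization()`. The helper names `to_mono`, `rms`, and `normalize_rms` and the `peak_limit` of 0.99 are assumptions made for the example; samples are assumed to be floats in [-1.0, 1.0], and resampling to 16 kHz is left out.

```python
import math


def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_dbfs=-20.0, peak_limit=0.99):
    """Scale samples to the target RMS level, backing off if peaks would clip.

    peak_limit is a hypothetical safety margin, not a value from the project.
    """
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale

    target_rms = 10.0 ** (target_dbfs / 20.0)  # -20 dBFS -> 0.1 of full scale
    gain = target_rms / current

    # Clipping prevention: reduce the gain if the loudest sample would exceed the limit.
    peak = max(abs(s) for s in samples) * gain
    if peak > peak_limit:
        gain *= peak_limit / peak

    return [s * gain for s in samples]
```

The target follows from the dBFS definition: 10^(-20/20) = 0.1, so a normalized signal has an RMS of one tenth of full scale unless the peak limit forces a lower gain.
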
+
+---
+
 ## Troubleshooting
 
 ### “FFmpeg not found”
@@ -228,6 +267,68 @@ Ensure the desired model is pulled (`ollama pull qwen2.5:3b`).
 
 ---
 
+### Warning: "torchcodec is not installed correctly" (Speaker Diarization)
+
+**Message:**
+```
+UserWarning: torchcodec is not installed correctly so built-in audio decoding will fail.
+Could not load libtorchcodec... FFmpeg is not properly installed...
+We support versions 4, 5, 6 and 7.
+```
+
+**Cause:** FFmpeg 8.0 is installed, but pyannote's torchcodec expects FFmpeg 4-7.
+
+**Is this critical?** **No, this is safe to ignore.**
+
+The speaker diarization system has built-in fallback audio loaders:
+1. First tries `soundfile` (doesn't need FFmpeg)
+2. Falls back to `librosa` if needed
+3. Only uses direct file loading as a last resort
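
Review note: a fallback chain like the one listed here might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's loader: the function names are hypothetical, and returning `None` is just one way to tell the caller to fall back to handing the file path straight to the pipeline. `soundfile.read` and `librosa.load` are the real library calls; both return a `(samples, sample_rate)` pair.

```python
import logging

logger = logging.getLogger(__name__)


def load_with_soundfile(path):
    """First choice: soundfile decodes the file without needing FFmpeg."""
    import soundfile as sf  # imported lazily so a missing package only fails this loader

    return sf.read(path, dtype="float32")  # -> (samples, sample_rate)


def load_with_librosa(path):
    """Second choice: librosa, keeping the native sample rate and channels."""
    import librosa

    return librosa.load(path, sr=None, mono=False)  # -> (samples, sample_rate)


def load_audio(path, loaders=(load_with_soundfile, load_with_librosa)):
    """Try each loader in order and return the first successful result.

    Returns None when every loader fails, so the caller can pass the
    file path directly to the diarization pipeline as a last resort.
    """
    for loader in loaders:
        try:
            samples, sample_rate = loader(path)
            return samples, sample_rate
        except Exception as exc:  # broad on purpose: import and decode errors both mean "try the next one"
            logger.debug("%s failed for %s: %r", loader.__name__, path, exc)
    return None
```

Keeping the loaders as a tuple of callables makes the order explicit and lets tests substitute fake loaders without installing either audio library.
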
+
+**What happens:**
+- ✅ Speaker diarization works correctly
+- ✅ Audio is loaded via soundfile/librosa
+- ⚠️ Warning appears but can be ignored
+
+**If you want to suppress the warning:**
+
+Option 1: Keep FFmpeg 8 (recommended, everything works)
+```bash
+# Do nothing - the fallback works perfectly
+```
+
+Option 2: Downgrade to FFmpeg 7 (optional, only to remove warning)
+```bash
+# macOS
+brew uninstall ffmpeg
+brew install ffmpeg@7
+brew link ffmpeg@7
+```
+
+**Note:** Downgrading FFmpeg is unnecessary since the fallback mechanism works reliably.
+
+---
+
+### Warning: "std(): degrees of freedom is <= 0" (Speaker Diarization)
+
+**Message:**
+```
+UserWarning: std(): degrees of freedom is <= 0. Correction should be strictly less than...
+```
+
+**Cause:** Internal pyannote.audio calculation during speaker diarization.
+
+**Is this critical?** **No, this is safe to ignore.**
+
+This warning appears during normal operation of the speaker diarization pipeline and does not affect:
+- ✅ Accuracy of speaker detection
+- ✅ Quality of diarization results
+- ✅ Stability of the process
+
+**What to do:** Nothing - the process will complete successfully and identify speakers correctly.
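
Review note: if the goal is only to hide the two diarization warnings, Python's standard `warnings` filters are an alternative to changing the FFmpeg version. The sketch below is not part of this commit. The function name `silence_diarization_warnings` is hypothetical, and the patterns assume the messages start with the text quoted in this FAQ (optionally preceded by whitespace).

```python
import warnings

# Regular expressions matched against the start of each warning message.
IGNORED_DIARIZATION_WARNINGS = (
    r"\s*torchcodec is not installed correctly",
    r"\s*std\(\): degrees of freedom is <= 0",
)


def silence_diarization_warnings():
    """Install 'ignore' filters for the two known-harmless UserWarnings."""
    for pattern in IGNORED_DIARIZATION_WARNINGS:
        warnings.filterwarnings("ignore", message=pattern, category=UserWarning)
```

A filter only affects warnings raised after it is installed, so the call would need to run before the diarization libraries are imported if the torchcodec warning is emitted at import time. Other `UserWarning` messages are still shown.
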
+
+---
+
 ### Where are logs and outputs saved?
 
 - Transcripts & translations: `output/`

QUICKSTART.md

Lines changed: 3 additions & 3 deletions
@@ -9,7 +9,7 @@ Get up and running with YouTube Transcriber in five minutes.
 ```bash
 # Clone the repository
 git clone <repository-url>
-cd youtube-transcriber
+cd yt-transcriber
 
 # Create a virtual environment
 python -m venv venv
@@ -104,7 +104,7 @@ python -m src.main --url "URL" --transcribe whisper_base --translate NLLB
 ## 📁 Where to find results
 
 ```
-youtube-transcriber/
+yt-transcriber/
 ├── output/ # ← Processed documents
 │ ├── Video_Title.docx
 │ └── Video_Title.md
@@ -239,7 +239,7 @@ Speedups:
 ## 💬 Need help?
 
 - Check the [FAQ](FAQ.md)
-- Open an [issue on GitHub](https://github.com/yourusername/youtube-transcriber/issues)
+- Open an [issue on GitHub](https://github.com/yourusername/yt-transcriber/issues)
 - Reach out to the maintainers
 
 ---

README.md

Lines changed: 8 additions & 2 deletions
@@ -11,6 +11,7 @@ A flexible toolkit for transcribing and translating YouTube videos, audio files,
   - Works with both local Whisper and OpenAI API
   - Optimal speaker detection using VAD integration
   - Enable with `--speakers` flag
+  - ⚠️ Note: May over-segment speakers (one person → multiple labels); manual review recommended for critical use
 - **Enhanced logging with colored output**
   - Color-coded log levels for better visibility
   - WARNING messages in orange for important notices
@@ -84,7 +85,7 @@ A flexible toolkit for transcribing and translating YouTube videos, audio files,
 
 ```bash
 git clone <repository-url>
-cd youtube-transcriber
+cd yt-transcriber
 ```
 
 ### 2. Create a virtual environment
@@ -291,7 +292,7 @@ python -m src.main --help
 ## 📁 Project structure
 
 ```
-youtube-transcriber/
+yt-transcriber/
 ├── src/ # Source code
 │ ├── main.py # Entry point
 │ ├── config.py # Configuration
@@ -384,6 +385,11 @@ python -m src.main --url "..." --transcribe whisper_base
 - Confirm that GPU/MPS acceleration is active (see logs)
 - Close other resource-heavy applications
 
+**Safe to ignore:** Speaker diarization warnings
+- `UserWarning: torchcodec is not installed correctly`: audio loading uses the soundfile/librosa fallback (works correctly)
+- `UserWarning: std(): degrees of freedom is <= 0`: internal pyannote calculation (does not affect results)
+- See [FAQ.md](FAQ.md) for detailed explanations
+
 ## 🧪 Testing
 
 ```bash