
Commit ba1d832 ("tuning")

1 parent: e8d23e0

10 files changed: +2143 −714 lines (README.md and ROADMAP.md shown below)

README.md

Lines changed: 43 additions & 4 deletions
@@ -4,7 +4,21 @@ A flexible toolkit for transcribing and translating YouTube videos, audio files,

 ## 🎯 Highlights

-### Version 1.4 (current)
+### Version 1.5 (current)
+- **Speaker Diarization**
+- Automatic speaker identification using pyannote.audio
+- Speaker labels in transcripts ([SPEAKER_00], [SPEAKER_01], etc.)
+- Works with both local Whisper and OpenAI API
+- Optimal speaker detection using VAD integration
+- Enable with `--speakers` flag
+- **Enhanced logging with colored output**
+- Color-coded log levels for better visibility
+- WARNING messages in orange for important notices
+- INFO messages in green for successful operations
+- ERROR/CRITICAL messages in red for failures
+- Smart warnings (e.g., missing Whisper prompt suggestions)
+
+### Version 1.4
 - **Video file support**
 - Process local video files (MP4, MKV, AVI, MOV, etc.)
 - Automatic audio extraction using FFmpeg
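The colored-output highlight above describes behavior that is usually implemented with a custom `logging.Formatter` wrapping level names in ANSI escape codes. A minimal illustrative sketch follows; the class name and color codes are assumptions, not this repo's actual logger (a true "orange" WARNING would likely use a 256-color code such as `\033[38;5;208m`):

```python
import logging

# ANSI colors per level -- hypothetical palette for illustration.
COLORS = {
    "INFO": "\033[32m",      # green
    "WARNING": "\033[33m",   # yellow/orange
    "ERROR": "\033[31m",     # red
    "CRITICAL": "\033[31m",  # red
}
RESET = "\033[0m"

class ColorFormatter(logging.Formatter):
    """Wrap the level name in an ANSI color before normal formatting."""
    def format(self, record: logging.LogRecord) -> str:
        color = COLORS.get(record.levelname, "")
        record.levelname = f"{color}{record.levelname}{RESET}"
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(ColorFormatter("%(levelname)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("transcription finished")      # prints in green
logging.warning("no Whisper prompt given")  # prints in yellow/orange
```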
@@ -45,9 +59,8 @@ A flexible toolkit for transcribing and translating YouTube videos, audio files,
 - ✅ Apple M1/M2 optimisations

 ### In progress
-- 🔄 Whisper via OpenAI API
-- 🔄 Translation via OpenAI API
-- 🔄 Speaker diarisation
+- 🔄 Optimized chunk processing for OpenAI API
+- 🔄 Batch processing support
 - 🔄 Docker support

 ## 📋 Requirements
@@ -188,6 +201,32 @@ Produces two documents:
 python -m src.main --url "https://youtube.com/watch?v=YOUR_VIDEO_ID" --transcribe whisper_base --prompt prompt.txt
 ```

+#### 7. Enable speaker diarization (v1.5)
+
+```bash
+# Transcribe with automatic speaker identification
+python -m src.main \
+    --url "https://youtube.com/watch?v=YOUR_VIDEO_ID" \
+    --transcribe whisper_medium \
+    --speakers
+```
+
+**Requirements for speaker diarization:**
+1. Get HuggingFace token: https://huggingface.co/settings/tokens (create a "Read" token)
+2. Accept model terms for all required models:
+   - https://huggingface.co/pyannote/speaker-diarization-3.1
+   - https://huggingface.co/pyannote/segmentation-3.0
+   - https://huggingface.co/pyannote/speaker-diarization-community-1
+   - https://huggingface.co/pyannote/voice-activity-detection (optional, for better chunking)
+3. Set token in environment: `export HF_TOKEN=your_token_here` (add to `~/.zshrc` or `~/.bashrc`)
+
+Output will include speaker labels:
+```
+[00:00] [SPEAKER_00] Hello everyone, welcome to the show
+[00:05] [SPEAKER_01] Thanks for having me
+[00:08] [SPEAKER_00] Let's get started with today's topic
+```
+
 ## ⚖️ Legal notice
 - Make sure you respect YouTube Terms of Service and copyright law before downloading or processing any content. Only use the tool for media you own or have explicit permission to process.
 - Output documents and logs may contain fragments of the original content. Store them locally and review licences before sharing.
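For context on what the `--speakers` flag drives, here is a minimal sketch of invoking pyannote.audio's diarization pipeline directly, using the model name and HF_TOKEN setup from the README hunk above. The audio filename is a placeholder, and this is not the repo's own code:

```python
import os

from pyannote.audio import Pipeline

# Requires HF_TOKEN and accepted model terms, as the README describes.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)

# "audio.wav" is a hypothetical path for illustration.
diarization = pipeline("audio.wav")

# Each turn is a time span attributed to one anonymous speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:06.1f}-{turn.end:06.1f}] {speaker}")  # e.g. SPEAKER_00
```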

ROADMAP.md

Lines changed: 32 additions & 31 deletions
@@ -2,6 +2,32 @@

 ## Recently Completed

+### ✅ Speaker Diarization (v1.5.0)
+**Status:** Implemented and committed (2025-10-13)
+
+**What was done:**
+- Implemented speaker identification using pyannote.audio
+- Added `_perform_speaker_diarization()` method in Transcriber
+- Integrated speaker labels with TranscriptionSegment
+- Updated document_writer to format speaker labels in output
+- Works with both local Whisper and OpenAI Whisper API
+- Graceful fallback when HF_TOKEN not available
+- Added comprehensive test suite (8 tests)
+
+**Technical details:**
+- Uses pyannote/speaker-diarization-3.1 model
+- Assigns speakers based on maximum overlap with speech segments
+- Seamlessly integrates with existing VAD infrastructure
+- Speaker labels automatically included in DOCX and Markdown outputs
+- Enable with `--speakers` CLI flag
+
+**Benefits:**
+- Better readability for multi-speaker content (interviews, podcasts)
+- Professional quality output with clear speaker attribution
+- Foundation for future speaker name mapping
+
+---
+
 ### ✅ VAD-based Intelligent Audio Chunking (v1.4.0)
 **Status:** Implemented and committed (2025-10-12)
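The "maximum overlap" rule in the technical details above can be made concrete with a small helper: for each transcription segment, sum its overlap with every diarization turn per speaker and pick the speaker with the largest total. This is a hypothetical sketch, not the repo's `_perform_speaker_diarization()`:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    start: float   # seconds
    end: float
    speaker: str   # e.g. "SPEAKER_00"

def assign_speaker(seg_start: float, seg_end: float, turns: list[Turn]) -> str | None:
    """Pick the diarization speaker whose turns overlap this
    transcription segment the most (None if nothing overlaps)."""
    totals: dict[str, float] = {}
    for t in turns:
        overlap = min(seg_end, t.end) - max(seg_start, t.start)
        if overlap > 0:
            totals[t.speaker] = totals.get(t.speaker, 0.0) + overlap
    return max(totals, key=totals.get) if totals else None

# A Whisper segment spanning 3.0-7.5 s mostly overlaps SPEAKER_01's turn:
turns = [Turn(0.0, 4.0, "SPEAKER_00"), Turn(4.0, 9.0, "SPEAKER_01")]
print(assign_speaker(3.0, 7.5, turns))  # SPEAKER_01
```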

@@ -23,34 +49,8 @@

 ### 🎯 High Priority

-#### 1. Speaker Diarization
-**Target:** v1.5.0
-**Dependencies:** HF_TOKEN, pyannote.audio (already installed)
-
-**Description:**
-Implement speaker identification and labeling in transcriptions using pyannote.audio.
-
-**Implementation plan:**
-- Add `_perform_speaker_diarization()` method using `pyannote/speaker-diarization-3.1`
-- Integrate speaker labels with TranscriptionSegment (already has speaker field)
-- Match diarization timestamps with Whisper transcription segments
-- Add `--with-speakers` CLI flag functionality
-- Update document output to include speaker labels (e.g., "[Speaker 1]:")
-- Reuse VAD data from chunking to improve diarization accuracy
-
-**Benefits:**
-- Better readability for multi-speaker content (interviews, podcasts)
-- Synergy with existing VAD implementation
-- Professional quality output
-
-**Requirements:**
-- HuggingFace token with access to pyannote/speaker-diarization-3.1
-- Accept terms: https://huggingface.co/pyannote/speaker-diarization-3.1
-
----
-
-#### 2. Optimized Chunk Processing for OpenAI API
-**Target:** v1.4.1
+#### 1. Optimized Chunk Processing for OpenAI API
+**Target:** v1.5.1
 **Dependencies:** None

 **Description:**
@@ -70,8 +70,8 @@ Improve processing efficiency when handling chunked audio files.

 ---

-#### 3. Batch Processing Support
-**Target:** v1.5.0
+#### 2. Batch Processing Support
+**Target:** v1.6.0
 **Dependencies:** None

 **Description:**
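The chunk-processing item is only a heading and a one-line description at this point. One plausible shape of such an optimization, sketched here purely under stated assumptions (the openai Python SDK, hypothetical chunk paths from the VAD-based chunker, and a guessed concurrency level), is transcribing chunks concurrently rather than serially:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_chunk(path: Path) -> str:
    # whisper-1 is the hosted Whisper model; each uploaded chunk
    # must stay under the API's 25 MB file-size limit.
    with path.open("rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

# Hypothetical chunk files produced by the VAD-based chunker.
chunks = sorted(Path("chunks").glob("chunk_*.mp3"))
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(transcribe_chunk, chunks))
print("\n".join(texts))
```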
@@ -182,6 +182,7 @@ If you want to work on any of these features:

 ## Version History

+- **v1.5.0** (2025-10-13): Speaker diarization
 - **v1.4.0** (2025-10-12): VAD-based intelligent chunking
 - **v1.3.0** (2025-10-XX): OpenAI API integration (Whisper + GPT)
 - **v1.2.0** (2025-XX-XX): NLLB translation support
@@ -190,5 +191,5 @@ If you want to work on any of these features:

 ---

-**Last updated:** 2025-10-12
+**Last updated:** 2025-10-13
 **Maintainer:** @biyachuev
