A powerful toolkit for automated video transcription and subtitle translation, designed for batch processing educational content and lecture videos.
## Features

- **Batch Video Transcription**: Extract audio and transcribe using whisper.cpp
  - Automatic audio extraction via ffmpeg (piped, no temp files; see the sketch after this list)
  - Voice Activity Detection (VAD) for improved accuracy
  - Smart segmentation to prevent hallucinations on long videos
  - Pipeline parallelism: transcribe while translating
- **Intelligent Subtitle Translation**: Translate SRT files using OpenAI-compatible APIs
  - Context-aware batch translation
  - Support for multiple AI providers (NVIDIA, Azure OpenAI)
  - Preserves timing and formatting
  - High concurrency for fast processing
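The piped extraction mentioned above can be pictured as follows. This is a minimal sketch, assuming a hypothetical `extract_audio_wav` helper rather than the project's actual code:

```python
import subprocess

def extract_audio_wav(video_path: str) -> bytes:
    """Decode a video's audio track to 16 kHz mono WAV entirely in memory.

    Illustrative sketch: ffmpeg writes to stdout (pipe:1), so no
    temporary audio file ever touches disk.
    """
    cmd = [
        "ffmpeg",
        "-i", video_path,   # input video
        "-vn",              # drop the video stream
        "-ac", "1",         # mono
        "-ar", "16000",     # 16 kHz, the rate whisper.cpp expects
        "-f", "wav",        # WAV container
        "pipe:1",           # write to stdout instead of a file
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, check=True)
    return result.stdout
```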
## Requirements

- Python 3.7+
- whisper.cpp, built so that the `whisper-cli` binary is available
- ffmpeg and ffprobe
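If you want to verify the external tools are on your `PATH` before running anything, a quick pre-flight check (illustrative only, not part of the toolkit) looks like:

```python
import shutil

# Confirm the external binaries the toolkit shells out to are installed.
for tool in ("ffmpeg", "ffprobe"):
    if shutil.which(tool) is None:
        raise SystemExit(f"{tool} not found on PATH; please install it first")
```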
## Installation

```bash
pip install openai
```

Download whisper models from the whisper.cpp models directory:

- Speech recognition model (e.g., `ggml-large-v2.bin`)
- VAD model (optional but recommended): `ggml-silero-v5.1.2.bin`
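Under the hood, these models end up on a `whisper-cli` command line. A rough sketch of such an invocation (illustrative, not the script's exact command; the VAD flags are recent whisper.cpp additions, so verify them against `whisper-cli --help`):

```python
import os
import subprocess

WHISPER = os.path.expanduser("~/Development/whisper.cpp/build/bin/whisper-cli")
MODEL = os.path.expanduser("~/Development/whisper.cpp/models/ggml-large-v2.bin")
VAD = os.path.expanduser("~/Development/whisper.cpp/models/ggml-silero-v5.1.2.bin")

# -osrt asks whisper-cli to write SRT output next to the input file;
# --vad/--vad-model enable Silero VAD in recent whisper.cpp builds.
subprocess.run(
    [WHISPER, "-m", MODEL, "-f", "lecture.wav", "-osrt",
     "--vad", "--vad-model", VAD],
    check=True,
)
```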
## Quick Start

Transcribe and translate videos:

```bash
# Basic usage: transcribe all videos in a directory and translate to Chinese
python batch_transcribe.py "path/to/videos" --target zh

# With custom settings
python batch_transcribe.py "path/to/videos" --target zh \
    --segment-duration 10 \
    --max-workers 2

# Process specific files with pattern matching
python batch_transcribe.py "path/to/videos" --target zh --pattern "^01_.*"
```

Translate an existing SRT file:

```bash
# Using NVIDIA API (default)
export NV_API_KEY="your-api-key"
python translate_srt.py input.srt --target zh

# Using Azure OpenAI
export AZURE_OPENAI_API_KEY="your-api-key"
python translate_srt.py input.srt --target zh --api-provider azure
```

## Usage

### batch_transcribe.py

Batch process video files with transcription and translation:
```bash
python batch_transcribe.py <video_directory> [options]
```

Key Options:

- `--target, -t`: Target language code (`zh`, `en`, `ja`, etc.; required)
- `--segment-duration, -s`: Split videos into N-minute segments (default: 16)
- `--max-workers, -j`: Number of videos to process in parallel (default: 2)
- `--max-transcribe`: Maximum concurrent transcriptions (default: 1)
- `--pattern, -p`: Regex pattern to filter video files
- `--whisper-bin`: Path to the whisper-cli binary
- `--whisper-model`: Path to the whisper model file
- `--no-skip-existing`: Force reprocessing of existing files
Examples:
```bash
# Process with 10-minute segments for better accuracy
python batch_transcribe.py "Videos" --target zh -s 10

# High concurrency: 6 workers, 2 transcribing simultaneously
python batch_transcribe.py "Videos" --target en --max-workers 6 --max-transcribe 2

# Process only files containing "火山" in the name
python batch_transcribe.py "Videos" --target zh --pattern "火山"
```

### translate_srt.py

Translate SRT subtitle files:
```bash
python translate_srt.py <input.srt> --target <language> [options]
```

Key Options:

- `--target, -t`: Target language code (required)
- `--output, -o`: Output file path (default: `input.{target}.srt`)
- `--api-provider, -p`: API provider, `nv` (NVIDIA) or `azure` (default: `nv`)
- `--model`: Model name (default: `deepseek-ai/deepseek-v3.1`)
- `--batch-size`: Subtitles per batch (default: 10)
- `--context-size`: Context blocks before/after each batch (default: 1; see the sketch below)
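To make `--batch-size` and `--context-size` concrete, here is a minimal sketch of the batching idea, assuming plain lists of subtitle texts (illustrative only, not the script's actual implementation):

```python
from typing import List

def build_batches(lines: List[str], batch_size: int = 10,
                  context_size: int = 1) -> List[dict]:
    """Group subtitles into batches of `batch_size`, attaching
    `context_size` neighboring blocks on each side as read-only
    context for the model. Illustrative sketch only."""
    batches = []
    for start in range(0, len(lines), batch_size):
        end = min(start + batch_size, len(lines))
        batches.append({
            "before": lines[max(0, start - context_size):start],
            "translate": lines[start:end],
            "after": lines[end:end + context_size],
        })
    return batches
```

Each request carries the `before`/`after` blocks as context while only the `translate` block is rewritten, so terminology and pronouns can stay consistent across batch boundaries.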
Examples:
```bash
# Translate to English with custom output
python translate_srt.py video.srt --target en --output video_english.srt

# Use Azure OpenAI with larger batches
python translate_srt.py video.srt --target zh \
    --api-provider azure \
    --batch-size 20
```

## Configuration

Default paths can be overridden with command-line arguments:
```python
DEFAULT_WHISPER_BIN = "~/Development/whisper.cpp/build/bin/whisper-cli"
DEFAULT_WHISPER_MODEL = "~/Development/whisper.cpp/models/ggml-large-v2.bin"
DEFAULT_VAD_MODEL = "~/Development/whisper.cpp/models/ggml-silero-v5.1.2.bin"
```

For `translate_srt.py`, set the API key for your chosen provider:

- `NV_API_KEY`: NVIDIA API key (when using `--api-provider nv`)
- `AZURE_OPENAI_API_KEY`: Azure OpenAI key (when using `--api-provider azure`)
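For illustration, an OpenAI-compatible client can be pointed at either provider roughly like this. The NVIDIA base URL and the Azure endpoint/version below are assumptions and placeholders, not values taken from the scripts:

```python
import os
from openai import AzureOpenAI, OpenAI

def make_client(provider: str = "nv"):
    """Illustrative client factory; endpoint details are assumptions."""
    if provider == "nv":
        # Assumed NVIDIA OpenAI-compatible endpoint; adjust for your account.
        return OpenAI(
            base_url="https://integrate.api.nvidia.com/v1",
            api_key=os.environ["NV_API_KEY"],
        )
    # Azure needs an endpoint and API version in addition to the key
    # (placeholder values shown here).
    return AzureOpenAI(
        azure_endpoint="https://your-resource.openai.azure.com",
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )
```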
## Supported Languages

Translation supports common languages via OpenAI-compatible APIs:

- `zh`: Simplified Chinese
- `en`: English
- `ja`: Japanese
- `ko`: Korean
- `es`: Spanish
- `fr`: French
- `de`: German
- ...and more
## How It Works

### Pipeline parallelism

The toolkit uses intelligent parallelism:

- Transcription (CPU-intensive): serialized by default to avoid CPU contention
- Translation (I/O-intensive): high concurrency for API calls

This allows video A to transcribe while video B translates, maximizing throughput.
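A minimal sketch of this scheduling idea (the names and `time.sleep` stand-ins are illustrative, not the script's actual code):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# At most one transcription at a time (cf. --max-transcribe), while
# translations from other videos can overlap with it.
transcribe_slots = threading.Semaphore(1)

def transcribe(path: str) -> str:
    time.sleep(2)                    # stand-in for the CPU-bound whisper run
    return path + ".srt"

def translate(srt_path: str) -> None:
    time.sleep(1)                    # stand-in for concurrent API calls

def process_video(path: str) -> None:
    with transcribe_slots:           # serialize the CPU-heavy stage
        srt_path = transcribe(path)
    translate(srt_path)              # I/O stage runs outside the lock

with ThreadPoolExecutor(max_workers=2) as pool:  # cf. --max-workers
    list(pool.map(process_video, ["a.mp4", "b.mp4", "c.mp4"]))
```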
### Smart segmentation

Long videos are automatically split into segments:

- Prevents the hallucinations whisper models are prone to on long inputs
- Default: 16-minute segments
- Each segment is transcribed independently, then merged with correct timestamps (see the sketch below)
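The merge step is essentially a per-segment timestamp shift. A minimal sketch of that arithmetic (illustrative; a real merge would also renumber subtitle blocks and parse whole files):

```python
from datetime import timedelta

def shift_timestamp(ts: str, offset_seconds: float) -> str:
    """Shift one SRT timestamp ("HH:MM:SS,mmm") by a segment's start offset."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    total = timedelta(hours=int(h), minutes=int(m),
                      seconds=int(s), milliseconds=int(ms))
    total += timedelta(seconds=offset_seconds)
    total_ms = int(total.total_seconds() * 1000)
    h2, rem = divmod(total_ms, 3_600_000)
    m2, rem = divmod(rem, 60_000)
    s2, ms2 = divmod(rem, 1000)
    return f"{h2:02d}:{m2:02d}:{s2:02d},{ms2:03d}"

# Subtitles from the segment starting at the 16-minute mark shift forward:
print(shift_timestamp("00:01:02,500", 16 * 60))  # -> 00:17:02,500
```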
## Output

For a video `lecture.mp4`:

```
lecture.mp4      # Original video
lecture.srt      # Transcribed subtitles (original language)
lecture.zh.srt   # Translated subtitles (Chinese)
```
## License

This project is provided as-is for personal and educational use.