Susurrus is a professional, modular audio transcription application that leverages various AI models and backends to convert speech to text. Built with a clean architecture, it supports multiple Whisper implementations, speaker diarization, and extensive customization options.
- Multiple Backend Support: mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper, Voxtral
- Flexible Input: Local files and URLs, including video URLs
- Audio Format Support: MP3, WAV, FLAC, M4A, AAC, OGG, OPUS, WebM, MP4, WMA
- Language Detection: Automatic or manual language selection
- Time-based Trimming: Transcribe specific portions of audio
- Word-level Timestamps: Precise timing information (backend-dependent)
- Multi-speaker Identification: Automatically detect and label different speakers
- Language-specific Models: Optimized models for English, German, Chinese, Spanish, Japanese
- Configurable Parameters: Set min/max speaker counts
- Multiple Output Formats: TXT, SRT, VTT, JSON with speaker labels
- PyAnnote.audio Integration: State-of-the-art diarization engine
- Voxtral Local: On-device inference with Mistral's speech model
- Voxtral API: Cloud-based inference via Mistral AI API
- 8-Language Support: EN, FR, ES, DE, IT, PT, PL, NL
- Long Audio Processing: Automatic chunking for files over 25 minutes
- Proxy Support: HTTP/SOCKS5 proxy for network requests
- Device Selection: Auto-detect or manually choose CPU/GPU/MPS
- Model Conversion: Automatic CTranslate2 model conversion
- Progress Tracking: Real-time progress with ETA estimation
- Settings Persistence: Save your preferences between sessions
- Dependency Management: Built-in installer for missing components
- CUDA Diagnostics: Detailed GPU/CUDA troubleshooting tools
# Clone the repository
git clone https://github.com/CrispStrobe/Susurrus.git
cd Susurrus
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the application
python main.py
# Or as a module:
python -m susurrus

- Python 3.8+
- FFmpeg (for audio format conversion)
- Git
- C++ compiler (for whisper.cpp, optional)
- CUDA Toolkit (for GPU acceleration, optional)
# Install Chocolatey (if not installed)
Set-ExecutionPolicy Bypass -Scope Process -Force
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
# Install dependencies
choco install cmake ffmpeg git python
# For GPU support
choco install cuda

# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install ffmpeg cmake python git
# For Apple Silicon optimization
pip install mlx mlx-whisper

# Install dependencies
sudo apt update
sudo apt install ffmpeg cmake build-essential python3 python3-pip git
# For GPU support
# Follow CUDA installation guide for your distribution

# MLX (Apple Silicon only)
pip install mlx-whisper
# Faster Whisper (recommended)
pip install faster-whisper
# Transformers
pip install transformers torch torchaudio
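To sanity-check the transformers stack after installing, here is a minimal sketch using the standard transformers ASR pipeline; the model id and options are illustrative and independent of Susurrus's own wiring:

# Transcribe with the generic transformers pipeline (requires FFmpeg for MP3 decoding)
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # illustrative checkpoint
    device=0 if torch.cuda.is_available() else -1,
)
result = asr("audio.mp3", return_timestamps=True)
print(result["text"])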
# Whisper.cpp (manual build required)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && mkdir build && cd build
cmake .. && make
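# After building, fetch a ggml model for whisper.cpp; the download script ships
# with the whisper.cpp repository ("base.en" is just an example model)
cd .. && bash ./models/download-ggml-model.sh base.en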
# CTranslate2
pip install ctranslate2
# Whisper-JAX
pip install whisper-jax
# Insanely Fast Whisper
pip install insanely-fast-whisper
# Voxtral (requires dev transformers)
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
pip install mistral-common[audio] soundfile

# Install pyannote.audio
pip install pyannote.audio
# Get Hugging Face token
# 1. Sign up at https://huggingface.co
# 2. Create token at https://huggingface.co/settings/tokens
# 3. Accept license at https://huggingface.co/pyannote/speaker-diarization
# Set token (choose one method):
# Method 1: Environment variable
export HF_TOKEN="your_token_here" # Linux/macOS
setx HF_TOKEN "your_token_here" # Windows
# Method 2: Config file
mkdir -p ~/.huggingface
echo "your_token_here" > ~/.huggingface/token
# Method 3: Enter in GUI
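With the token configured, you can sanity-check diarization against pyannote.audio directly. This uses the upstream pyannote API rather than Susurrus's DiarizationManager wrapper, so treat it as a minimal sketch:

# Direct pyannote.audio usage; the model name matches the license step above
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="your_token_here"
)
diarization = pipeline("audio.mp3")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")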
# Get Mistral API key from https://console.mistral.ai/
# Set API key (choose one method):
# Method 1: Environment variable
export MISTRAL_API_KEY="your_key_here" # Linux/macOS
setx MISTRAL_API_KEY "your_key_here" # Windows
# Method 2: Config file
mkdir -p ~/.mistral
echo "your_key_here" > ~/.mistral/api_key
# Method 3: Enter in GUI

# Start the application
python main.py
# Or as a module
python -m susurrus

Basic Workflow:
1. Select Audio Source: Choose a file or enter a URL
2. Choose Backend: Select the transcription engine
3. Configure Options: Set language, model, and device
4. Enable Diarization (optional): Identify speakers
5. Start Transcription: Click "Transcribe"
6. Save Results: Export to TXT, SRT, or VTT
python workers/transcribe_worker.py \
    --audio-input audio.mp3 \
    --backend faster-batched \
    --model-id large-v3 \
    --language en \
    --device auto

python workers/diarize_worker.py \
    --audio-input audio.mp3 \
    --hf-token YOUR_TOKEN \
    --transcribe \
    --model-id base \
    --backend faster-batched \
    --output-formats txt,srt,vtt

# Transcription backend example
from workers.transcription.backends import get_backend

backend = get_backend(
    'faster-batched',
    model_id='large-v3',
    device='auto',
    language='en'
)

for start, end, text in backend.transcribe('audio.mp3'):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")

# Diarization example
from backends.diarization import DiarizationManager

manager = DiarizationManager(hf_token="YOUR_TOKEN")
segments, files = manager.diarize_and_split('audio.mp3')

for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")

susurrus/
├── main.py                          # Application entry point
├── config.py                        # Central configuration
├── backends/                        # Transcription & diarization backends
│   ├── diarization/                 # Speaker diarization module
│   │   ├── manager.py               # Diarization orchestration
│   │   └── progress.py              # Enhanced progress tracking
│   └── transcription/               # Transcription backends
│       ├── voxtral_local.py         # Voxtral local inference
│       └── voxtral_api.py           # Voxtral API integration
├── gui/                             # User interface components
│   ├── main_window.py               # Main application window
│   ├── widgets/                     # Custom widgets
│   │   ├── collapsible_box.py
│   │   ├── diarization_settings.py
│   │   ├── voxtral_settings.py
│   │   └── advanced_options.py
│   └── dialogs/                     # Dialog windows
│       ├── dependencies_dialog.py
│       ├── installer_dialog.py
│       └── cuda_diagnostics_dialog.py
├── workers/                         # Background processing
│   ├── transcription_thread.py      # GUI thread wrapper
│   ├── transcribe_worker.py         # Standalone transcription worker
│   ├── diarize_worker.py            # Standalone diarization worker
│   └── transcription/               # Transcription backend implementations
│       ├── backends/
│       │   ├── base.py              # Base backend interface
│       │   ├── mlx_backend.py
│       │   ├── faster_whisper_backend.py
│       │   ├── transformers_backend.py
│       │   ├── whisper_cpp_backend.py
│       │   ├── ctranslate2_backend.py
│       │   ├── whisper_jax_backend.py
│       │   ├── insanely_fast_backend.py
│       │   ├── openai_whisper_backend.py
│       │   └── voxtral_backend.py
│       └── utils.py
├── utils/                           # Utility modules
│   ├── device_detection.py          # CUDA/MPS/CPU detection
│   ├── audio_utils.py               # Audio processing utilities
│   ├── download_utils.py            # URL downloading
│   ├── dependency_check.py          # Dependency verification
│   └── format_utils.py              # Time formatting utilities
├── models/                          # Model configuration
│   └── model_config.py              # Model mappings & utilities
└── scripts/                         # Standalone utility scripts
    ├── test_voxtral.py              # Voxtral testing
    └── pyannote_torch26.py          # PyTorch 2.6+ compatibility
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_backends.py
# Run with coverage
pytest --cov=. --cov-report=html

# Format code
black .
# Lint
flake8 .
pylint susurrus/
# Type checking
mypy .

- Create a new file in workers/transcription/backends/
- Inherit from TranscriptionBackend
- Implement the required methods:

class MyBackend(TranscriptionBackend):
    def transcribe(self, audio_path):
        # Yield (start, end, text) tuples
        pass

    def preprocess_audio(self, audio_path):
        # Optional preprocessing
        return audio_path

    def cleanup(self):
        # Optional cleanup
        pass
- Register in workers/transcription/backends/__init__.py
- Add to BACKEND_MODEL_MAP in config.py (both steps are sketched below)
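The registry internals are Susurrus-specific, so the following is only a hypothetical sketch of those two steps; BACKEND_MODEL_MAP is named above, but the entry format and the BACKENDS dictionary are assumptions:

# workers/transcription/backends/__init__.py (hypothetical registry shape)
from .my_backend import MyBackend

BACKENDS = {
    # ... existing backends ...
    "my-backend": MyBackend,
}

# config.py (hypothetical entry mapping the backend to its selectable models)
BACKEND_MODEL_MAP = {
    # ... existing entries ...
    "my-backend": ["base", "large-v3"],
}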
- Windows: %APPDATA%\Susurrus\AudioTranscription.ini
- macOS: ~/Library/Preferences/com.Susurrus.AudioTranscription.plist
- Linux: ~/.config/Susurrus/AudioTranscription.conf
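These locations match Qt's QSettings defaults, which suggests a settings object keyed by organization "Susurrus" and application "AudioTranscription". A minimal sketch, assuming PyQt6 (the binding and storage format are assumptions):

# Read/write the same store programmatically (PyQt6 assumed)
from PyQt6.QtCore import QSettings

settings = QSettings("Susurrus", "AudioTranscription")
settings.setValue("last_backend", "faster-batched")
print(settings.value("last_backend"))
print(settings.fileName())  # resolves to the platform-specific path above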
- HF_TOKEN: Hugging Face API token (diarization)
- MISTRAL_API_KEY: Mistral AI API key (Voxtral)
- PYTORCH_MPS_HIGH_WATERMARK_RATIO: MPS memory optimization
- CUDA_VISIBLE_DEVICES: GPU selection (see the example below)
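CUDA_VISIBLE_DEVICES in particular must be set before CUDA initializes, so set it in the shell or at the very top of the process; a common pattern with an illustrative value:

# Pin the process to GPU index 1; must happen before the first CUDA call
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # now reports only the visible device(s)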
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"# Use MLX backend for best performance
pip install mlx-whisper
# Or use MPS device with other backends
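# Verify MPS availability (standard PyTorch check)
python -c "import torch; print(torch.backends.mps.is_available())"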
# Will auto-detect in GUI

- Use smaller models for limited RAM
- Enable chunking for long audio files
- Use the faster-batched backend with an appropriate batch size (see the sketch below)
- Close other applications during processing
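As a concrete illustration of the batch-size knob, here is a minimal sketch using the faster-whisper library directly (the upstream API, not Susurrus's faster-batched wrapper; model, compute type, and batch size are illustrative):

# Batched inference with faster-whisper; tune batch_size to your hardware
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="auto", compute_type="int8")
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe("audio.mp3", batch_size=8)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")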
"No module named 'X'"
pip install XFFmpeg not found
# Verify installation
ffmpeg -version
# Add to PATH if needed (Windows)
setx PATH "%PATH%;C:\path\to\ffmpeg\bin"

CUDA errors
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Use Tools > CUDA Diagnostics in GUI for detailed info

Diarization authentication fails
# Verify token
python -c "from huggingface_hub import HfApi; HfApi().whoami(token='YOUR_TOKEN')"
# Accept license
# Visit: https://huggingface.co/pyannote/speaker-diarization

PyTorch 2.6+ compatibility issues
# Run the compatibility script
python scripts/pyannote_torch26.py
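For background: PyTorch 2.6 flipped torch.load's default to weights_only=True, which rejects the pickled classes inside pyannote checkpoints. A shim like the script above typically allowlists those classes before loading; the class below is only an example of the pattern, not necessarily what the script does:

# Illustrative allowlisting; the exact classes depend on the checkpoint
import torch
from pyannote.audio.core.task import Specifications

torch.serialization.add_safe_globals([Specifications])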
# Fork and clone
git clone https://github.com/YOUR_USERNAME/Susurrus.git
cd Susurrus
# Create feature branch
git checkout -b feature-name
# Install dev dependencies
pip install -r requirements-dev.txt
# Make changes and test
pytest tests/
# Submit PR

- OpenAI Whisper - Original Whisper model
- MLX - Apple Silicon acceleration
- Faster Whisper - Optimized inference
- PyAnnote.audio - Speaker diarization
- Mistral AI - Voxtral model
- Hugging Face - Model hosting and transformers
- yt-dlp - For URL downloading