Speech-to-text GUI for various models and backends (mostly Whisper, also Voxtral), including whisper.cpp, mlx-whisper, faster-whisper, and CTranslate2; uses pyannote for speaker diarization.


Susurrus: Audio Transcription Suite

Susurrus is a professional, modular audio transcription application that leverages various AI models and backends to convert speech to text. Built with a clean architecture, it supports multiple Whisper implementations, speaker diarization, and extensive customization options.

✨ Features

Core Transcription

  • Multiple Backend Support: mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper, Voxtral
  • Flexible Input: Local files and URLs, including video sources
  • Audio Format Support: MP3, WAV, FLAC, M4A, AAC, OGG, OPUS, WebM, MP4, WMA
  • Language Detection: Automatic or manual language selection
  • Time-based Trimming: Transcribe specific portions of audio
  • Word-level Timestamps: Precise timing information (backend-dependent)

Speaker Diarization

  • Multi-speaker Identification: Automatically detect and label different speakers
  • Language-specific Models: Optimized models for English, German, Chinese, Spanish, Japanese
  • Configurable Parameters: Set min/max speaker counts
  • Multiple Output Formats: TXT, SRT, VTT, JSON with speaker labels
  • PyAnnote.audio Integration: State-of-the-art diarization engine
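To illustrate the labeled output, here is a minimal sketch of rendering diarized segments as SRT with speaker labels. The `(start, end, speaker, text)` tuple layout and the function names are assumptions for illustration, not Susurrus's internal format:

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render (start, end, speaker, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([(0.0, 2.5, "SPEAKER_00", "Hello there.")]))
```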

Voxtral Support (New!)

  • Voxtral Local: On-device inference with Mistral's speech model
  • Voxtral API: Cloud-based inference via Mistral AI API
  • 8 Supported Languages: EN, FR, ES, DE, IT, PT, PL, NL
  • Long Audio Processing: Automatic chunking for files over 25 minutes
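The chunking idea amounts to simple span arithmetic: cut the timeline into windows under the limit, with a small overlap so words at a boundary are not lost. `chunk_spans` and the 5-second overlap below are a hypothetical sketch of the approach, not Susurrus's actual implementation:

```python
def chunk_spans(total_seconds, chunk_seconds=25 * 60, overlap_seconds=5.0):
    """Yield (start, end) spans covering the audio in windows of at most
    chunk_seconds, with consecutive windows overlapping by overlap_seconds."""
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start = end - overlap_seconds
```

Each chunk is transcribed separately and the texts are stitched back together in order.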

Advanced Features

  • Proxy Support: HTTP/SOCKS5 proxy for network requests
  • Device Selection: Auto-detect or manually choose CPU/GPU/MPS
  • Model Conversion: Automatic CTranslate2 model conversion
  • Progress Tracking: Real-time progress with ETA estimation
  • Settings Persistence: Save your preferences between sessions
  • Dependency Management: Built-in installer for missing components
  • CUDA Diagnostics: Detailed GPU/CUDA troubleshooting tools
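ETA estimation of the kind mentioned above usually extrapolates remaining time from elapsed time and the fraction of work done. A naive sketch (the `EtaTracker` class is illustrative, not Susurrus's implementation):

```python
import time
from typing import Optional

class EtaTracker:
    """Naive ETA: extrapolate remaining time from elapsed time and progress."""
    def __init__(self):
        self.start = time.monotonic()

    def eta_seconds(self, fraction_done: float) -> Optional[float]:
        if fraction_done <= 0:
            return None  # no progress yet, no estimate possible
        elapsed = time.monotonic() - self.start
        return elapsed * (1.0 - fraction_done) / fraction_done
```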

πŸ“¦ Installation

Quick Start

# Clone the repository
git clone https://github.com/CrispStrobe/Susurrus.git
cd Susurrus

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the application
python main.py
# Or as a module:
python -m susurrus

Prerequisites

  • Python 3.8+
  • FFmpeg (for audio format conversion)
  • Git
  • C++ compiler (for whisper.cpp, optional)
  • CUDA Toolkit (for GPU acceleration, optional)

Platform-Specific Setup

Windows

# Install Chocolatey (if not installed)
Set-ExecutionPolicy Bypass -Scope Process -Force
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# Install dependencies
choco install cmake ffmpeg git python

# For GPU support
choco install cuda

macOS

# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install dependencies
brew install ffmpeg cmake python git

# For Apple Silicon optimization
pip install mlx mlx-whisper

Linux (Ubuntu/Debian)

# Install dependencies
sudo apt update
sudo apt install ffmpeg cmake build-essential python3 python3-pip git

# For GPU support
# Follow CUDA installation guide for your distribution

Optional Backend Installation

# MLX (Apple Silicon only)
pip install mlx-whisper

# Faster Whisper (recommended)
pip install faster-whisper

# Transformers
pip install transformers torch torchaudio

# Whisper.cpp (manual build required)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && mkdir build && cd build
cmake .. && make

# CTranslate2
pip install ctranslate2

# Whisper-JAX
pip install whisper-jax

# Insanely Fast Whisper
pip install insanely-fast-whisper

# Voxtral (requires dev transformers)
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
pip install mistral-common[audio] soundfile

Speaker Diarization Setup

# Install pyannote.audio
pip install pyannote.audio

# Get Hugging Face token
# 1. Sign up at https://huggingface.co
# 2. Create token at https://huggingface.co/settings/tokens
# 3. Accept license at https://huggingface.co/pyannote/speaker-diarization

# Set token (choose one method):
# Method 1: Environment variable
export HF_TOKEN="your_token_here"  # Linux/macOS
setx HF_TOKEN "your_token_here"    # Windows

# Method 2: Config file
mkdir -p ~/.huggingface
echo "your_token_here" > ~/.huggingface/token

# Method 3: Enter in GUI

Voxtral API Setup

# Get Mistral API key from https://console.mistral.ai/

# Set API key (choose one method):
# Method 1: Environment variable
export MISTRAL_API_KEY="your_key_here"  # Linux/macOS
setx MISTRAL_API_KEY "your_key_here"    # Windows

# Method 2: Config file
mkdir -p ~/.mistral
echo "your_key_here" > ~/.mistral/api_key

# Method 3: Enter in GUI

πŸš€ Usage

GUI Application

# Start the application
python main.py

# Or as a module
python -m susurrus

Basic Workflow:

  1. Select Audio Source: Choose file or enter URL
  2. Choose Backend: Select transcription engine
  3. Configure Options: Set language, model, device
  4. Enable Diarization (optional): Identify speakers
  5. Start Transcription: Click "Transcribe"
  6. Save Results: Export to TXT, SRT, or VTT

Command Line Workers

Transcription Worker

python workers/transcribe_worker.py \
  --audio-input audio.mp3 \
  --backend faster-batched \
  --model-id large-v3 \
  --language en \
  --device auto

Diarization Worker

python workers/diarize_worker.py \
  --audio-input audio.mp3 \
  --hf-token YOUR_TOKEN \
  --transcribe \
  --model-id base \
  --backend faster-batched \
  --output-formats txt,srt,vtt

Python API

# Transcription backend example
from workers.transcription.backends import get_backend

backend = get_backend(
    'faster-batched',
    model_id='large-v3',
    device='auto',
    language='en'
)

for start, end, text in backend.transcribe('audio.mp3'):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")

# Diarization example
from backends.diarization import DiarizationManager

manager = DiarizationManager(hf_token="YOUR_TOKEN")
segments, files = manager.diarize_and_split('audio.mp3')

for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")

πŸ§ͺ Development

Architecture Overview

susurrus/
β”œβ”€β”€ main.py                    # Application entry point
β”œβ”€β”€ config.py                  # Central configuration
β”œβ”€β”€ backends/                  # Transcription & diarization backends
β”‚   β”œβ”€β”€ diarization/          # Speaker diarization module
β”‚   β”‚   β”œβ”€β”€ manager.py        # Diarization orchestration
β”‚   β”‚   └── progress.py       # Enhanced progress tracking
β”‚   └── transcription/        # Transcription backends
β”‚       β”œβ”€β”€ voxtral_local.py  # Voxtral local inference
β”‚       └── voxtral_api.py    # Voxtral API integration
β”œβ”€β”€ gui/                       # User interface components
β”‚   β”œβ”€β”€ main_window.py        # Main application window
β”‚   β”œβ”€β”€ widgets/              # Custom widgets
β”‚   β”‚   β”œβ”€β”€ collapsible_box.py
β”‚   β”‚   β”œβ”€β”€ diarization_settings.py
β”‚   β”‚   β”œβ”€β”€ voxtral_settings.py
β”‚   β”‚   └── advanced_options.py
β”‚   └── dialogs/              # Dialog windows
β”‚       β”œβ”€β”€ dependencies_dialog.py
β”‚       β”œβ”€β”€ installer_dialog.py
β”‚       └── cuda_diagnostics_dialog.py
β”œβ”€β”€ workers/                   # Background processing
β”‚   β”œβ”€β”€ transcription_thread.py    # GUI thread wrapper
β”‚   β”œβ”€β”€ transcribe_worker.py       # Standalone transcription worker
β”‚   β”œβ”€β”€ diarize_worker.py          # Standalone diarization worker
β”‚   └── transcription/             # Transcription backend implementations
β”‚       β”œβ”€β”€ backends/
β”‚       β”‚   β”œβ”€β”€ base.py           # Base backend interface
β”‚       β”‚   β”œβ”€β”€ mlx_backend.py
β”‚       β”‚   β”œβ”€β”€ faster_whisper_backend.py
β”‚       β”‚   β”œβ”€β”€ transformers_backend.py
β”‚       β”‚   β”œβ”€β”€ whisper_cpp_backend.py
β”‚       β”‚   β”œβ”€β”€ ctranslate2_backend.py
β”‚       β”‚   β”œβ”€β”€ whisper_jax_backend.py
β”‚       β”‚   β”œβ”€β”€ insanely_fast_backend.py
β”‚       β”‚   β”œβ”€β”€ openai_whisper_backend.py
β”‚       β”‚   └── voxtral_backend.py
β”‚       └── utils.py
β”œβ”€β”€ utils/                     # Utility modules
β”‚   β”œβ”€β”€ device_detection.py   # CUDA/MPS/CPU detection
β”‚   β”œβ”€β”€ audio_utils.py        # Audio processing utilities
β”‚   β”œβ”€β”€ download_utils.py     # URL downloading
β”‚   β”œβ”€β”€ dependency_check.py   # Dependency verification
β”‚   └── format_utils.py       # Time formatting utilities
β”œβ”€β”€ models/                    # Model configuration
β”‚   └── model_config.py       # Model mappings & utilities
└── scripts/                   # Standalone utility scripts
    β”œβ”€β”€ test_voxtral.py       # Voxtral testing
    └── pyannote_torch26.py   # PyTorch 2.6+ compatibility

Running Tests

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_backends.py

# Run with coverage
pytest --cov=. --cov-report=html

Code Quality

# Format code
black .

# Lint
flake8 .
pylint susurrus/

# Type checking
mypy .

Adding a New Backend

  1. Create a new file in workers/transcription/backends/
  2. Inherit from TranscriptionBackend
  3. Implement required methods:
    from workers.transcription.backends.base import TranscriptionBackend

    class MyBackend(TranscriptionBackend):
        def transcribe(self, audio_path):
            # Yield (start, end, text) tuples
            pass

        def preprocess_audio(self, audio_path):
            # Optional preprocessing
            return audio_path

        def cleanup(self):
            # Optional cleanup
            pass
  4. Register in workers/transcription/backends/__init__.py
  5. Add to config.py BACKEND_MODEL_MAP
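The registration step can be pictured as a name-to-class mapping that `get_backend` dispatches through. The decorator-based registry below is a hypothetical sketch of the pattern, not the actual contents of `workers/transcription/backends/__init__.py`:

```python
# Hypothetical backend registry sketch (actual registration lives in
# workers/transcription/backends/__init__.py).
BACKENDS = {}

def register_backend(name):
    """Class decorator that maps a backend name to its class."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap

def get_backend(name, **kwargs):
    """Instantiate the backend registered under `name`."""
    try:
        return BACKENDS[name](**kwargs)
    except KeyError:
        raise ValueError(f"Unknown backend: {name!r}") from None

@register_backend("echo")
class EchoBackend:
    """Toy backend for demonstration only."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def transcribe(self, audio_path):
        # Yield (start, end, text) tuples, like a real backend would
        yield (0.0, 1.0, f"stub for {audio_path}")
```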

πŸ”§ Configuration

Settings Location

  • Windows: %APPDATA%\Susurrus\AudioTranscription.ini
  • macOS: ~/Library/Preferences/com.Susurrus.AudioTranscription.plist
  • Linux: ~/.config/Susurrus/AudioTranscription.conf

Environment Variables

  • HF_TOKEN: Hugging Face API token (diarization)
  • MISTRAL_API_KEY: Mistral AI API key (Voxtral)
  • PYTORCH_MPS_HIGH_WATERMARK_RATIO: MPS memory optimization
  • CUDA_VISIBLE_DEVICES: GPU selection
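Resolving a credential in the documented order (environment variable first, then config file) can be sketched as follows; `load_hf_token` is an illustrative helper, not part of the Susurrus API:

```python
import os
from pathlib import Path
from typing import Optional

def load_hf_token() -> Optional[str]:
    """Return the HF token: HF_TOKEN env var first, then ~/.huggingface/token."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".huggingface" / "token"
    if token_file.is_file():
        return token_file.read_text().strip()
    return None  # fall back to asking in the GUI
```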

πŸ“Š Performance Tips

GPU Acceleration

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

Apple Silicon Optimization

# Use MLX backend for best performance
pip install mlx-whisper

# Or use MPS device with other backends
# Will auto-detect in GUI
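Auto-detection along these lines typically probes CUDA first, then MPS, then falls back to CPU. A sketch of the idea (`pick_device` is illustrative, not the actual `utils/device_detection.py` logic):

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', preferring GPU when available."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed; CPU-only backends still work
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```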

Memory Management

  • Use smaller models for limited RAM
  • Enable chunking for long audio files
  • Use faster-batched backend with appropriate batch size
  • Close other applications during processing

πŸ› Troubleshooting

Common Issues

"No module named 'X'"

pip install X

FFmpeg not found

# Verify installation
ffmpeg -version

# Add to PATH if needed (Windows)
setx PATH "%PATH%;C:\path\to\ffmpeg\bin"
# Note: setx truncates values longer than 1024 characters; for a long PATH,
# edit it via System Properties > Environment Variables instead

CUDA errors

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Use Tools > CUDA Diagnostics in GUI for detailed info

Diarization authentication fails

# Verify token
python -c "from huggingface_hub import HfApi; HfApi().whoami(token='YOUR_TOKEN')"

# Accept license
# Visit: https://huggingface.co/pyannote/speaker-diarization

PyTorch 2.6+ compatibility issues

# Run the compatibility script
python scripts/pyannote_torch26.py

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/Susurrus.git
cd Susurrus

# Create feature branch
git checkout -b feature-name

# Install dev dependencies
pip install -r requirements-dev.txt

# Make changes and test
pytest tests/

# Submit PR

πŸ™ Acknowledgements
