A lightweight, transparent overlay application for local AI-powered speech transcription on Linux. Choose between real-time and on-demand (manual) transcription modes.
Contributions are welcome and encouraged! Whether you're fixing bugs, adding features, improving documentation, or testing on different distributions, your help is appreciated.
Getting Started:
- Check out ARCHITECTURE.md to understand the codebase structure
- Look at the planned features and known issues below for ideas
- Test your changes on your distribution (we aim to support NixOS and other major distros)
- Open an issue or PR - no formal guidelines yet, just make sure it works!
Note: The application is in active development. You may encounter bugs or instability as new features are added.
Features:
- Local AI Processing: All transcription happens on your device - no cloud services required
- Multi-Backend Support: Choose between CTranslate2, Whisper.cpp, and other local AI backends
- Dual Transcription Modes: Real-time continuous transcription or manual on-demand sessions
- GPU Acceleration: Accelerate transcription using Vulkan (currently only with the whisper_cpp backend; no CUDA support yet)
- Voice Activity Detection: Uses Silero VAD for accurate speech detection
- Transparent Overlay: Non-intrusive overlay that sits at the bottom of your screen
- Audio Visualization: Visual feedback when speaking with a spectrogram display
- Copy/Paste Functionality: Easily copy transcribed text to clipboard
- Pause/Resume Recording: Pause/Resume recording (real-time mode) or Start/Stop sessions (manual mode)
- Auto-Start Recording: Begins recording automatically in real-time mode (manual mode requires manual start)
- Scroll Controls: Navigate through longer transcripts
- CLI Mode: Run without the GUI in terminal mode using the --cli flag for headless usage
- Sound Feedback: Optional audio cues for recording state changes
- Configurable: Configure the backend, model, language, transcription mode, and other settings in the config file (config.toml)
- Automatic Model Download: Models are downloaded automatically based on selected backend
- Performance Monitoring: Optional statistics logging for transcription performance analysis
- Global Shortcuts: Optional XDG Desktop Portal integration for system-wide hotkeys (e.g., Super+backslash to toggle manual sessions)
- Portal Input: Optional automatic pasting via XDG Desktop Portal for seamless text injection
- System Tray Integration: Quick access via system tray with window control and status display
- Display Configuration: VSync and frame rate control for optimized rendering
- Window Behavior Control: Auto-hide, window positioning, and system tray integration options
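Voice activity detection decides which stretches of audio are worth sending to the transcription model. Silero VAD outputs a speech probability for each short frame of audio. The Rust sketch below shows one common way to turn such probabilities into speech segments. It uses a threshold plus a short "hangover" so that brief pauses do not split a segment. The speech_segments function, the 0.5 threshold, and the hangover rule are illustrative assumptions, not Sonori's actual VAD logic.

```rust
/// A speech segment as a half-open range of frame indices [start, end).
#[derive(Debug, PartialEq)]
struct Segment {
    start: usize,
    end: usize,
}

/// Hypothetical helper: groups per-frame speech probabilities (e.g. from a VAD model)
/// into segments. A frame counts as speech when its probability is >= `threshold`;
/// an open segment closes once the silence after it exceeds `hangover` frames.
fn speech_segments(probs: &[f32], threshold: f32, hangover: usize) -> Vec<Segment> {
    let mut segments = Vec::new();
    let mut start: Option<usize> = None; // first frame of the open segment, if any
    let mut last_speech = 0; // index of the most recent speech frame
    for (i, &p) in probs.iter().enumerate() {
        if p >= threshold {
            if start.is_none() {
                start = Some(i);
            }
            last_speech = i;
        } else if let Some(s) = start {
            // Number of non-speech frames since the last speech frame.
            if i - last_speech > hangover {
                segments.push(Segment { start: s, end: last_speech + 1 });
                start = None;
            }
        }
    }
    // Flush a segment that is still open at the end of the input.
    if let Some(s) = start {
        segments.push(Segment { start: s, end: last_speech + 1 });
    }
    segments
}

fn main() {
    let probs = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1, 0.1, 0.1, 0.9, 0.95];
    println!("{:?}", speech_segments(&probs, 0.5, 1));
}
```

The hangover trades responsiveness for stability. A larger value keeps short pauses inside one segment, while a value of 0 splits at every non-speech frame.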
Planned Features:
- Better Error Handling: Handle errors gracefully and provide useful error messages
- Better UI: A redesigned interface with a stronger focus on usability
- Additional Local AI Backends: Support for other specialized local transcription models
- CUDA Support: Enhanced GPU acceleration across all backends
- Cloud API Support: Optional integration with cloud providers (Deepgram, OpenAI) for users who prefer cloud processing
Not Planned:
- Using a GUI framework: I want to learn more about wgpu and WGSL, and I think a GUI written from scratch is perfectly fine for this application
- Support for Windows/macOS: Not planned by me personally, but if anyone wants to give it a shot, feel free
Platform: Linux only (x86_64, aarch64)
Note: Primarily tested on NixOS, but should work on other Linux distributions with proper dependencies installed. Feedback on other distros is welcome!
For Debian/Ubuntu-based distributions:
Ubuntu 24.04+ (Noble and later):
sudo apt install build-essential portaudio19-dev libclang-dev pkg-config wl-clipboard \
libxkbcommon-dev libwayland-dev libx11-dev libxcursor-dev libxi-dev libxrandr-dev \
libasound2-dev libssl-dev libfftw3-dev curl cmake libvulkan-dev \
libopenblas-dev glslc
Ubuntu 22.04 and earlier:
Note: glslc is not available in standard repositories. You'll need to either:
- Upgrade to Ubuntu 24.04, or
- Download glslc from LunarG Vulkan SDK, or
- Build shaderc from source
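Tools such as glslc may be missing depending on your distribution, so it can help to check what is actually on your PATH before building. Below is a minimal std-only Rust sketch (Unix only) that mimics command -v. find_in_path is a hypothetical helper, not part of Sonori. It only checks each PATH directory for a file with an execute bit set.

```rust
use std::env;
use std::os::unix::fs::PermissionsExt;
use std::path::PathBuf;

/// Hypothetical helper: returns the first executable file named `name`
/// found in the directories listed in $PATH, similar to `command -v`.
fn find_in_path(name: &str) -> Option<PathBuf> {
    let path_var = env::var_os("PATH")?; // no PATH at all means nothing can be found
    env::split_paths(&path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| {
            candidate
                .metadata() // follows symlinks, like the shell does
                .map(|m| m.is_file() && (m.permissions().mode() & 0o111) != 0)
                .unwrap_or(false)
        })
}

fn main() {
    for tool in ["glslc", "wl-copy", "cmake"] {
        match find_in_path(tool) {
            Some(path) => println!("{tool}: found at {}", path.display()),
            None => println!("{tool}: not found in PATH"),
        }
    }
}
```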
For Fedora/RHEL-based distributions:
sudo dnf install gcc gcc-c++ portaudio-devel clang-devel pkg-config wl-clipboard \
libxkbcommon-devel wayland-devel libX11-devel libXcursor-devel libXi-devel libXrandr-devel \
alsa-lib-devel openssl-devel fftw-devel curl cmake vulkan-loader-devel vulkan-headers \
openblas-devel shaderc
For Arch-based distributions:
sudo pacman -S base-devel portaudio clang pkgconf wl-clipboard \
libxkbcommon wayland libx11 libxcursor libxi libxrandr alsa-lib openssl fftw curl cmake \
vulkan-headers vulkan-tools blas shaderc
For NixOS:
Use the provided flake.nix by running nix develop in the root directory of the repository. The flake includes all necessary dependencies, including vulkan-loader.
Sonori needs models to function properly, depending on the selected backend:
- Transcription Model: Downloaded automatically based on the selected backend:
  - CTranslate2: Hugging Face models converted to CTranslate2 format
  - Whisper.cpp: GGML format models from the whisper.cpp repository
- Silero VAD Model: Downloaded automatically on first run (shared across all backends)
Note: If you need to download the Silero model manually, you can get it from https://github.com/snakers4/silero-vad/ and place it in ~/.cache/sonori/models/
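To confirm that a manually downloaded model ended up in the right place, the sketch below resolves ~/.cache/sonori/models/ from $HOME and lists the files in it. models_dir and list_models are hypothetical helpers. The sketch assumes the directory is always under $HOME/.cache, as stated above, and does not check whether Sonori also honors $XDG_CACHE_HOME.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds `<home>/.cache/sonori/models`.
fn models_dir(home: &Path) -> PathBuf {
    home.join(".cache").join("sonori").join("models")
}

/// Lists regular file names in `dir`, sorted; errors if the directory is unreadable.
fn list_models(dir: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            names.push(entry.file_name().to_string_lossy().into_owned());
        }
    }
    names.sort();
    Ok(names)
}

fn main() {
    let Some(home) = env::var_os("HOME") else {
        eprintln!("HOME is not set");
        return;
    };
    let dir = models_dir(Path::new(&home));
    match list_models(&dir) {
        Ok(names) if names.is_empty() => println!("{} exists but is empty", dir.display()),
        Ok(names) => {
            for name in names {
                println!("{}", dir.join(name).display());
            }
        }
        Err(err) => println!("cannot read {}: {err}", dir.display()),
    }
}
```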
Additional Requirements:
- ONNX Runtime: Required for the Silero VAD model
- Ubuntu/Debian: Not available in standard repos. Download from GitHub releases:
ONNX_VERSION=1.20.0
wget https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_VERSION}/onnxruntime-linux-x64-${ONNX_VERSION}.tgz
tar -xzf onnxruntime-linux-x64-${ONNX_VERSION}.tgz
sudo cp -r onnxruntime-linux-x64-${ONNX_VERSION}/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-${ONNX_VERSION}/lib/* /usr/local/lib/
sudo ldconfig
- NixOS: Included in the development environment via nix develop
- CTranslate2: Used for CTranslate2 backend inference
- whisper-rs: Used for Whisper.cpp backend inference
- OpenBLAS: Required for Whisper.cpp CPU optimization. For better performance on modern CPUs, ensure this is installed
- CPAL: Required for sound feedback system
- Vulkan: Required for WGPU rendering and optional GPU acceleration in Whisper.cpp. Your system must have:
- Vulkan loader and headers
- Shader compiler (shaderc) for compiling the Vulkan shaders used by GPU acceleration
Try without installing:
nix run github:0xPD33/sonori
Install to profile:
nix profile install github:0xPD33/sonori
Add to configuration.nix:
{
  inputs.sonori.url = "github:0xPD33/sonori";
  # In your system configuration:
  environment.systemPackages = [ inputs.sonori.packages.${system}.default ];
}
- Download the latest tarball from GitHub Releases
- Extract:
tar -xzf sonori-*.tar.gz
- Run:
./sonori-*/sonori
Building from Source:
Requirements: Install Rust and Cargo from https://rustup.rs/
Install the system dependencies for your distribution (Arch, Fedora, or Ubuntu 24.04+) using the commands listed above, then install ONNX Runtime (see Additional Requirements section above).
Ubuntu 22.04: See the notes above about glslc availability and ONNX Runtime installation.
NixOS: Run nix develop in the repository root.
Build:
git clone https://github.com/0xPD33/sonori
cd sonori
cargo build --release
./target/release/sonori
To integrate Sonori with your application menu and system:
For NixOS: Desktop integration is automatic via the Nix flake.
For other distributions:
# User installation (recommended)
./install-desktop.sh --user
# System-wide installation (requires root)
sudo ./install-desktop.sh --system
This installs:
- Application menu entry (.desktop file)
- AppStream metadata for software centers
- Application icon
See desktop/README.md for detailed instructions and manual installation steps.
- Launch the application:
./target/release/sonori
- A transparent overlay will appear at the bottom of your screen
- In real-time mode, recording starts automatically; in manual mode, press Record to start sessions
- Speak naturally; your speech will be transcribed in real time or near real time, depending on the model and hardware
- Use the buttons on the overlay to:
- Pause/Resume recording (real-time mode)
- Start/Stop manual sessions and Accept transcript (manual mode)
- Copy text to clipboard
- Clear transcript history
- Toggle between real-time and manual modes
- Exit the application
For manual mode, start a session with the Record button, speak, then stop and accept to transcribe the accumulated audio.
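The manual-mode flow above can be modeled as a small state machine: start a session, speak, stop, then accept to transcribe the accumulated audio. This is a hypothetical Rust sketch of that flow. Session, Event, and step are illustrative names, not Sonori's internals. Transcription itself is left out; accepting simply hands back the buffered samples.

```rust
/// Hypothetical states for a manual transcription session.
#[derive(Debug, PartialEq)]
enum Session {
    Idle,
    Recording(Vec<f32>), // audio accumulated so far
    Stopped(Vec<f32>),   // recording finished, waiting for Accept
}

/// Hypothetical user actions and audio input.
enum Event {
    Record,
    Stop,
    Accept,
    Audio(Vec<f32>), // a chunk of captured samples
}

/// Applies one event; returns the new state plus the audio to transcribe
/// when a stopped session is accepted.
fn step(state: Session, event: Event) -> (Session, Option<Vec<f32>>) {
    match (state, event) {
        (Session::Idle, Event::Record) => (Session::Recording(Vec::new()), None),
        (Session::Recording(mut buf), Event::Audio(chunk)) => {
            buf.extend(chunk);
            (Session::Recording(buf), None)
        }
        (Session::Recording(buf), Event::Stop) => (Session::Stopped(buf), None),
        // Accepting hands the whole accumulated buffer to the transcriber.
        (Session::Stopped(buf), Event::Accept) => (Session::Idle, Some(buf)),
        // Events that do not apply to the current state are ignored.
        (state, _) => (state, None),
    }
}

fn main() {
    let mut state = Session::Idle;
    let events = vec![
        Event::Audio(vec![0.5; 4]), // ignored: no session yet
        Event::Record,
        Event::Audio(vec![0.1; 160]),
        Event::Audio(vec![0.2; 160]),
        Event::Stop,
        Event::Accept,
    ];
    for event in events {
        let (next, ready) = step(state, event);
        state = next;
        if let Some(audio) = ready {
            println!("transcribing {} samples", audio.len());
        }
    }
    assert_eq!(state, Session::Idle);
}
```

Ignoring inapplicable events rather than reporting errors is one reasonable choice for a button-driven overlay. Presses that arrive in the wrong state simply have no effect.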
For headless usage or terminal-based transcription:
- Launch in CLI mode:
./target/release/sonori --cli
- Transcription will appear directly in your terminal
- In real-time mode, recording starts automatically; in manual mode, use spacebar to start/stop sessions
- Speak naturally - transcriptions will update in real-time on the same line (real-time mode) or after session acceptance (manual mode)
- Press Ctrl+C to exit gracefully
Command-line Options:
- --cli: Run in CLI mode without the GUI
- --mode <realtime|manual>: Set the transcription mode (default: manual)
- --manual: Shorthand for --mode manual to start in manual transcription mode
- --help: Show help information
- --version: Display version information
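Contributors and anyone wrapping Sonori in scripts may find it useful to see how the documented flags could be parsed with the standard library alone. This sketch is hypothetical: Options, Mode, and parse_args are not Sonori's actual types, and the real binary may use an argument-parsing crate instead. It follows the documented default of manual mode and treats --manual as shorthand for --mode manual.

```rust
use std::env;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    Realtime,
    Manual,
}

/// Hypothetical parsed options mirroring the documented flags.
#[derive(Debug, PartialEq)]
struct Options {
    cli: bool,
    mode: Mode,
    help: bool,
    version: bool,
}

/// Parses the documented flags; the default mode is manual.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Options, String> {
    let mut opts = Options { cli: false, mode: Mode::Manual, help: false, version: false };
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        match arg.as_str() {
            "--cli" => opts.cli = true,
            "--manual" => opts.mode = Mode::Manual,
            "--help" => opts.help = true,
            "--version" => opts.version = true,
            "--mode" => {
                // --mode takes its value as the next argument.
                opts.mode = match iter.next().as_deref() {
                    Some("realtime") => Mode::Realtime,
                    Some("manual") => Mode::Manual,
                    Some(other) => return Err(format!("invalid mode: {other}")),
                    None => return Err("--mode requires a value".to_string()),
                };
            }
            other => return Err(format!("unknown argument: {other}")),
        }
    }
    Ok(opts)
}

fn main() {
    // Skip argv[0], the program name.
    match parse_args(env::args().skip(1)) {
        Ok(opts) => println!("{opts:?}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```

Later flags override earlier ones, so "--mode realtime --manual" ends up in manual mode.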
Sonori uses a config.toml file for configuration. The defaults work well for most users - you typically only need to change 2-3 settings.
Quick Setup: Most users just need to choose a configuration from the Configuration Guide and copy it to config.toml.
- Fast & Lightweight: Good for older computers
- Balanced Performance: Recommended for most users
- High Quality: For powerful computers with GPU
- Real-Time: Live transcription as you speak
- Multilingual: For non-English languages
See the complete configuration guide for all examples and advanced settings.
Known Issues:
- The application might not work with all Wayland compositors (I have only tested it with KDE Plasma and KWin).
- Transcriptions are not 100% accurate and may contain errors; accuracy depends largely on the Whisper model used.
- 30-second transcription truncation: Recordings exactly 30 seconds long may get truncated. This is a known architectural limitation of Whisper models, not a bug. Whisper uses 30-second processing windows with a 448-token limit, and dense speech can exhaust this limit before the full 30 seconds are transcribed. See the Troubleshooting section for solutions.
- CPU usage is too high, even when idle. This may be caused by inefficiencies in my code or by overhead from the models. I have found that changing the buffer size affects it, for better or worse.
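Buffer size matters for a general reason: the audio callback runs once per buffer. Smaller buffers lower the latency of each chunk but wake the process more often. The sketch below computes both numbers. It assumes a 16 kHz mono stream (the sample rate Whisper models expect) and example buffer sizes, which may not match what Sonori actually uses.

```rust
/// Latency in milliseconds contributed by one buffer of `frames` samples.
fn buffer_latency_ms(frames: u32, sample_rate: u32) -> f64 {
    frames as f64 * 1000.0 / sample_rate as f64
}

/// How many times per second the audio callback runs for a given buffer size.
fn callbacks_per_second(frames: u32, sample_rate: u32) -> f64 {
    sample_rate as f64 / frames as f64
}

fn main() {
    let sample_rate = 16_000; // assumed: Whisper models expect 16 kHz audio
    for frames in [256, 512, 1024, 4096] {
        println!(
            "{frames:>5} frames: {:>6.1} ms per buffer, {:>6.2} callbacks/s",
            buffer_latency_ms(frames, sample_rate),
            callbacks_per_second(frames, sample_rate)
        );
    }
}
```

For example, at 16 kHz a 1024-frame buffer adds 64 ms of latency per chunk and triggers 15.625 callbacks per second. Halving the buffer halves the latency but doubles the wake-ups.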
Troubleshooting:
Sonori uses the layer shell protocol on Wayland compositors. If you experience issues:
- Make sure you are in a Wayland session and that your compositor supports the layer shell protocol
Vulkan Support:
Sonori uses WGPU for rendering and can optionally accelerate transcription on the GPU; both require Vulkan support. If you encounter errors related to adapter detection or Vulkan:
- Ensure you have the Vulkan libraries installed for your distribution (see Dependencies section)
- Verify that your GPU supports Vulkan and that drivers are properly installed
- On some systems, you may need to install additional vendor-specific Vulkan packages (e.g., mesa-vulkan-drivers on Ubuntu/Debian)
- You can test Vulkan support by running vulkaninfo or vkcube, if available on your system
If GPU acceleration is enabled but not working:
- Ensure gpu_enabled = true is set in the [backend_config] section
- Verify that your system has Vulkan support (see Vulkan Support section above)
- Check that shaderc is properly installed (required for shader compilation)
- For NVIDIA GPUs: ensure up-to-date NVIDIA drivers with Vulkan support are installed (GPU acceleration uses Vulkan; CUDA is not supported yet)
- For AMD/Intel: ensure appropriate Vulkan drivers are installed
- If compilation fails with shader errors, try disabling GPU acceleration and using CPU mode instead
- Monitor GPU usage with nvidia-smi (NVIDIA) or rocm-smi (AMD) while transcribing
If you encounter issues with automatic model conversion:
For NixOS:
nix-shell model-conversion/shell.nix
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
For other distributions:
pip install -U ctranslate2 huggingface_hub torch transformers
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json
If you experience transcription cutoffs with recordings exactly 30 seconds long, this is due to Whisper's architectural limitations:
Root Cause: Whisper models process audio in 30-second windows with a 448 token limit. Dense speech can exhaust this limit before the full 30 seconds are transcribed.
Solutions:
- Keep recordings under 30 seconds (simplest): In manual mode, try to keep your recordings to around 25 seconds or less to avoid this boundary entirely.
- Adjust chunk settings (recommended):
[manual_mode_config]
chunk_duration_seconds = 20.0 # Experiment with values between 15 and 25
chunk_overlap_seconds = 2.0 # Overlap helps prevent words from being cut off
- Switch to the CTranslate2 backend:
[backend_config]
backend = "ctranslate2"
Try different chunk_duration_seconds values to find what works best for your speech patterns and content density.
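The sketch below shows how a long manual recording could be split so that every chunk stays well under Whisper's 30-second window. It assumes chunk_duration_seconds and chunk_overlap_seconds are applied as fixed-length windows that start (duration - overlap) seconds apart. chunk_windows is a hypothetical helper, and the stepping rule is an assumption rather than a description of Sonori's code.

```rust
/// A chunk of audio as start/end times in seconds.
#[derive(Debug, PartialEq)]
struct Chunk {
    start: f64,
    end: f64,
}

/// Hypothetical helper: splits `total` seconds of audio into chunks of
/// `duration` seconds, each starting `duration - overlap` after the previous one.
/// The last chunk is clipped to the end of the recording.
fn chunk_windows(total: f64, duration: f64, overlap: f64) -> Vec<Chunk> {
    assert!(duration > 0.0 && overlap >= 0.0 && overlap < duration, "invalid chunk settings");
    let step = duration - overlap;
    let mut chunks = Vec::new();
    let mut start: f64 = 0.0;
    while start < total {
        let end = (start + duration).min(total);
        chunks.push(Chunk { start, end });
        if end >= total {
            break; // the recording is fully covered
        }
        start += step;
    }
    chunks
}

fn main() {
    // A 50-second recording with the example settings above (20 s chunks, 2 s overlap).
    for c in chunk_windows(50.0, 20.0, 2.0) {
        println!("{:5.1}s -> {:5.1}s", c.start, c.end);
    }
}
```

With 20-second chunks and a 2-second overlap, a 50-second recording yields three chunks: 0 to 20 s, 18 to 38 s, and 36 to 50 s. Each boundary region is transcribed twice, which is what the overlap is meant to protect.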
Supported:
- Linux x86_64 (64-bit Intel/AMD)
- Linux aarch64 (64-bit ARM)
Tested on:
- NixOS with KDE Plasma/KWin (Wayland)
- Other major Linux distributions should work with proper dependencies
Not supported:
- Windows
- macOS
- 32-bit architectures
Note: While primarily developed and tested on NixOS, Sonori should work on other Linux distributions with the proper dependencies installed. Feedback and testing on other distros is welcome!
- Rust
- CTranslate2 and Faster Whisper
- whisper.cpp and whisper-rs
- ONNX Runtime
- OpenAI Whisper
- Silero VAD
- CPAL
- Winit Fork
- WGPU
This project is licensed under the MIT License. See the LICENSE file for details.