Sonori

A lightweight, transparent overlay application for local AI-powered speech transcription on Linux. Choose between real-time or on-demand manual transcription modes.

Contributing

Contributions are welcome and encouraged! Whether you're fixing bugs, adding features, improving documentation, or testing on different distributions, your help is appreciated.

Getting Started:

Check out ARCHITECTURE.md to understand the codebase structure
Look at the planned features and known issues below for ideas
Test your changes on your distribution (we aim to support NixOS and other major distros)
Open an issue or PR - no formal guidelines yet, just make sure it works!

Note: The application is in active development. You may encounter bugs or instability as new features are added.

Features

Current

Local AI Processing: All transcription happens on your device - no cloud services required
Multi-Backend Support: Choose between CTranslate2, Whisper.cpp, and other local AI backends
Dual Transcription Modes: Real-time continuous transcription or manual on-demand sessions
GPU Acceleration: Vulkan-accelerated transcription (Whisper.cpp backend only; no CUDA support yet)
Voice Activity Detection: Uses Silero VAD for accurate speech detection
Transparent Overlay: Non-intrusive overlay that sits at the bottom of your screen
Audio Visualization: Visual feedback when speaking with a spectrogram display
Copy/Paste Functionality: Easily copy transcribed text to clipboard
Pause/Resume Recording: Pause and resume recording (real-time mode) or start and stop sessions (manual mode)
Auto-Start Recording: Begins recording automatically in real-time mode (manual mode requires manual start)
Scroll Controls: Navigate through longer transcripts
CLI Mode: Run without GUI in terminal mode using --cli flag for headless usage
Sound Feedback: Optional audio cues for recording state changes
Configurable: Configure the backend, model, language, transcription mode, and other settings in the config file (config.toml)
Automatic Model Download: Models are downloaded automatically based on selected backend
Performance Monitoring: Optional statistics logging for transcription performance analysis
Global Shortcuts: Optional XDG Desktop Portal integration for system-wide hotkeys (e.g., Super+backslash to toggle manual sessions)
Portal Input: Optional automatic pasting via XDG Desktop Portal for seamless text injection
System Tray Integration: Quick access via system tray with window control and status display
Display Configuration: VSync and frame rate control for optimized rendering
Window Behavior Control: Auto-hide, window positioning, and system tray integration options

Planned

Better error handling: Handle errors gracefully and provide useful error messages
Better UI: An improved interface with a focus on usability
Additional Local AI Backends: Support for other specialized local transcription models
CUDA Support: Enhanced GPU acceleration across all backends
Cloud API Support: Optional integration with cloud providers (Deepgram, OpenAI) for users who prefer cloud processing

NOT Planned

Using a GUI framework: I want to learn more about wgpu and WGSL, and I think a GUI written from scratch is perfectly fine for this application
Support for Windows/macOS: Not planned by me personally, but if anyone wants to give it a shot, feel free

Requirements

Platform: Linux only (x86_64, aarch64)

Dependencies

Note: Primarily tested on NixOS, but should work on other Linux distributions with proper dependencies installed. Feedback on other distros is welcome!

For Debian/Ubuntu-based distributions:

Ubuntu 24.04+ (Noble and later):

sudo apt install build-essential portaudio19-dev libclang-dev pkg-config wl-clipboard \
  libxkbcommon-dev libwayland-dev libx11-dev libxcursor-dev libxi-dev libxrandr-dev \
  libasound2-dev libssl-dev libfftw3-dev curl cmake libvulkan-dev \
  libopenblas-dev glslc

Ubuntu 22.04 and earlier: glslc is not available in the standard repositories, so you'll need to do one of the following:

Upgrade to Ubuntu 24.04, or
Download glslc from LunarG Vulkan SDK, or
Build shaderc from source
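
Whichever option you pick, it may help to confirm that the shader compiler is actually on your PATH before building. A minimal sketch; `have_cmd` is a hypothetical helper name introduced here, not part of Sonori:

```shell
# have_cmd NAME: succeed if NAME resolves to an executable on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd glslc; then
  echo "glslc found at $(command -v glslc)"
else
  echo "glslc not found: pick one of the options listed above"
fi
```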

For Fedora/RHEL-based distributions:

sudo dnf install gcc gcc-c++ portaudio-devel clang-devel pkg-config wl-clipboard \
  libxkbcommon-devel wayland-devel libX11-devel libXcursor-devel libXi-devel libXrandr-devel \
  alsa-lib-devel openssl-devel fftw-devel curl cmake vulkan-loader-devel vulkan-headers \
  openblas-devel shaderc

For Arch-based distributions:

sudo pacman -S base-devel portaudio clang pkgconf wl-clipboard \
  libxkbcommon wayland libx11 libxcursor libxi libxrandr alsa-lib openssl fftw curl cmake \
  vulkan-headers vulkan-tools blas shaderc

For NixOS:

Simply use the provided flake.nix by running

nix develop

while in the root directory of the repository. The flake includes all necessary dependencies including vulkan-loader.

Required Models

Sonori needs models to function properly, depending on the selected backend:

Transcription Model - Downloaded automatically based on backend selection:
- CTranslate2: Hugging Face models converted to CTranslate2 format
- Whisper.cpp: GGML format models from whisper.cpp repository
Silero VAD Model - Downloaded automatically on first run (shared across all backends)

Note: If you need to download the Silero model manually for any reason, you can get it from https://github.com/snakers4/silero-vad/ and place it in ~/.cache/sonori/models/
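
If you do place the Silero model by hand, the steps might look like the sketch below. `place_model` and `MODEL_DIR` are hypothetical names introduced for illustration, and the model filename is left to the caller because this README does not state it.

```shell
# Cache directory named in the note above.
MODEL_DIR="${HOME}/.cache/sonori/models"

# place_model SRC [DEST_DIR]: copy an already-downloaded model file into
# the cache directory, creating the directory if needed.
place_model() {
  src=$1
  dest=${2:-$MODEL_DIR}
  mkdir -p "$dest" && cp -- "$src" "$dest/"
}
```

Call it with the path of the file you downloaded, for example `place_model ~/Downloads/<model file>`.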

Additional Requirements

ONNX Runtime: Required for the Silero VAD model

Ubuntu/Debian: Not available in standard repos. Download from GitHub releases:

ONNX_VERSION=1.20.0
wget https://github.com/microsoft/onnxruntime/releases/download/v${ONNX_VERSION}/onnxruntime-linux-x64-${ONNX_VERSION}.tgz
tar -xzf onnxruntime-linux-x64-${ONNX_VERSION}.tgz
sudo cp -r onnxruntime-linux-x64-${ONNX_VERSION}/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-${ONNX_VERSION}/lib/* /usr/local/lib/
sudo ldconfig

NixOS: Included in development environment via nix develop

CTranslate2: Used for CTranslate2 backend inference
whisper-rs: Used for Whisper.cpp backend inference
OpenBLAS: Required for CPU optimization in the Whisper.cpp backend. Make sure it is installed for better performance on modern CPUs
CPAL: Required for sound feedback system
Vulkan: Required for WGPU rendering and optional GPU acceleration in Whisper.cpp. Your system must have:
- Vulkan loader and headers
- Shader compiler (shaderc) for compiling Vulkan shaders
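
After a manual ONNX Runtime install like the Ubuntu/Debian one above, you can check whether the dynamic linker sees the library. A minimal sketch, assuming the shared object is named `libonnxruntime.so*` as in the upstream release tarballs; `has_onnxruntime` is a hypothetical helper name:

```shell
# has_onnxruntime: read `ldconfig -p` style output on stdin and succeed
# if a libonnxruntime shared object is listed.
has_onnxruntime() {
  grep 'libonnxruntime\.so' >/dev/null
}

if command -v ldconfig >/dev/null 2>&1; then
  if ldconfig -p | has_onnxruntime; then
    echo "ONNX Runtime is visible to the dynamic linker"
  else
    echo "ONNX Runtime not in the linker cache; re-run 'sudo ldconfig' after copying the libraries"
  fi
else
  echo "ldconfig is not on PATH; try /sbin/ldconfig -p"
fi
```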

Installation

NixOS (Recommended)

Try without installing:

nix run github:0xPD33/sonori

Install to profile:

nix profile install github:0xPD33/sonori

Add to your NixOS configuration (flake-based):

{
  # In the inputs of your flake.nix:
  inputs.sonori.url = "github:0xPD33/sonori";

  # In your system configuration:
  environment.systemPackages = [ inputs.sonori.packages.${system}.default ];
}

From Releases

Download the latest tarball from GitHub Releases
Extract: tar -xzf sonori-*.tar.gz
Run: ./sonori-*/sonori

Building from Source

Requirements: Install Rust and Cargo from https://rustup.rs/

Arch/Manjaro

sudo pacman -S base-devel portaudio clang pkgconf wl-clipboard \
  libxkbcommon wayland libx11 libxcursor libxi libxrandr alsa-lib openssl fftw curl cmake \
  vulkan-headers vulkan-tools blas shaderc

Fedora/RHEL

sudo dnf install gcc gcc-c++ portaudio-devel clang-devel pkg-config wl-clipboard \
  libxkbcommon-devel wayland-devel libX11-devel libXcursor-devel libXi-devel libXrandr-devel \
  alsa-lib-devel openssl-devel fftw-devel curl cmake vulkan-loader-devel vulkan-headers \
  openblas-devel shaderc

Debian/Ubuntu

Ubuntu 24.04+:

sudo apt install build-essential portaudio19-dev libclang-dev pkg-config wl-clipboard \
  libxkbcommon-dev libwayland-dev libx11-dev libxcursor-dev libxi-dev libxrandr-dev \
  libasound2-dev libssl-dev libfftw3-dev curl cmake libvulkan-dev \
  libopenblas-dev glslc

Then install ONNX Runtime (see Additional Requirements section above).

Ubuntu 22.04: See notes above about glslc availability and ONNX Runtime installation

NixOS

nix develop

Build:

git clone https://github.com/0xPD33/sonori
cd sonori
cargo build --release
./target/release/sonori

Desktop Integration

To integrate Sonori with your application menu and system:

For NixOS: Desktop integration is automatic via the Nix flake.

For other distributions:

# User installation (recommended)
./install-desktop.sh --user

# System-wide installation (requires root)
sudo ./install-desktop.sh --system

This installs:

Application menu entry (.desktop file)
AppStream metadata for software centers
Application icon

See desktop/README.md for detailed instructions and manual installation steps.

Usage

GUI Mode (Default)

Launch the application:
```
./target/release/sonori
```
A transparent overlay will appear at the bottom of your screen
In real-time mode, recording starts automatically; in manual mode, press Record to start sessions
Speak naturally - your speech will be transcribed in real-time or near real-time (based on the model and hardware)
Use the buttons on the overlay to:
- Pause/Resume recording (real-time mode)
- Start/Stop manual sessions and Accept transcript (manual mode)
- Copy text to clipboard
- Clear transcript history
- Toggle between real-time and manual modes
- Exit the application

For manual mode, start a session with the Record button, speak, then stop and accept to transcribe the accumulated audio.

CLI Mode

For headless usage or terminal-based transcription:

Launch in CLI mode:
```
./target/release/sonori --cli
```
Transcription will appear directly in your terminal
In real-time mode, recording starts automatically; in manual mode, use spacebar to start/stop sessions
Speak naturally - transcriptions will update in real-time on the same line (real-time mode) or after session acceptance (manual mode)
Press Ctrl+C to exit gracefully

Command Line Options

--cli: Run in CLI mode without GUI
--mode <realtime|manual>: Set transcription mode (default: manual)
--manual: Shorthand for --mode manual to start in manual transcription mode
--help: Show help information
--version: Display version information
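
The flags above can be combined on one command line. The sketch below assumes that `--cli` and `--mode` may be used together, which this README does not state explicitly. `run_sonori` and `SONORI_BIN` are hypothetical names; the wrapper only prints the command when the binary has not been built yet.

```shell
# Path from the build step; override SONORI_BIN if the binary lives elsewhere.
SONORI_BIN=${SONORI_BIN:-./target/release/sonori}

# run_sonori ARGS...: run Sonori if the binary exists, otherwise print
# the command that would have been run.
run_sonori() {
  if [ -x "$SONORI_BIN" ]; then
    "$SONORI_BIN" "$@"
  else
    echo "would run: $SONORI_BIN $*"
  fi
}

# Headless, continuous transcription in the terminal.
run_sonori --cli --mode realtime
```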

Configuration

Sonori uses a config.toml file for configuration. The defaults work well for most users - you typically only need to change 2-3 settings.

Quick Setup: Most users just need to choose a configuration from the Configuration Guide and copy it to config.toml.

Common Choices:

Fast & Lightweight: Good for older computers
Balanced Performance: Recommended for most users
High Quality: For powerful computers with GPU
Real-Time: Live transcription as you speak
Multilingual: For non-English languages

See the complete configuration guide for all examples and advanced settings.

Known Issues

The application might not work with all Wayland compositors (I only tested it with KDE Plasma and KWin).
The transcriptions are not 100% accurate and might contain errors. Accuracy depends largely on the Whisper model that is used.
30-second transcription truncation: Recordings exactly 30 seconds long may get truncated. This is a known architectural limitation of Whisper models, not a bug. Whisper uses 30-second processing windows with a 448 token limit - dense speech can exhaust this limit before the full 30 seconds are transcribed. See Troubleshooting section for solutions.
CPU usage is too high, even when idle. This might be caused by inefficient code on my side or by overhead from the models. I have already found that changing the buffer size affects it, for better or for worse.

Troubleshooting

Wayland Support

Sonori uses layer shell protocol for Wayland compositors. If you experience issues:

Make sure you are in a Wayland session and that your compositor supports the layer shell protocol

Vulkan Support

Sonori uses WGPU for rendering and has the ability to accelerate transcription using the GPU, which requires Vulkan support. If you encounter errors related to adapter detection or Vulkan:

Ensure you have the Vulkan libraries installed for your distribution (see Dependencies section)
Verify that your GPU supports Vulkan and that drivers are properly installed
On some systems, you may need to install additional vendor-specific Vulkan packages (e.g., mesa-vulkan-drivers on Ubuntu/Debian)
You can test Vulkan support by running vulkaninfo or vkcube if available on your system
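
As a quick version of the last check, the sketch below lists the devices that `vulkaninfo` reports. `vulkan_devices` is a hypothetical helper name. An empty result suggests that no Vulkan-capable device or driver was found.

```shell
# vulkan_devices: filter vulkaninfo output on stdin down to device-name lines.
vulkan_devices() {
  grep 'deviceName' || true
}

if command -v vulkaninfo >/dev/null 2>&1; then
  vulkaninfo 2>/dev/null | vulkan_devices
else
  echo "vulkaninfo not installed (usually provided by the vulkan-tools package)"
fi
```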

GPU Acceleration (Whisper.cpp Backend)

If GPU acceleration is enabled but not working:

Ensure gpu_enabled = true is set in the [backend_config] section
Verify that your system has Vulkan support (see Vulkan Support section above)
Check that shaderc is properly installed (required for shader compilation)
For NVIDIA GPUs: ensure the NVIDIA drivers (with Vulkan support) are installed and up to date; GPU acceleration uses Vulkan, not CUDA
For AMD/Intel: ensure appropriate Vulkan drivers are installed
If compilation fails with shader errors, try disabling GPU acceleration and using CPU mode instead
Monitor GPU usage with nvidia-smi (NVIDIA) or rocm-smi (AMD) while transcribing
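
The first item in this checklist can be verified from the command line. This is a rough sketch: it uses a plain text match rather than a TOML parser, so it does not confirm that the key sits under [backend_config]. `gpu_enabled_in` and the default `CONFIG` path are assumptions made here.

```shell
# gpu_enabled_in FILE: succeed if FILE contains an uncommented
# `gpu_enabled = true` line.
gpu_enabled_in() {
  grep -E '^[[:space:]]*gpu_enabled[[:space:]]*=[[:space:]]*true' "$1" >/dev/null
}

# Assumed location; point CONFIG at wherever your config.toml lives.
CONFIG=${CONFIG:-config.toml}

if [ -f "$CONFIG" ] && gpu_enabled_in "$CONFIG"; then
  echo "gpu_enabled = true is set in $CONFIG"
else
  echo "gpu_enabled = true not found in $CONFIG"
fi
```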

Model Conversion Issues

If you encounter issues with automatic model conversion:

For NixOS:

nix-shell model-conversion/shell.nix
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json

For other distributions:

pip install -U ctranslate2 huggingface_hub torch transformers
ct2-transformers-converter --model your-model --output_dir ~/.cache/whisper/your-model --copy_files preprocessor_config.json tokenizer.json

30-Second Transcription Truncation

If you experience transcription cutoffs with recordings exactly 30 seconds long, this is due to Whisper's architectural limitations:

Root Cause: Whisper models process audio in 30-second windows with a 448 token limit. Dense speech can exhaust this limit before the full 30 seconds are transcribed.

Solutions:

Keep recordings under 30 seconds (simplest): For manual mode, try to keep your recordings around 25 seconds or less to avoid this boundary entirely.
Adjust chunk settings (recommended):

[manual_mode_config]
chunk_duration_seconds = 20.0    # Experiment with values between 15-25
chunk_overlap_seconds = 2.0      # Overlap helps prevent word cutoff

Switch to CTranslate2 backend:

[backend_config]
backend = "ctranslate2"

Try different chunk_duration_seconds values to find what works best for your speech patterns and content density.
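
To get a feel for how the two chunk settings interact, the sketch below estimates how many chunks a recording would be split into. It assumes a simple sliding window whose step is the chunk duration minus the overlap; how Sonori actually splits audio is not documented here, so treat the numbers as illustrative. `chunk_count` is a hypothetical helper and works in whole seconds.

```shell
# chunk_count TOTAL DURATION OVERLAP: number of overlapping windows needed
# to cover TOTAL seconds, assuming each window starts DURATION - OVERLAP
# seconds after the previous one.
chunk_count() {
  total=$1
  dur=$2
  overlap=$3
  stride=$((dur - overlap))
  if [ "$stride" -le 0 ]; then
    echo "overlap must be smaller than duration" >&2
    return 1
  fi
  if [ "$total" -le "$dur" ]; then
    echo 1
  else
    # Ceiling division of the remaining audio by the stride, plus the first window.
    echo $(( (total - dur + stride - 1) / stride + 1 ))
  fi
}

chunk_count 45 20 2   # prints 3
```

With the example settings, a 45-second recording would be covered by windows starting at 0, 18, and 36 seconds, each well under the 30-second boundary.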

Platform Support

Supported:

Linux x86_64 (64-bit Intel/AMD)
Linux aarch64 (64-bit ARM)

Tested on:

NixOS with KDE Plasma/KWin (Wayland)
Other major Linux distributions should work with proper dependencies

Not supported:

Windows
macOS
32-bit architectures

Note: While primarily developed and tested on NixOS, Sonori should work on other Linux distributions with the proper dependencies installed. Feedback and testing on other distros are welcome!

Credits

License

This project is licensed under the MIT License. See the LICENSE file for details.
