A modular Human-Like TTS Model that generates high-quality speech from text input.
- 450M Parameters: optimized for edge devices and affordable servers.
- High-Quality Speech: 22 kHz sample rate, 0.6 kbps compression.
```bash
# Core dependencies
pip install torch librosa soundfile numpy huggingface_hub
pip install "nemo_toolkit[tts]"

# CRITICAL: a custom transformers build is required for the "lfm2" model type
pip install -U "git+https://github.com/huggingface/transformers.git"

# Optional: for the web interface
pip install fastapi uvicorn
```

```bash
# Generate audio with the default sample text
python basic/main.py

# Generate audio with custom text
python basic/main.py --prompt "Hello world! My name is Kani, I'm a speech generation model!"
```

This will:
- Load the TTS model
- Generate speech from the provided text (or built-in sample text if no prompt given)
- Save the audio as `generated_audio_YYYYMMDD_HHMMSS.wav`
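The steps above can be sketched in a few lines. This is a hypothetical outline of what `basic/main.py` does, not the repo's actual code: `synthesize()` is a stub (a sine tone) standing in for the real model inference, while the timestamped-filename and WAV-writing logic match the behavior described above.

```python
# Hypothetical sketch of the basic/main.py flow; synthesize() is a
# placeholder (sine tone) for the real 450M-model inference step.
import math
import struct
import wave
from datetime import datetime

SAMPLE_RATE = 22050  # matches the model's 22 kHz output

def synthesize(text: str) -> list[float]:
    """Placeholder for the real TTS inference step."""
    n = SAMPLE_RATE  # one second of audio per call in this stub
    return [0.1 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]

def timestamped_name() -> str:
    return f"generated_audio_{datetime.now():%Y%m%d_%H%M%S}.wav"

def save_wav(samples: list[float], path: str) -> None:
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit PCM
        f.setframerate(SAMPLE_RATE)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

path = timestamped_name()
save_wav(synthesize("Hello world!"), path)
```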
For a browser-based interface with real-time audio playback:
```bash
# Start the FastAPI server
python fastapi_example/server.py

# Open fastapi_example/client.html in your web browser
# Server runs on http://localhost:8000
```

The web interface provides:
- Interactive text input with example prompts
- Parameter adjustment (temperature, max tokens)
- Real-time audio generation and playback
- Download functionality for generated audio
- Server health monitoring
Default configuration:
- Model: https://huggingface.co/nineninesix/kani-tts-450m-0.1-pt
- Sample rate: 22,050 Hz
- Generation: 1,200 max tokens, temperature 1.4
Choose a different model for specific voice characteristics:
- Base model (default): `nineninesix/kani-tts-450m-0.1-pt` (generates random voices)
- Female voice: `nineninesix/kani-tts-450m-0.2-ft`
- Male voice: `nineninesix/kani-tts-450m-0.1-ft`

To use a different model, modify the `ModelConfig` class in `config.py`.
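One plausible shape for that configuration, shown here as a sketch: the field names are assumptions inferred from the defaults listed above, not the actual contents of `config.py`.

```python
# Hypothetical sketch of ModelConfig; field names are assumptions,
# default values are the ones listed in this README.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model_name: str = "nineninesix/kani-tts-450m-0.1-pt"  # base model, random voices
    sample_rate: int = 22050
    max_tokens: int = 1200
    temperature: float = 1.4

# Switching to the fine-tuned female voice would only change the model name:
female_cfg = ModelConfig(model_name="nineninesix/kani-tts-450m-0.2-ft")
```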
| Text | Audio |
|---|---|
| I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED. | Play |
| What do we say to the god of death? Not today! | Play |
| What do you call a lawyer with an IQ of 60? Your honor | Play |
| You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? I make you laugh, I'm here to fucking amuse you? | Play |
The system uses a layered architecture with clear separation of concerns:
- Configuration Layer: Centralized settings for models and audio processing
- Token Management: Handles special tokens for speech/text boundaries
- Audio Processing: Strategy pattern for different codec implementations
- Model Inference: Text-to-token generation with the LLM
- Audio Extraction: Validates and processes audio codes from token sequences
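The strategy pattern for the audio-processing layer can be sketched as follows. The class and method names here are illustrative assumptions, not the repo's actual API; the point is that inference code depends only on a codec interface, so codec implementations can be swapped without touching the pipeline.

```python
# Illustrative sketch of the codec strategy pattern; names are
# assumptions, and NanoCodec.decode stubs out the real NeMo decoder.
from abc import ABC, abstractmethod

class AudioCodec(ABC):
    @abstractmethod
    def decode(self, codes: list[int]) -> list[float]:
        """Turn audio token codes into PCM samples."""

class NanoCodec(AudioCodec):
    def decode(self, codes: list[int]) -> list[float]:
        # A real implementation would call the neural codec decoder.
        return [c / 100.0 for c in codes]

def render(codec: AudioCodec, codes: list[int]) -> list[float]:
    # The pipeline depends only on the AudioCodec interface.
    return codec.decode(codes)
```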
- NVIDIA GeForce RTX 5080
- Driver Version: 570.169
- CUDA Version: 12.8
- 16GB GPU memory
- Python: 3.12
- Transformers: 4.57.0.dev0
Generating 15 seconds of audio takes ~1 second and ~2 GB of GPU VRAM.
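That figure corresponds to a real-time factor of roughly 15x. A minimal harness to measure it yourself, where `generate` stands in for the actual model call:

```python
# Tiny benchmark helper: real-time factor = audio seconds produced
# per wall-clock second spent generating (>1 means faster than realtime).
import time

def benchmark(generate, text: str, audio_seconds: float) -> float:
    start = time.perf_counter()
    generate(text)
    elapsed = time.perf_counter() - start
    return audio_seconds / max(elapsed, 1e-9)  # guard against zero elapsed
```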
