A modular Human-Like TTS Model that generates high-quality speech from text input.
- 450M Parameters: optimized for edge devices and affordable servers.
- High-Quality Speech: 22 kHz sample rate, 0.6 kbps compression.
```bash
# Core dependencies
pip install torch librosa soundfile numpy huggingface_hub
pip install "nemo_toolkit[tts]"

# CRITICAL: a custom transformers build is required for the "lfm2" model type
pip install -U "git+https://github.com/huggingface/transformers.git"

# Optional: for the web interface
pip install fastapi uvicorn
```

```bash
# Generate audio with the default sample text
python basic/main.py

# Generate audio with custom text
python basic/main.py --prompt "Hello world! My name is Kani, I'm a speech generation model!"
```

This will:
- Load the TTS model
- Generate speech from the provided text (or built-in sample text if no prompt given)
- Save the audio as `generated_audio_YYYYMMDD_HHMMSS.wav`
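The steps above can be sketched in a few lines. This is a hypothetical outline of what `basic/main.py` does, not the repo's actual code: `synthesize()` is a stub (a sine tone) standing in for the real model inference, while the timestamped-filename and WAV-writing logic match the behavior described above.

```python
# Hypothetical sketch of the basic/main.py flow; synthesize() is a
# placeholder (sine tone) for the real 450M-model inference step.
import math
import struct
import wave
from datetime import datetime

SAMPLE_RATE = 22050  # matches the model's 22 kHz output

def synthesize(text: str) -> list[float]:
    """Placeholder for the real TTS inference step."""
    n = SAMPLE_RATE  # one second of audio per call in this stub
    return [0.1 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]

def timestamped_name() -> str:
    return f"generated_audio_{datetime.now():%Y%m%d_%H%M%S}.wav"

def save_wav(samples: list[float], path: str) -> None:
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono
        f.setsampwidth(2)      # 16-bit PCM
        f.setframerate(SAMPLE_RATE)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

path = timestamped_name()
save_wav(synthesize("Hello world!"), path)
```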
For a browser-based interface with real-time audio playback:
```bash
# Start the FastAPI server
python fastapi_example/server.py

# Open fastapi_example/client.html in your web browser
# Server runs on http://localhost:8000
```

The web interface provides:
- Interactive text input with example prompts
- Parameter adjustment (temperature, max tokens)
- Real-time audio generation and playback
- Download functionality for generated audio
- Server health monitoring
Default configuration:
- Model: https://huggingface.co/nineninesix/kani-tts-450m-0.1-pt
- Sample rate: 22,050 Hz
- Generation: 1,200 max tokens, temperature 1.4
Choose a different model for specific voice characteristics:
- Base model (default): `nineninesix/kani-tts-450m-0.1-pt` (generates random voices)
- Female voice: `nineninesix/kani-tts-450m-0.2-ft`
- Male voice: `nineninesix/kani-tts-450m-0.1-ft`

To use a different model, modify the `ModelConfig` class in `config.py`.
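One plausible shape for that configuration, shown here as a sketch: the field names are assumptions inferred from the defaults listed above, not the actual contents of `config.py`.

```python
# Hypothetical sketch of ModelConfig; field names are assumptions,
# default values are the ones listed in this README.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model_name: str = "nineninesix/kani-tts-450m-0.1-pt"  # base model, random voices
    sample_rate: int = 22050
    max_tokens: int = 1200
    temperature: float = 1.4

# Switching to the fine-tuned female voice would only change the model name:
female_cfg = ModelConfig(model_name="nineninesix/kani-tts-450m-0.2-ft")
```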
| Text | Audio |
|---|---|
| I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED. | Play |
| What do we say to the god of death? Not today! | Play |
| What do you call a lawyer with an IQ of 60? Your honor | Play |
| You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? I make you laugh, I'm here to fucking amuse you? | Play |
The system uses a layered architecture with clear separation of concerns:
- Configuration Layer: Centralized settings for models and audio processing
- Token Management: Handles special tokens for speech/text boundaries
- Audio Processing: Strategy pattern for different codec implementations
- Model Inference: Text-to-token generation with the LLM
- Audio Extraction: Validates and processes audio codes from token sequences
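The strategy pattern for the audio-processing layer can be sketched as follows. The class and method names here are illustrative assumptions, not the repo's actual API; the point is that inference code depends only on a codec interface, so codec implementations can be swapped without touching the pipeline.

```python
# Illustrative sketch of the codec strategy pattern; names are
# assumptions, and NanoCodec.decode stubs out the real NeMo decoder.
from abc import ABC, abstractmethod

class AudioCodec(ABC):
    @abstractmethod
    def decode(self, codes: list[int]) -> list[float]:
        """Turn audio token codes into PCM samples."""

class NanoCodec(AudioCodec):
    def decode(self, codes: list[int]) -> list[float]:
        # A real implementation would call the neural codec decoder.
        return [c / 100.0 for c in codes]

def render(codec: AudioCodec, codes: list[int]) -> list[float]:
    # The pipeline depends only on the AudioCodec interface.
    return codec.decode(codes)
```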
- NVIDIA GeForce RTX 5080
- Driver Version: 570.169
- CUDA Version: 12.8
- 16GB GPU memory
- Python: 3.12
- Transformers: 4.57.0.dev0
Generating 15 seconds of audio takes ~1 second and ~2 GB of GPU VRAM.
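That figure corresponds to a real-time factor of roughly 15x. A minimal harness to measure it yourself, where `generate` stands in for the actual model call:

```python
# Tiny benchmark helper: real-time factor = audio seconds produced
# per wall-clock second spent generating (>1 means faster than realtime).
import time

def benchmark(generate, text: str, audio_seconds: float) -> float:
    start = time.perf_counter()
    generate(text)
    elapsed = time.perf_counter() - start
    return audio_seconds / max(elapsed, 1e-9)  # guard against zero elapsed
```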
