Releases: devnen/Dia-TTS-Server
v1.4.0: Chunking, Predefined Voices, Enhanced Cloning & Performance
This release introduces major features for handling long text, providing consistent voices, and improving performance, along with significant enhancements to voice cloning and configuration management.
🚀 New Features:
- Large Text Processing (Chunking): Automatically splits long text inputs based on sentence structure and speaker tags ([S1]/[S2]), enabling generation for documents of any length. Configurable via UI/API (split_text, chunk_size).
- Predefined Voices: Added 43 ready-to-use, curated synthetic voices located in the ./voices directory. Selectable in the UI for consistent, high-quality output without cloning setup. Server automatically handles required transcripts.
- Enhanced Voice Cloning: Improved backend pipeline with automatic reference audio processing (mono conversion, resampling, truncation) and transcript handling (prioritizes local .txt file over experimental Whisper fallback). Backend now handles transcript prepending automatically.
- Whisper Integration: Added openai-whisper as an experimental fallback for automatic transcript generation during cloning if a .txt file is missing.
- Generation Seed: Added seed parameter (UI/API) to influence generation. Using a fixed integer seed with Predefined/Cloned voices enhances consistency across chunks or separate generations.
- API Enhancements:
  - /tts endpoint now supports transcript (for explicit clone transcript), split_text, chunk_size, and seed.
  - /v1/audio/speech endpoint now supports seed.
- Terminal Progress: Long text generation using chunking now displays a tqdm progress bar in the terminal.
- UI Configuration Management: Added UI section to view/edit config.yaml settings and save generation defaults.
- Configuration System: Migrated to config.yaml for primary runtime configuration. The .env file is now used mainly for initial seeding or resetting defaults via the UI.
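
To illustrate the chunking idea described above, here is a minimal sketch of splitting text on speaker tags and sentence boundaries. It is not the server's actual implementation: the function name split_into_chunks, the regular expressions, and the rule of measuring chunk_size in characters are all assumptions made for this example.

```python
import re

# Assumption: speaker tags look exactly like [S1] or [S2].
SPEAKER_TAG = re.compile(r"(\[S[12]\])")
# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text, chunk_size=120):
    """Group sentences into chunks of roughly chunk_size characters.

    Hypothetical helper: each chunk is prefixed with its speaker tag and
    never mixes two speakers. A single sentence longer than chunk_size
    is kept whole rather than cut mid-sentence.
    """
    # With a capturing group, re.split returns
    # [text_before_first_tag, tag, body, tag, body, ...].
    parts = SPEAKER_TAG.split(text)
    segments = []
    if parts[0].strip():
        segments.append(("", parts[0]))  # untagged leading text
    for i in range(1, len(parts), 2):
        segments.append((parts[i], parts[i + 1]))

    chunks = []
    for tag, body in segments:
        current = ""
        for sentence in SENTENCE_END.split(body.strip()):
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > chunk_size:
                # Adding this sentence would overflow: close the chunk.
                chunks.append(f"{tag} {current}".strip())
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(f"{tag} {current}".strip())
    return chunks
```

Repeating the speaker tag at the start of every chunk means each chunk can be generated independently. Combined with a fixed seed, that is one plausible way to keep a voice consistent across chunks.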
🔧 Fixes & Enhancements:
- VRAM Usage Fixed & Optimized: Resolved memory leaks and significantly reduced VRAM usage (approx. 14GB+ down to ~7GB) through code optimizations and BF16 default.
- Performance: Significant speed improvements reported (approaching 95% real-time on tested hardware: AMD Ryzen 9 9950X3D + NVIDIA RTX 3090).
- Audio Post-Processing: Automatically applies silence trimming, internal silence fixing, and unvoiced segment removal (using Parselmouth) to improve audio quality and remove artifacts.
- UI State Persistence: Web UI now saves/restores settings (text, mode, files, parameters) in config.yaml.
- UI Improvements: Better loading indicators, refined chunking controls, seed input, theme toggle, dynamic preset loading from ui/presets.yaml, warning modals.
- Cloning Workflow: Backend now handles transcript prepending automatically; UI workflow simplified.
- Dependency Management: Added tqdm, PyYAML, openai-whisper, parselmouth.
- Code Refactoring: Aligned internal engine code with refactored dia library structure; updated config.py to use YamlConfigManager.
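
The chunking and seed controls are also exposed through the /tts endpoint. The sketch below shows one way a client might send them. Only the parameter names transcript, split_text, chunk_size, and seed come from these notes. The base URL, the "text" field name, the JSON POST body, the default values, and the assumption that the response body is raw audio are all guesses to check against the server's API documentation.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with your server's host and port.
BASE_URL = "http://localhost:8000"


def build_tts_payload(text, split_text=True, chunk_size=120, seed=None, transcript=None):
    """Assemble a request body for /tts (hypothetical helper).

    The "text" key and the default values are assumptions; the other keys
    are the parameter names listed in the release notes.
    """
    payload = {"text": text, "split_text": split_text, "chunk_size": chunk_size}
    if seed is not None:
        payload["seed"] = seed  # fixed integer seed for more consistent output
    if transcript is not None:
        payload["transcript"] = transcript  # explicit clone transcript
    return payload


def request_speech(payload, base_url=BASE_URL):
    """POST the payload as JSON and return the raw response bytes.

    Assumes the endpoint accepts a JSON body and replies with audio data.
    """
    req = urllib.request.Request(
        f"{base_url}/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For example, request_speech(build_tts_payload("[S1] Hello. [S2] Hi there.", seed=42)) would send a chunked request with a fixed seed, provided a server is running at the assumed address.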
Note: The configuration system has changed significantly. Settings are now primarily managed via config.yaml. See the documentation for details.
v1.0.0: Initial release
documentation, docker changes