Releases: devnen/Dia-TTS-Server

v1.4.0: Chunking, Predefined Voices, Enhanced Cloning & Performance

29 Apr 02:36

This release introduces major features for handling long text, providing consistent voices, and improving performance, along with significant enhancements to voice cloning and configuration management.

🚀 New Features:

  • Large Text Processing (Chunking): Automatically splits long text inputs based on sentence structure and speaker tags ([S1]/[S2]), enabling generation for documents of any length. Configurable via UI/API (split_text, chunk_size).
  • Predefined Voices: Added 43 ready-to-use, curated synthetic voices located in the ./voices directory. Selectable in the UI for consistent, high-quality output without cloning setup. Server automatically handles required transcripts.
  • Enhanced Voice Cloning: Improved backend pipeline that automatically processes the reference audio (mono conversion, resampling, truncation) and resolves the transcript (a local .txt file takes priority over the experimental Whisper fallback). The backend now prepends the transcript to the input automatically.
  • Whisper Integration: Added openai-whisper as an experimental fallback for automatic transcript generation during cloning if a .txt file is missing.
  • Generation Seed: Added a seed parameter (UI/API) that controls the randomness of generation. Using a fixed integer seed with Predefined/Cloned voices improves consistency across chunks and across separate generations.
  • API Enhancements:
    • /tts endpoint now supports transcript (for explicit clone transcript), split_text, chunk_size, and seed.
    • /v1/audio/speech endpoint now supports seed.
  • Terminal Progress: Long text generation using chunking now displays a tqdm progress bar in the terminal.
  • UI Configuration Management: Added UI section to view/edit config.yaml settings and save generation defaults.
  • Configuration System: Migrated to config.yaml for primary runtime configuration. .env is now used mainly for initial seeding or resetting defaults via the UI.
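To make the chunking behavior more concrete, here is a minimal sketch of splitting on speaker tags and sentence boundaries. It is not the server's actual implementation: the function name split_into_chunks, the regular expressions, the character-based reading of chunk_size, and the choice to treat untagged leading text as [S1] are all assumptions made for illustration.

```python
import re

# Speaker tags as described in the release notes: [S1] and [S2].
SPEAKER_TAG = re.compile(r"(\[S[12]\])")
# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text: str, chunk_size: int = 120) -> list[str]:
    """Split tagged text into chunks of roughly chunk_size characters.

    Hypothetical helper: every chunk starts with the speaker tag that
    applies to it, and sentences are never cut in the middle. A single
    sentence longer than chunk_size is kept whole.
    """
    # re.split with a capturing group yields:
    # [prefix, tag1, speech1, tag2, speech2, ...]
    parts = SPEAKER_TAG.split(text)
    turns = []
    prefix = parts[0].strip()
    if prefix:
        # Assumption: untagged leading text is attributed to [S1].
        turns.append(("[S1]", prefix))
    for tag, speech in zip(parts[1::2], parts[2::2]):
        if speech.strip():
            turns.append((tag, speech.strip()))

    chunks = []
    for tag, speech in turns:
        current = tag
        for sentence in SENTENCE_END.split(speech):
            candidate = f"{current} {sentence}"
            if len(candidate) > chunk_size and current != tag:
                # Adding this sentence would overflow: close the chunk
                # and start a new one with the same speaker tag.
                chunks.append(current)
                current = f"{tag} {sentence}"
            else:
                current = candidate
        if current != tag:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = "[S1] Hello there. How are you today? [S2] I am fine. Thanks for asking."
    for chunk in split_into_chunks(demo, chunk_size=30):
        print(chunk)
```

In this sketch a chunk never spans two speaker turns, which keeps each chunk attributable to one voice. Generating every chunk with the same fixed seed is the kind of usage the seed parameter above is meant to support.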

🔧 Fixes & Enhancements:

  • VRAM Usage Fixed & Optimized: Resolved memory leaks and roughly halved VRAM usage (from approx. 14GB+ down to ~7GB) through code optimizations and by defaulting to BF16 precision.
  • Performance: Significant speed improvements reported (approaching 95% real-time on tested hardware: AMD Ryzen 9 9950X3D + NVIDIA RTX 3090).
  • Audio Post-Processing: Automatically applies silence trimming, internal silence fixing, and unvoiced segment removal (using Parselmouth) to improve audio quality and remove artifacts.
  • UI State Persistence: Web UI now saves/restores settings (text, mode, files, parameters) in config.yaml.
  • UI Improvements: Better loading indicators, refined chunking controls, seed input, theme toggle, dynamic preset loading from ui/presets.yaml, warning modals.
  • Cloning Workflow: Backend now handles transcript prepending automatically; UI workflow simplified.
  • Dependency Management: Added tqdm, PyYAML, openai-whisper, parselmouth.
  • Code Refactoring: Aligned internal engine code with refactored dia library structure; updated config.py to use YamlConfigManager.
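The audio post-processing steps listed above can be illustrated with a simplified sketch. The release notes say the server relies on Parselmouth for this; the code below does not use Parselmouth and instead swaps in a plain amplitude threshold on a list of samples, so it only approximates the silence trimming and internal silence fixing steps and does not cover unvoiced segment removal. The function names, the threshold value, and max_run are hypothetical.

```python
def trim_silence(samples: list[float], threshold: float = 0.01) -> list[float]:
    """Drop leading and trailing samples whose amplitude is at or below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        # Nothing above the threshold: the whole clip counts as silence.
        return []
    return samples[loud[0]:loud[-1] + 1]


def cap_internal_silence(
    samples: list[float], threshold: float = 0.01, max_run: int = 3
) -> list[float]:
    """Shorten every run of quiet samples to at most max_run samples."""
    out = []
    run = 0  # length of the current run of quiet samples
    for s in samples:
        if abs(s) <= threshold:
            run += 1
            if run > max_run:
                continue  # skip the excess part of a long pause
        else:
            run = 0
        out.append(s)
    return out


if __name__ == "__main__":
    clip = [0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, -0.4, 0.0]
    trimmed = trim_silence(clip)
    print(cap_internal_silence(trimmed, max_run=2))
```

A production pipeline would work on real audio buffers and measure pauses in milliseconds rather than sample counts; the sketch only shows the order of operations (trim the edges first, then cap long pauses inside the clip).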

Note: The configuration system has changed significantly. Settings are now primarily managed via config.yaml. See the documentation for details.

v1.0.0: Initial release

28 Apr 19:49

Documentation and Docker changes.