Summary
Voicebox currently supports 23 languages across all engines, but Croatian, like most South Slavic languages, is not yet covered. The closest available options are Polish (Chatterbox Multilingual) and Russian (Qwen3), and neither produces acceptable phonetics for Croatian.
Why it matters
Croatian has ~5M native speakers and there is currently no production-grade local-first Croatian TTS in any open-source desktop app. Realistic use cases: audiobook production, accessibility for visually impaired users, indie game dialogue, language-learning content, podcast tooling. Voicebox would be the first.
Existing Croatian resources (to save the maintainer/contributor work)
No one needs to train a Croatian TTS from scratch: usable groundwork already exists. Three realistic integration paths:
Option A — wrap an existing Croatian checkpoint
- nikolab/speecht5_tts_hr (https://huggingface.co/nikolab/speecht5_tts_hr)
  - Fine-tuned Microsoft SpeechT5, MIT license
  - Trained on VoxPopuli HR: 43 hours, 83 speakers, ~250k tokens
  - Multi-speaker (83 trained voices) but not zero-shot cloning, so it fits Voicebox's existing voice_type = "preset" profile flow (same pattern Kokoro uses, see backend/database/models.py)
  - Known limitation: struggles with sequences longer than ~20 words; Voicebox's existing auto-chunker should handle this transparently
  - ~0.1B params / ~400 MB download
- facebook/mms-tts-hrv (Meta MMS; hrv is the ISO 639-3 code for Croatian)
  - VITS architecture, single-speaker baseline quality
  - License compatibility to be verified by maintainers before integration
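To make the ~20-word limitation concrete, here is a minimal sketch of the kind of chunking that would keep each synthesis call short. It is not Voicebox's actual auto-chunker (I haven't read that code); the function name chunk_text and the max_words parameter are made up for illustration.

```python
import re


def chunk_text(text: str, max_words: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Hypothetical helper, not Voicebox's real chunker. Sentence boundaries
    are preferred; a single over-long sentence is hard-wrapped.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        # A sentence that would overflow the current chunk starts a new one.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        # Hard-wrap a sentence that is itself longer than the limit.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then go to the engine separately and the audio would be concatenated. Hard-wrapping mid-sentence will hurt prosody, so a real implementation would probably prefer commas or conjunctions as secondary split points.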
Option B — fine-tune a multilingual cloning model on VoxPopuli HR
- Base: Coqui XTTS v2 (17 languages per the model card, Croatian NOT among them; checked against https://huggingface.co/coqui/XTTS-v2)
- Add-a-language fine-tune on the same 43 h VoxPopuli HR corpus that nikolab used → would give zero-shot cloning + Croatian in one model
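- Community precedent: https://github.com/ylacombe/finetune-hf-vits is a Hugging Face recipe for fine-tuning VITS/MMS. It applies to checkpoints such as facebook/mms-tts-hrv rather than to XTTS itself, so it is a precedent for the workflow, not a drop-in recipe for this option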
- This aligns with the existing roadmap entry in README.md → "More Models: XTTS, Bark, and other open-source voice models"
Option C — ship a "fuzzy" intermediate option
Temporary: let users pick Chatterbox Multilingual and type Croatian text under a new hr code that internally maps to Polish phonemes. Imperfect, but better than nothing. It could ship behind an "experimental" badge while Option A or B matures.
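A rough sketch of how the hr → pl mapping could be resolved, assuming the engine exposes a set of supported language codes. Everything here (LANGUAGE_FALLBACKS, ResolvedLanguage, resolve_language) is hypothetical naming for illustration, not existing Voicebox code.

```python
from dataclasses import dataclass

# Hypothetical fallback table: requested code -> code the engine really supports.
LANGUAGE_FALLBACKS = {"hr": "pl"}


@dataclass(frozen=True)
class ResolvedLanguage:
    requested: str      # what the user picked, e.g. "hr"
    effective: str      # what is actually passed to the engine, e.g. "pl"
    experimental: bool  # True when a fallback was used (drives the UI badge)


def resolve_language(requested: str, supported: set[str]) -> ResolvedLanguage:
    """Return the language code to send to the engine, using a fallback if needed."""
    if requested in supported:
        return ResolvedLanguage(requested, requested, False)
    fallback = LANGUAGE_FALLBACKS.get(requested)
    if fallback in supported:
        return ResolvedLanguage(requested, fallback, True)
    raise ValueError(f"Language {requested!r} is not supported and has no usable fallback")
```

Keeping the requested code separate from the effective one means the UI can still show "Croatian (experimental)" and the mapping can be removed in one place once a real Croatian engine lands.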
Dataset availability
- VoxPopuli HR: 43 h, publicly downloadable (already used by the nikolab model above)
- Mozilla Common Voice Croatian: effectively empty in the current release (~0.01 h, 1 speaker). Not viable as a training source.
- Total realistic pool: ~40–60 h if combined with smaller academic corpora. Enough for fine-tuning, not for from-scratch SOTA.
Offer to help
I'm a native Croatian speaker and a Voicebox user. I'm not a Python/Rust developer, but I'm happy to contribute concretely:
- Test all three options on native Croatian sentences including the harder phonemes (č, ć, š, ž, đ, dž, lj, nj) and common English loanwords ("pizza" → /pitsa/, "software" → /softver/) which multilingual models typically mispronounce
- Provide reference Croatian audio samples for QA across standard Shtokavian and one regional variant
- Record a small evaluation set (~30 min) of clean studio audio, released under CC-BY so it can live in the repo as a regression-test fixture
- Test pre-release builds on macOS and Windows
- Translate the UI to Croatian (app/src/i18n/locales/hr/) once an engine ships
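For the test sentences mentioned above, something like the following could check that an evaluation script actually covers the harder letters and digraphs. This is only a sketch: the names are mine, and it counts spellings, not sounds, so a "nj" that spans a morpheme boundary would be counted as the digraph.

```python
import unicodedata

# Digraphs come first so "dž" is matched before the plain "ž" inside it.
TARGETS = ["dž", "lj", "nj", "č", "ć", "š", "ž", "đ"]


def grapheme_coverage(sentences: list[str]) -> dict[str, int]:
    """Count how often each target grapheme occurs in the sentences (orthographic only)."""
    counts = {t: 0 for t in TARGETS}
    for sentence in sentences:
        # NFC makes sure "č" is one code point, not "c" plus a combining caron.
        text = unicodedata.normalize("NFC", sentence).lower()
        i = 0
        while i < len(text):
            for target in TARGETS:
                if text.startswith(target, i):
                    counts[target] += 1
                    i += len(target)
                    break
            else:
                i += 1
    return counts


def missing_graphemes(sentences: list[str]) -> list[str]:
    """Return the target graphemes that never occur in the sentences."""
    counts = grapheme_coverage(sentences)
    return [t for t in TARGETS if counts[t] == 0]
```

For example, missing_graphemes(["Čaša je u džepu.", "Đak voli žabu, kuću i školu.", "Ljubav i njiva."]) comes back empty, so those three sentences already touch every item on the list.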
Happy to coordinate over Discussions or this issue thread. Thanks for the great project!