Replies: 1 comment

> can i use your indo model
Hey everyone! 👋
I’ve trained several models so far, and honestly, F5-TTS is by far the best one I’ve worked with. I recently trained a new model called PapaRazi/Ijazah_Palsu_V2, which is an improved version of the earlier Ijazah_Palsu_V1.
The model was trained on around seventy thousand samples over the course of three days. It was trained primarily on Bahasa Indonesia (ninety-five percent), with a smaller portion in English (five percent). I've noticed that the English pronunciation could use some improvement, likely due to the limited English data; I believe increasing the English share to around fifteen percent would yield better results.
🧪 Training Settings
```json
{
  "exp_name": "F5TTS_v1_Base",
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 1700,
  "batch_size_type": "frame",
  "max_samples": 64,
  "grad_accumulation_steps": 1,
  "max_grad_norm": 1,
  "epochs": 34,
  "num_warmup_updates": 7000,
  "save_per_updates": 15000,
  "keep_last_n_checkpoints": 7,
  "last_per_updates": 15000,
  "finetune": true,
  "file_checkpoint_train": "",
  "tokenizer_type": "char",
  "tokenizer_file": "",
  "mixed_precision": "fp16",
  "logger": "tensorboard",
  "bnb_optimizer": false
}
```

📦 Dataset
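Since `batch_size_type` is `"frame"`, `batch_size_per_gpu` counts mel-spectrogram frames rather than clips. A minimal sketch of what 1700 frames means in audio seconds, assuming a 24 kHz sample rate and a mel hop length of 256 samples (both are assumptions based on common F5-TTS/Vocos defaults, not stated in this post; check your own vocoder config):

```python
# Rough estimate of audio seconds per update for frame-based batching.
# SAMPLE_RATE and HOP_LENGTH are assumed defaults, not taken from this post.
SAMPLE_RATE = 24_000  # Hz
HOP_LENGTH = 256      # samples per mel frame

def frames_to_seconds(n_frames: int) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE

batch_seconds = frames_to_seconds(1700)  # batch_size_per_gpu from the config
print(f"~{batch_seconds:.1f} s of audio per GPU per update")
```

Under those assumptions, each update sees roughly eighteen seconds of audio per GPU, which is useful to keep in mind when comparing this config to clip-based batch sizes.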
The dataset I used is called PapaRazi/id-tts-v2, which I personally collected and curated to better suit Indonesian speech synthesis.
To build and process the dataset, I used a tool I developed myself:
👉 github.com/adigayung/whisper-tools
This tool helped with splitting, normalization, cleanup, and preparing training-ready data.
The built-in dataset splitter from F5-TTS didn't quite fit my needs: it often produced segments that were either far too short or far too long (some over twelve minutes and one hundred megabytes, which isn't practical for training).
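The core of that kind of cleanup can be sketched in a few lines. This is not the actual whisper-tools code, just a hypothetical filter that drops transcription segments outside a sane duration window before they reach training:

```python
# Hypothetical duration filter for TTS training segments; a sketch of the
# idea, not the actual whisper-tools implementation.
def filter_segments(segments, min_s=1.0, max_s=30.0):
    """Keep only segments whose duration is suitable for TTS training."""
    return [
        seg for seg in segments
        if min_s <= seg["end"] - seg["start"] <= max_s
    ]

# Segments in the shape Whisper-style tools typically emit.
segments = [
    {"start": 0.0, "end": 0.4, "text": "Ya."},             # too short
    {"start": 0.4, "end": 6.2, "text": "Halo semuanya."},  # kept
    {"start": 6.2, "end": 800.0, "text": "(long ramble)"}, # far too long
]
print(filter_segments(segments))
```

A real pipeline would also re-split the over-long segments on silence rather than discarding them outright, but the filtering step alone already removes the worst offenders.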
🔊 Voice Samples
Here are some sample outputs from the model:
📌 Natural narration example:
"Suatu hari nanti, suara ini mungkin tidak bisa dibedakan lagi dari suara manusia asli."
🎧 Listen here
📌 Simple number reading works quite well:
"Serius?! Tiket konsernya habis dalam waktu tiga menit?!"
🎧 Listen here
📌 But for large numbers (in the millions), the output is still quite inaccurate and sometimes hallucinates, probably because I didn't include enough examples of that kind in the dataset.
"Masa cuma buat beli kursi kantor aja harus bayar Rp 2.500.000,-?! Gila sih itu!"
🎧 Listen here
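A common workaround for the large-number problem is to normalize digits into words before synthesis, so the model never has to read raw numerals. Here is a minimal sketch of an Indonesian number verbalizer; this is my own illustration, not part of the model or the dataset pipeline, and a production normalizer would also handle currency symbols, decimals, and dates:

```python
# Minimal Indonesian number-to-words sketch for TTS text normalization.
# Illustrative only; covers non-negative integers below one billion.
SMALL = ["", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan", "sepuluh", "sebelas"]

def terbilang(n: int) -> str:
    """Spell out a non-negative integer in Indonesian words."""
    if n < 12:
        return SMALL[n]
    if n < 20:
        return terbilang(n - 10) + " belas"
    if n < 100:
        rest = " " + terbilang(n % 10) if n % 10 else ""
        return terbilang(n // 10) + " puluh" + rest
    if n < 200:
        return "seratus" + (" " + terbilang(n - 100) if n > 100 else "")
    if n < 1000:
        rest = " " + terbilang(n % 100) if n % 100 else ""
        return terbilang(n // 100) + " ratus" + rest
    if n < 2000:
        return "seribu" + (" " + terbilang(n - 1000) if n > 1000 else "")
    if n < 1_000_000:
        rest = " " + terbilang(n % 1000) if n % 1000 else ""
        return terbilang(n // 1000) + " ribu" + rest
    rest = " " + terbilang(n % 1_000_000) if n % 1_000_000 else ""
    return terbilang(n // 1_000_000) + " juta" + rest

print(terbilang(2_500_000))  # dua juta lima ratus ribu
```

Running the sentence above through such a normalizer would turn "Rp 2.500.000" into "dua juta lima ratus ribu rupiah" before it ever reaches the TTS model, sidestepping the hallucination entirely.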
If you’d like to try out the model, feel free — it’s free for non-commercial use, and I’d love to hear your thoughts, suggestions, or improvements. 😊