Replies: 1 comment

> can i use your indo model
Hey everyone! 👋
I’ve trained several models so far, and honestly, F5-TTS is by far the best one I’ve worked with. I recently trained a new model called PapaRazi/Ijazah_Palsu_V2, which is an improved version of the earlier Ijazah_Palsu_V1.
The model was trained on around seventy thousand samples over the course of three days. It was trained primarily on Bahasa Indonesia (ninety-five percent), with a smaller portion in English (five percent). I've noticed that the English pronunciation could use some improvement, likely due to the limited English data; I believe increasing the English share to around fifteen percent would yield better results.
🧪 Training Settings
```json
{
  "exp_name": "F5TTS_v1_Base",
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 1700,
  "batch_size_type": "frame",
  "max_samples": 64,
  "grad_accumulation_steps": 1,
  "max_grad_norm": 1,
  "epochs": 34,
  "num_warmup_updates": 7000,
  "save_per_updates": 15000,
  "keep_last_n_checkpoints": 7,
  "last_per_updates": 15000,
  "finetune": true,
  "file_checkpoint_train": "",
  "tokenizer_type": "char",
  "tokenizer_file": "",
  "mixed_precision": "fp16",
  "logger": "tensorboard",
  "bnb_optimizer": false
}
```

📦 Dataset
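Since `batch_size_type` is `"frame"`, `batch_size_per_gpu` counts mel-spectrogram frames rather than clips. A minimal sketch of what 1700 frames means in audio seconds, assuming a 24 kHz sample rate and a mel hop length of 256 samples (both are assumptions based on common F5-TTS/Vocos defaults, not stated in this post; check your own vocoder config):

```python
# Rough estimate of audio seconds per update for frame-based batching.
# SAMPLE_RATE and HOP_LENGTH are assumed defaults, not taken from this post.
SAMPLE_RATE = 24_000  # Hz
HOP_LENGTH = 256      # samples per mel frame

def frames_to_seconds(n_frames: int) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE

batch_seconds = frames_to_seconds(1700)  # batch_size_per_gpu from the config
print(f"~{batch_seconds:.1f} s of audio per GPU per update")
```

Under those assumptions, each update sees roughly eighteen seconds of audio per GPU, which is useful to keep in mind when comparing this config to clip-based batch sizes.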
The dataset I used is called PapaRazi/id-tts-v2, which I personally collected and curated to better suit Indonesian speech synthesis.
To build and process the dataset, I used a tool I developed myself:
👉 github.com/adigayung/whisper-tools
This tool helped with splitting, normalization, cleanup, and preparing training-ready data.
The built-in dataset splitter from F5-TTS didn't quite fit my needs: it often produced segments that were either far too short or far too long (some over twelve minutes and one hundred megabytes, which isn't practical for training).
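The core of that kind of cleanup can be sketched in a few lines. This is not the actual whisper-tools code, just a hypothetical filter that drops transcription segments outside a sane duration window before they reach training:

```python
# Hypothetical duration filter for TTS training segments; a sketch of the
# idea, not the actual whisper-tools implementation.
def filter_segments(segments, min_s=1.0, max_s=30.0):
    """Keep only segments whose duration is suitable for TTS training."""
    return [
        seg for seg in segments
        if min_s <= seg["end"] - seg["start"] <= max_s
    ]

# Segments in the shape Whisper-style tools typically emit.
segments = [
    {"start": 0.0, "end": 0.4, "text": "Ya."},             # too short
    {"start": 0.4, "end": 6.2, "text": "Halo semuanya."},  # kept
    {"start": 6.2, "end": 800.0, "text": "(long ramble)"}, # far too long
]
print(filter_segments(segments))
```

A real pipeline would also re-split the over-long segments on silence rather than discarding them outright, but the filtering step alone already removes the worst offenders.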
🔊 Voice Samples
Here are some sample outputs from the model:
📌 Natural narration example:
"Suatu hari nanti, suara ini mungkin tidak bisa dibedakan lagi dari suara manusia asli."
🎧 Listen here
📌 Simple number reading works quite well:
"Serius?! Tiket konsernya habis dalam waktu tiga menit?!"
🎧 Listen here
📌 But for large numbers (in the millions), the output is still quite inaccurate and sometimes hallucinates, probably because I didn't include enough examples of that kind in the dataset.
"Masa cuma buat beli kursi kantor aja harus bayar Rp 2.500.000,-?! Gila sih itu!"
🎧 Listen here
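A common workaround for the large-number problem is to normalize digits into words before synthesis, so the model never has to read raw numerals. Here is a minimal sketch of an Indonesian number verbalizer; this is my own illustration, not part of the model or the dataset pipeline, and a production normalizer would also handle currency symbols, decimals, and dates:

```python
# Minimal Indonesian number-to-words sketch for TTS text normalization.
# Illustrative only; covers non-negative integers below one billion.
SMALL = ["", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan", "sepuluh", "sebelas"]

def terbilang(n: int) -> str:
    """Spell out a non-negative integer in Indonesian words."""
    if n < 12:
        return SMALL[n]
    if n < 20:
        return terbilang(n - 10) + " belas"
    if n < 100:
        rest = " " + terbilang(n % 10) if n % 10 else ""
        return terbilang(n // 10) + " puluh" + rest
    if n < 200:
        return "seratus" + (" " + terbilang(n - 100) if n > 100 else "")
    if n < 1000:
        rest = " " + terbilang(n % 100) if n % 100 else ""
        return terbilang(n // 100) + " ratus" + rest
    if n < 2000:
        return "seribu" + (" " + terbilang(n - 1000) if n > 1000 else "")
    if n < 1_000_000:
        rest = " " + terbilang(n % 1000) if n % 1000 else ""
        return terbilang(n // 1000) + " ribu" + rest
    rest = " " + terbilang(n % 1_000_000) if n % 1_000_000 else ""
    return terbilang(n // 1_000_000) + " juta" + rest

print(terbilang(2_500_000))  # dua juta lima ratus ribu
```

Running the sentence above through such a normalizer would turn "Rp 2.500.000" into "dua juta lima ratus ribu rupiah" before it ever reaches the TTS model, sidestepping the hallucination entirely.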
If you’d like to try out the model, feel free — it’s free for non-commercial use, and I’d love to hear your thoughts, suggestions, or improvements. 😊