To train a VibeVoice 1.5B LoRA, a machine with at least 16 GB of VRAM is recommended.
To train a VibeVoice 7B LoRA, a machine with at least 48 GB of VRAM is recommended.
Keep in mind that longer audio clips increase VRAM requirements.
It is recommended to install this in a fresh environment. Specifically, the Docker image `runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04` has been tested and works.
Transformers version 4.51.3 is known to work; other versions have shown errors related to the Qwen2 architecture.
```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3

# Optional: log in to Weights & Biases for experiment tracking
wandb login

# Optional: cache Hugging Face downloads on a persistent volume
export HF_HOME=/workspace/hf_models
```
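After installation, a quick sanity check (assuming a standard Python environment) confirms that the pinned Transformers version is the one actually active:

```python
# Sanity check: the fine-tuning code expects transformers 4.51.3.
import transformers

print(transformers.__version__)
assert transformers.__version__ == "4.51.3", (
    f"Expected transformers 4.51.3, found {transformers.__version__}"
)
```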
We put some code together for training VibeVoice (7B) with LoRA. It uses the vendored VibeVoice model/processor and trains with a dual loss: a masked cross-entropy (CE) loss on text tokens plus a diffusion MSE loss on acoustic latents.
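For orientation, the sketch below shows how such a dual loss is typically combined; the tensor names and the `-100` masking convention are illustrative assumptions, not the script's actual variables. The two weights correspond to the `--ce_loss_weight` and `--diffusion_loss_weight` flags documented later.

```python
import torch.nn.functional as F

def dual_loss(text_logits, text_labels, noise_pred, noise_target,
              ce_loss_weight=0.04, diffusion_loss_weight=1.4):
    """Illustrative masked-CE + diffusion-MSE combination (not the actual code).

    text_labels is assumed to use -100 to mask positions (speech placeholders,
    prompt tokens) that should not contribute to the CE loss.
    """
    ce = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)),
        text_labels.view(-1),
        ignore_index=-100,
    )
    # Diffusion objective: MSE between the diffusion head's prediction and its
    # target on the acoustic latents.
    diff = F.mse_loss(noise_pred, noise_target)
    return ce_loss_weight * ce + diffusion_loss_weight * diff
```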
Requirements:

- A compatible VibeVoice 7B or 1.5B checkpoint (config + weights) and its processor files (`preprocessor_config.json`), either downloaded locally or loaded directly from the Hugging Face Hub.
- A 24 kHz audio dataset with audio files (target audio), text prompts (transcriptions), and optionally voice prompts (reference audio).
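If your source audio is not already 24 kHz, you can resample it ahead of time. A minimal sketch using `librosa` and `soundfile` (the choice of libraries and the file paths are assumptions; any resampler works, and file inputs are also resampled by the loader):

```python
import librosa
import soundfile as sf

def resample_to_24k(src_path: str, dst_path: str) -> None:
    # librosa resamples to the requested rate while loading.
    audio, sr = librosa.load(src_path, sr=24000, mono=True)
    sf.write(dst_path, audio, 24000)

resample_to_24k("raw/segment_000000.wav", "wavs/segment_000000.wav")
```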
Training from a dataset on the Hugging Face Hub:

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --dataset_name your/dataset \
  --text_column_name text \
  --audio_column_name audio \
  --voice_prompts_column_name voice_prompts \
  --output_dir outputTrain3 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --eval_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```
Training from a local JSONL file:

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir outputTrain3 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```
You can provide an optional `voice_prompts` key. If it is omitted, a voice prompt will be automatically generated from the target audio.

Example without a pre-defined voice prompt (will be auto-generated):

```json
{"text": "Speaker 0: Speaker0 transcription.", "audio": "/workspace/wavs/segment_000000.wav"}
```

Example with a pre-defined voice prompt:

```json
{"text": "Speaker 0: Speaker0 transcription.", "audio": "/workspace/wavs/segment_000000.wav", "voice_prompts": "/path/to/a/different/prompt.wav"}
```

Example with multiple speakers and voice prompts:

```json
{"text": "Speaker 0: How is the project coming along?\nSpeaker 1: It's going well, we should be finished by Friday.", "audio": "/data/conversations/convo_01.wav", "voice_prompts": ["/data/prompts/alice_voice_prompt.wav", "/data/prompts/bob_voice_prompt.wav"]}
```
Notes:

- Audio is assumed to be 24 kHz; input audio will be loaded and resampled to 24 kHz.
- If you pass raw NumPy arrays or torch Tensors as audio (without sampling-rate metadata), the collator assumes they are already 24 kHz. To trigger resampling, provide dicts like `{"array": <np.ndarray>, "sampling_rate": <int>}` or file paths.
- Tokenizers (acoustic/semantic) are frozen by default. LoRA is applied to the LLM (Qwen) and, optionally, to the diffusion head.
- The collator builds interleaved sequences with speech placeholders and computes the masks required for the diffusion loss.
- If a `voice_prompts` column is not provided for a given sample, a voice prompt is automatically generated by taking a random clip from the target audio (see the sketch after this list). This fallback helps maintain the model's voice-cloning ability. You can override this behavior by providing your own voice prompts.
- These voice prompts are randomly dropped during training to improve generalization. Drop rates of 0.2 and 0.25 have been tested with satisfactory results.
- The model learns to emit a closing `[speech_end]` token after the target placeholders.
- For multi-speaker prompts, ensure the order of the `voice_prompts` list matches the `Speaker 0/1/...` tags in your text.
- LoRA adapters are saved under `output_dir/lora` after training.
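For reference, the random-clip fallback mentioned in the notes above can be pictured roughly as follows; this is a simplified sketch, not the collator's actual code, and the clip-length bounds are assumptions:

```python
import numpy as np

def random_voice_prompt(target_audio: np.ndarray, sr: int = 24000,
                        min_s: float = 3.0, max_s: float = 6.0) -> np.ndarray:
    """Cut a random clip from the target audio to serve as the voice prompt."""
    rng = np.random.default_rng()
    clip_len = int(rng.uniform(min_s, max_s) * sr)
    if len(target_audio) <= clip_len:
        return target_audio
    start = int(rng.integers(0, len(target_audio) - clip_len))
    return target_audio[start:start + clip_len]
```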
Below is a comprehensive list of the command-line arguments available for the fine-tuning script.

The first group of arguments controls the base model, its configuration, and which components are trained.
- `--model_name_or_path`
  - What it does: Specifies the path to the pretrained VibeVoice base model. This can be a local directory or a Hugging Face Hub repository ID.
  - Required: Yes.
  - Example: `--model_name_or_path aoi-ot/VibeVoice-Large`
- `--processor_name_or_path`
  - What it does: Specifies the path to the VibeVoice processor configuration. If not provided, it defaults to `model_name_or_path`.
  - Example: `--processor_name_or_path src/vibevoice/processor`
- `--train_diffusion_head`
  - What it does: A boolean flag that enables full fine-tuning of the diffusion prediction head. When enabled, all parameters of the diffusion head become trainable.
  - Example: `--train_diffusion_head True`
- `--train_connectors`
  - What it does: A boolean flag that enables training of the acoustic and semantic connectors, which bridge different parts of the model.
  - Example: `--train_connectors True`
- `--lora_target_modules`
  - What it does: A comma-separated string of module names within the language model to apply LoRA adapters to. This is the primary way to enable LoRA for the text-processing part of the model (see the sketch after this list for how the LoRA flags fit together).
  - Example: `--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
- `--lora_r`
  - What it does: The rank (`r`) of the LoRA decomposition. A smaller value means fewer trainable parameters.
  - Default: `8`
  - Example: `--lora_r 16`
- `--lora_alpha`
  - What it does: The scaling factor for the LoRA weights. A common practice is to set `lora_alpha` to four times the value of `lora_r`.
  - Default: `32`
  - Example: `--lora_alpha 64`
- `--lora_wrap_diffusion_head`
  - What it does: An alternative to `--train_diffusion_head`. If `True`, LoRA adapters are applied to the diffusion head instead of fine-tuning it fully, which is more parameter-efficient. Use only one of `--train_diffusion_head` or `--lora_wrap_diffusion_head`.
  - Default: `False`
- `--layers_to_freeze`
  - What it does: Comma-separated indices of diffusion head layers to freeze (e.g., `0,1,5,7,8`).
  - Default: `None`
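To make the LoRA flags above concrete, here is roughly how they map onto a `peft` `LoraConfig`, assuming the script builds its adapters with the `peft` library (a sketch, not the script's actual code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,              # --lora_r (default 8)
    lora_alpha=32,    # --lora_alpha (default 32)
    target_modules=[  # --lora_target_modules
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```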
The next group of arguments defines the dataset to be used, its structure, and how it should be processed.
- `--train_jsonl`
  - What it does: Path to your local training data file in JSONL (JSON Lines) format. Each line should be a JSON object with keys for the text and the audio path.
  - Example: `--train_jsonl prompts.jsonl`
- `--validation_jsonl`
  - What it does: Optional path to a local validation data file in JSONL format.
  - Example: `--validation_jsonl validation_prompts.jsonl`
- `--text_column_name`
  - What it does: The name of the key in your JSONL file that contains the text transcription/prompt.
  - Default: `text`
  - Example: `--text_column_name "prompt"`
- `--audio_column_name`
  - What it does: The name of the key in your JSONL file that contains the path to the audio file.
  - Default: `audio`
  - Example: `--audio_column_name "file_path"`
- `--voice_prompt_drop_rate`
  - What it does: The probability (from 0.0 to 1.0) of randomly dropping the conditioning voice prompt during training. This acts as a regularizer (see the sketch after this list).
  - Default: `0.0`
  - Example: `--voice_prompt_drop_rate 0.2`
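The voice-prompt drop is conceptually just a per-sample coin flip; a minimal illustration (not the script's actual collator code):

```python
import random

def maybe_drop_voice_prompt(voice_prompt, drop_rate: float = 0.2):
    """With probability drop_rate, train this sample without its voice prompt."""
    if voice_prompt is not None and random.random() < drop_rate:
        return None
    return voice_prompt
```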
The following are standard Hugging Face `TrainingArguments` that control the training loop, the optimizer, and checkpoint saving.
- `--output_dir`
  - What it does: The directory where model checkpoints and final outputs will be saved.
  - Required: Yes.
  - Example: `--output_dir output_model`
- `--per_device_train_batch_size`
  - What it does: The number of training examples processed per GPU in a single step.
  - Example: `--per_device_train_batch_size 8`
- `--gradient_accumulation_steps`
  - What it does: The number of forward passes over which gradients are accumulated before performing an optimizer step. This effectively increases the batch size without using more VRAM (see the worked example after this list).
  - Example: `--gradient_accumulation_steps 16`
- `--learning_rate`
  - What it does: The initial learning rate for the optimizer.
  - Example: `--learning_rate 2.5e-5`
- `--num_train_epochs`
  - What it does: The total number of times to iterate over the entire training dataset.
  - Example: `--num_train_epochs 5`
- `--logging_steps`
  - What it does: How often (in steps) to log training metrics such as the loss.
  - Example: `--logging_steps 10`
- `--save_steps`
  - What it does: How often (in steps) to save a model checkpoint.
  - Example: `--save_steps 100`
- `--report_to`
  - What it does: The integration to report logs to. Can be `wandb`, `tensorboard`, or `none`.
  - Example: `--report_to wandb`
- `--remove_unused_columns`
  - What it does: Whether to remove dataset columns that are not used by the model's `forward` method. This must be set to `False` for this script to work correctly.
  - Example: `--remove_unused_columns False`
- `--bf16`
  - What it does: Enables mixed-precision training using `bfloat16`. This speeds up training and reduces memory usage on compatible GPUs (NVIDIA Ampere series and newer).
  - Example: `--bf16 True`
- `--gradient_checkpointing`
  - What it does: A memory-saving technique that trades compute for memory. Useful for training large models on limited VRAM.
  - Example: `--gradient_checkpointing True`
- `--lr_scheduler_type`
  - What it does: The type of learning rate schedule to use (e.g., `linear`, `cosine`, `constant`).
  - Example: `--lr_scheduler_type cosine`
- `--warmup_ratio`
  - What it does: The proportion of total training steps used for a linear warmup from 0 to the initial learning rate.
  - Example: `--warmup_ratio 0.03`
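As a worked example, the effective batch size implied by the example command above can be computed as follows (single-node case; multiply by the number of GPUs when using data parallelism):

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 16
num_gpus = 1  # adjust for multi-GPU data-parallel training

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128
```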
Finally, these special arguments control VibeVoice-specific training behaviors.
- `--gradient_clipping`
  - What it does: A custom boolean flag that acts as the master switch for gradient clipping. If you include this flag, the value from `--max_grad_norm` will be used to prevent exploding gradients.
  - Example: `--gradient_clipping`
- `--max_grad_norm`
  - What it does: The maximum gradient norm for clipping. Only active if `--gradient_clipping` is also used.
  - Default: `1.0`
  - Example: `--max_grad_norm 0.8`
- `--diffusion_loss_weight`
  - What it does: A float that scales the importance of the diffusion loss (for speech generation quality) in the total loss calculation.
  - Example: `--diffusion_loss_weight 1.4`
- `--ce_loss_weight`
  - What it does: A float that scales the importance of the cross-entropy loss (for text prediction accuracy) in the total loss calculation.
  - Example: `--ce_loss_weight 0.04`
- `--ddpm_batch_mul`
  - What it does: An integer multiplier for the batch size used specifically within the diffusion process (see the sketch after this list).
  - Example: `--ddpm_batch_mul 4`
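A common way to use such a multiplier is to repeat each acoustic latent several times with independently sampled timesteps and noise, so every example contributes several noise levels to the diffusion loss per step. The sketch below illustrates that idea; it is an assumption about intent, not the script's exact implementation:

```python
import torch

def expand_for_diffusion(latents: torch.Tensor, ddpm_batch_mul: int = 4,
                         num_train_timesteps: int = 1000):
    """Repeat latents so each one is trained at several noise levels per step."""
    repeated = latents.repeat_interleave(ddpm_batch_mul, dim=0)
    timesteps = torch.randint(0, num_train_timesteps, (repeated.shape[0],),
                              device=latents.device)
    noise = torch.randn_like(repeated)
    return repeated, timesteps, noise
```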