
Unofficial WIP Finetuning repo for VibeVoice

Hardware requirements

To train a VibeVoice 1.5B LoRA, a machine with at least 16 GB of VRAM is recommended.

To train a VibeVoice 7B LoRA, a machine with at least 48 GB of VRAM is recommended.

Keep in mind that longer audio clips increase VRAM requirements.

Installation

It is recommended to install this repository in a fresh environment. The Docker image runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04 has been tested and is known to work.

Transformers 4.51.3 is known to work; other versions have errors related to the Qwen2 architecture.

git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3

(OPTIONAL) wandb login

(OPTIONAL) export HF_HOME=/workspace/hf_models
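
As a quick sanity check after installation, the snippet below (a minimal sketch; it only inspects the environment, not the repo itself) confirms the pinned transformers version and that a GPU is visible:

# Sanity check: confirm the pinned transformers release and GPU visibility.
import torch
import transformers

print("transformers:", transformers.__version__)   # expected: 4.51.3
print("CUDA available:", torch.cuda.is_available())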

Usage

VibeVoice 1.5B / 7B (LoRA) fine-tuning

This repository provides code for training VibeVoice (1.5B or 7B) with LoRA. It uses the vendored VibeVoice model/processor and trains with a dual loss: masked cross-entropy (CE) on text tokens plus diffusion MSE on acoustic latents.

Requirements:

  • Download a compatible VibeVoice 7B or 1.5B checkpoint (config + weights) and its processor files (preprocessor_config.json), or load the model directly from the Hugging Face Hub.

  • A 24 kHz audio dataset with audio files (target audio), text prompts (transcriptions), and optionally voice prompts (reference audio); if your audio uses a different sample rate, see the resampling sketch below.
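
If your source audio is not already 24 kHz, it can be resampled ahead of time. A minimal sketch using librosa and soundfile (both assumed to be installed; the file paths are hypothetical):

# Hypothetical preprocessing sketch: resample a WAV file to 24 kHz mono.
import librosa
import soundfile as sf

src = "raw/segment_000000.wav"     # hypothetical input path
dst = "wavs/segment_000000.wav"    # hypothetical 24 kHz output path

audio, _ = librosa.load(src, sr=24_000, mono=True)  # load and resample to 24 kHz
sf.write(dst, audio, 24_000)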

Training with Hugging Face Dataset

python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --dataset_name your/dataset \
  --text_column_name text \
  --audio_column_name audio \
  --voice_prompts_column_name voice_prompts \
  --output_dir outputTrain3 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --eval_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
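
For the Hugging Face path, the dataset must expose the columns named by --text_column_name, --audio_column_name and (optionally) --voice_prompts_column_name. The sketch below shows one way such a dataset could be built with the datasets library; the row contents reuse the JSONL examples further down and are not guaranteed to match your data exactly:

# Hypothetical sketch: a tiny dataset with the expected column layout.
from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "text": ["Speaker 0: Speaker0 transcription."],
    "audio": ["/workspace/wavs/segment_000000.wav"],
    "voice_prompts": [["/path/to/a/different/prompt.wav"]],  # optional column
})
# Decode the audio column at 24 kHz, matching the training assumption.
ds = ds.cast_column("audio", Audio(sampling_rate=24_000))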

Training with Local JSONL Dataset

python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir outputTrain3 \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8

JSONL format:

You can provide an optional voice_prompts key. If it is omitted, a voice prompt will be automatically generated from the target audio.

Example without a pre-defined voice prompt (will be auto-generated):

{"text": "Speaker 0: Speaker0 transcription.", "audio": "/workspace/wavs/segment_000000.wav"}

Example with a pre-defined voice prompt:

{"text": "Speaker 0: Speaker0 transcription.", "audio": "/workspace/wavs/segment_000000.wav", "voice_prompts": "/path/to/a/different/prompt.wav"}

Example with multiple speakers and voice prompts:

{"text": "Speaker 0: How is the project coming along?\nSpeaker 1: It's going well, we should be finished by Friday.", "audio": "/data/conversations/convo_01.wav", "voice_prompts": ["/data/prompts/alice_voice_prompt.wav", "/data/prompts/bob_voice_prompt.wav"]}

Notes:

  • Audio is assumed to be 24 kHz; input audio will be loaded/resampled to 24 kHz.

  • If you pass raw NumPy arrays or torch Tensors as audio (without sampling-rate metadata), the collator assumes they are already 24 kHz. To trigger resampling, provide dicts like {"array": <np.ndarray>, "sampling_rate": <int>} or file paths (see the sketch after these notes).

  • Tokenizers (acoustic/semantic) are frozen by default. LoRA is applied to the LLM (Qwen) and optionally to the diffusion head.

  • The collator builds interleaved sequences with speech placeholders and computes the required masks for diffusion loss.

  • If a voice_prompts column is not provided in your dataset for a given sample, a voice prompt is automatically generated by taking a random clip from the target audio. This fallback ensures the model's voice cloning ability is maintained. You can override this behavior by providing your own voice prompts.

  • These voice_prompts are randomly dropped during training to improve generalization (see --voice_prompt_drop_rate). Drop rates of 0.2 and 0.25 have been tested with satisfactory results.

  • The model learns to emit a closing [speech_end] token after target placeholders.

  • For multi‑speaker prompts, ensure voice_prompts list order matches Speaker 0/1/... tags in your text.

  • LoRA adapters are saved under output_dir/lora after training.
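
As noted above, in-memory audio without sampling-rate metadata is assumed to already be 24 kHz; passing a dict makes the rate explicit so resampling can be triggered. A minimal sketch (the array contents here are hypothetical):

# Hypothetical sketch: the dict form that carries an explicit sampling rate.
import numpy as np

waveform = np.zeros(24_000, dtype=np.float32)   # one second of (silent) audio
sample = {
    "text": "Speaker 0: Speaker0 transcription.",
    "audio": {"array": waveform, "sampling_rate": 24_000},
}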

Training Script Arguments

A comprehensive list of the command-line arguments available for the fine-tuning script.

Model & Architecture Arguments

Controls the base model, its configuration, and which components are trained.

  • --model_name_or_path

  • What it does: Specifies the path to the pretrained VibeVoice base model. This can be a local directory or a Hugging Face Hub repository ID.

  • Required: Yes.

  • Example:

--model_name_or_path aoi-ot/VibeVoice-Large
  • --processor_name_or_path

  • What it does: Specifies the path to the VibeVoice processor configuration. If not provided, it defaults to the model_name_or_path.

  • Example:

--processor_name_or_path src/vibevoice/processor
  • --train_diffusion_head

  • What it does: A boolean flag to enable full fine-tuning of the diffusion prediction head. When enabled, all parameters of the diffusion head become trainable.

  • Example:

--train_diffusion_head True
  • --train_connectors

  • What it does: A boolean flag to enable training of the acoustic and semantic connectors, which bridge different parts of the model.

  • Example:

--train_connectors True
  • --lora_target_modules

  • What it does: A comma-separated string of module names within the language model to apply LoRA adapters to. This is the primary way to enable LoRA for the text-processing part of the model.

  • Example:

--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
  • --lora_r

  • What it does: The rank (r) of the LoRA decomposition. A smaller number means fewer trainable parameters.

  • Default: 8

  • Example:

--lora_r 16
  • --lora_alpha

  • What it does: The scaling factor for the LoRA weights. A common practice is to set lora_alpha to be four times the value of lora_r.

  • Default: 32

  • Example:

--lora_alpha 64
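
For reference, LoRA updates are scaled by lora_alpha / lora_r, so --lora_r 16 with --lora_alpha 64 gives a scaling factor of 4, the same ratio as the defaults (8 and 32).
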
  • --lora_wrap_diffusion_head

  • What it does: An alternative to --train_diffusion_head. If True, it applies LoRA adapters to the diffusion head instead of fine-tuning it fully, enabling more parameter-efficient training of the head. Use either --train_diffusion_head or --lora_wrap_diffusion_head, not both.

  • Default: False

  • --layers_to_freeze

  • What it does: Comma-separated indices of diffusion head layers to freeze (e.g., '0,1,5,7,8').

  • Default: None

Data & Processing Arguments

Defines the dataset to be used, its structure, and how it should be processed.

  • --train_jsonl

  • What it does: Path to your local training data file in JSONL (JSON Lines) format. Each line should be a JSON object with keys for text and audio path.

  • Example:

--train_jsonl prompts.jsonl
  • --validation_jsonl

  • What it does: Optional path to a local validation data file in JSONL format.

  • Example:

--validation_jsonl validation_prompts.jsonl
  • --text_column_name

  • What it does: The name of the key in your JSONL file that contains the text transcription/prompt.

  • Default: text

  • Example:

--text_column_name "prompt"
  • --audio_column_name

  • What it does: The name of the key in your JSONL file that contains the path to the audio file.

  • Default: audio

  • Example:

--audio_column_name "file_path"
  • --voice_prompt_drop_rate

  • What it does: The probability (from 0.0 to 1.0) of randomly dropping the conditioning voice prompt during training. This acts as a regularizer.

  • Default: 0.0

  • Example:

--voice_prompt_drop_rate 0.2
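
For example, --voice_prompt_drop_rate 0.2 means roughly one in five training samples is seen without its conditioning voice prompt.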

Core Training Arguments

Standard Hugging Face TrainingArguments that control the training loop, optimizer, and saving.

  • --output_dir

  • What it does: The directory where model checkpoints and final outputs will be saved.

  • Required: Yes.

  • Example:

--output_dir output_model
  • --per_device_train_batch_size

  • What it does: The number of training examples processed per GPU in a single step.

  • Example:

--per_device_train_batch_size 8
  • --gradient_accumulation_steps

  • What it does: The number of forward passes to accumulate gradients for before performing an optimizer step. This effectively increases the batch size without using more VRAM.

  • Example:

--gradient_accumulation_steps 16
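
For example, --per_device_train_batch_size 8 with --gradient_accumulation_steps 16 gives an effective batch size of 8 × 16 = 128 examples per optimizer step on each GPU.
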
  • --learning_rate

  • What it does: The initial learning rate for the optimizer.

  • Example:

--learning_rate 2.5e-5
  • --num_train_epochs

  • What it does: The total number of times to iterate over the entire training dataset.

  • Example:

--num_train_epochs 5
  • --logging_steps

  • What it does: How often (in steps) to log training metrics like loss.

  • Example:

--logging_steps 10
  • --save_steps

  • What it does: How often (in steps) to save a model checkpoint.

  • Example:

--save_steps 100
  • --report_to

  • What it does: The integration to report logs to. Can be wandb, tensorboard, or none.

  • Example:

--report_to wandb
  • --remove_unused_columns

  • What it does: Whether to remove columns from the dataset not used by the model's forward method. This must be set to False for this script to work correctly.

  • Example:

--remove_unused_columns False
  • --bf16

  • What it does: Enables mixed-precision training using bfloat16. This speeds up training and reduces memory usage on compatible GPUs (NVIDIA Ampere series and newer).

  • Example:

--bf16 True
  • --gradient_checkpointing

  • What it does: A memory-saving technique that trades compute for memory. Useful for training large models on limited VRAM.

  • Example:

--gradient_checkpointing True
  • --lr_scheduler_type

  • What it does: The type of learning rate schedule to use (e.g., linear, cosine, constant).

  • Example:

--lr_scheduler_type cosine
  • --warmup_ratio

  • What it does: The proportion of total training steps used for a linear warmup from 0 to the initial learning rate.

  • Example:

--warmup_ratio 0.03

Custom VibeVoice Training Arguments

Special arguments to control VibeVoice-specific training behaviors.

  • --gradient_clipping

  • What it does: A custom boolean flag that acts as the master switch for gradient clipping. If you include this flag, the value from --max_grad_norm will be used to prevent exploding gradients.

  • Example:

--gradient_clipping
  • --max_grad_norm

  • What it does: The maximum value for gradient clipping. Only active if --gradient_clipping is also used.

  • Default: 1.0

  • Example:

--max_grad_norm 0.8
  • --diffusion_loss_weight

  • What it does: A float that scales the importance of the diffusion loss (for speech generation quality) in the total loss calculation.

  • Example:

--diffusion_loss_weight 1.4
  • --ce_loss_weight

  • What it does: A float that scales the importance of the Cross-Entropy loss (for text prediction accuracy) in the total loss calculation.

  • Example:

--ce_loss_weight 0.04
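
Together with --diffusion_loss_weight, this sets the balance of the dual loss described above; presumably the total loss is ce_loss_weight × (masked CE on text tokens) + diffusion_loss_weight × (diffusion MSE on acoustic latents).
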
  • --ddpm_batch_mul

  • What it does: An integer multiplier for the batch size used specifically within the diffusion process.

  • Example:

--ddpm_batch_mul 4
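
For example, with --per_device_train_batch_size 8 and --ddpm_batch_mul 4, the diffusion loss is presumably computed over 8 × 4 = 32 noised samples per step (each acoustic latent paired with several sampled timesteps), which increases compute but not the number of audio examples loaded.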
