
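I am fine-tuning VoxCPM2 on four RTX 3090 GPUs (24 GB each, 96 GB of VRAM in total). The official hardware notes say full-parameter fine-tuning works with 40 GB of VRAM, but I cannot get it to run at all; it keeps failing with out-of-memory errors #335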

@hanyong-max


For now I am using LoRA fine-tuning, which is the only mode that trains on the four 3090s. The configuration is as follows:
pretrained_path: /home/data1/hanyong/VoxCPM2/
train_manifest: train.jsonl
val_manifest: val.jsonl
sample_rate: 16000 # AudioVAE encoder input rate; must match audio_vae_config.sample_rate
out_sample_rate: 48000 # AudioVAE decoder output rate; used for TensorBoard audio logging
batch_size: 1
grad_accum_steps: 2 # effective batch size per process = batch_size × grad_accum_steps = 1 × 2 = 2
num_workers: 8
num_iters: 1000
log_interval: 10
valid_interval: 500
save_interval: 500
learning_rate: 0.0001
weight_decay: 0.01
warmup_steps: 100
max_steps: 1000
max_batch_tokens: 8192
max_grad_norm: 1.0 # gradient clipping max norm; 0 = disabled
save_path: checkpoints/finetune_lora
tensorboard: logs/finetune_lora
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true
  enable_dit: true
  enable_proj: false
  r: 64
  alpha: 128
  dropout: 0.0

# Distribution options (optional)
# - If distribute=false (default): save pretrained_path as base_model in lora_config.json
# - If distribute=true: save hf_model_id as base_model (hf_model_id is required)

hf_model_id: "openbmb/VoxCPM2"

distribute: true
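For reference when reading the `lr` column in the training log: the values climb roughly linearly until about step 100 and then fall smoothly to zero by the final step, which is consistent with linear warmup followed by cosine decay. The sketch below assumes that schedule shape, using `learning_rate`, `warmup_steps`, and `max_steps` from the config above. The function name `lr_at` is hypothetical, and the training script's actual scheduler may differ in step offsets or rounding.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=100, max_steps=1000):
    """Assumed schedule: linear warmup to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase that has elapsed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 50, 100, 250, 500, 750, 999):
        print(f"step {s:4d}: lr = {lr_at(s):.6f}")
```

Under this assumption a learning rate of 0.000000 near the end of the run is expected behavior and does not indicate a stalled run.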

The training log is as follows:
[train] step 0: loss/diff: 0.795605, loss/stop: 0.094951, lr: 0.000004, epoch: 0.000000, grad_norm: 0.280851
[val] step 0: loss/total: 0.886084, loss/diff: 0.786763, loss/stop: 0.099320, log interval: 2.19s
[Audio] Starting audio generation for 2 samples at step 0
[Audio] Loaded reference audio for sample 0: duration=14.35s
[Audio] Generating sample 0 with text: 'juNgu mavarip vixliri ministiri bir wAkillAr vOmik...'
28%|███████████████████████████████████████▍ | 146/514 [00:36<01:31, 4.04it/s]
[Audio] Generated audio for sample 0: duration=23.52s
[Audio] Created mel spectrogram figure for sample 0
[Audio] Loaded reference audio for sample 1: duration=6.99s
[Audio] Generating sample 1 with text: 'radoketsniN mijAzi juxqun vot yUrAk ziyaliy baxqil...'
27%|██████████████████████████████████████▍ | 67/244 [00:16<00:43, 4.03it/s]
[Audio] Generated audio for sample 1: duration=10.88s
[Audio] Created mel spectrogram figure for sample 1
[train] step 10: loss/diff: 0.830405, loss/stop: 0.070261, lr: 0.000014, epoch: 0.011996, grad_norm: 0.244753, log interval: 65.80s
[train] step 20: loss/diff: 0.917423, loss/stop: 0.079198, lr: 0.000024, epoch: 0.023992, grad_norm: 0.429277, log interval: 6.82s
[train] step 30: loss/diff: 0.635635, loss/stop: 0.047664, lr: 0.000034, epoch: 0.035987, grad_norm: 0.342566, log interval: 6.84s
[train] step 40: loss/diff: 0.841173, loss/stop: 0.045623, lr: 0.000044, epoch: 0.047983, grad_norm: 0.276314, log interval: 6.90s
[train] step 50: loss/diff: 0.609366, loss/stop: 0.020221, lr: 0.000054, epoch: 0.059979, grad_norm: 0.253714, log interval: 6.83s
[train] step 60: loss/diff: 0.821337, loss/stop: 0.043339, lr: 0.000064, epoch: 0.071975, grad_norm: 0.149903, log interval: 6.84s
[train] step 70: loss/diff: 0.792331, loss/stop: 0.009470, lr: 0.000074, epoch: 0.083971, grad_norm: 0.184850, log interval: 6.80s
[train] step 80: loss/diff: 0.796994, loss/stop: 0.040086, lr: 0.000084, epoch: 0.095966, grad_norm: 0.139579, log interval: 6.80s
[train] step 90: loss/diff: 0.681536, loss/stop: 0.030778, lr: 0.000094, epoch: 0.107962, grad_norm: 0.121979, log interval: 6.81s
[train] step 100: loss/diff: 0.785745, loss/stop: 0.032618, lr: 0.000100, epoch: 0.119958, grad_norm: 0.151634, log interval: 6.79s
[train] step 110: loss/diff: 0.817616, loss/stop: 0.024341, lr: 0.000100, epoch: 0.131954, grad_norm: 0.137358, log interval: 6.73s
[train] step 120: loss/diff: 0.623046, loss/stop: 0.061267, lr: 0.000100, epoch: 0.143950, grad_norm: 0.118033, log interval: 7.05s
[train] step 130: loss/diff: 0.748371, loss/stop: 0.046616, lr: 0.000100, epoch: 0.155945, grad_norm: 0.120386, log interval: 6.79s
[train] step 140: loss/diff: 0.748926, loss/stop: 0.011355, lr: 0.000099, epoch: 0.167941, grad_norm: 0.132304, log interval: 6.73s
[train] step 150: loss/diff: 0.545336, loss/stop: 0.031148, lr: 0.000099, epoch: 0.179937, grad_norm: 0.126028, log interval: 6.82s
[train] step 160: loss/diff: 0.731464, loss/stop: 0.023537, lr: 0.000099, epoch: 0.191933, grad_norm: 0.158642, log interval: 6.80s
[train] step 170: loss/diff: 0.914412, loss/stop: 0.012580, lr: 0.000098, epoch: 0.203929, grad_norm: 0.116412, log interval: 6.83s
[train] step 180: loss/diff: 0.614299, loss/stop: 0.042413, lr: 0.000098, epoch: 0.215924, grad_norm: 0.183552, log interval: 6.71s
[train] step 190: loss/diff: 0.590400, loss/stop: 0.040791, lr: 0.000097, epoch: 0.227920, grad_norm: 0.147735, log interval: 6.73s
[train] step 200: loss/diff: 0.809015, loss/stop: 0.014847, lr: 0.000097, epoch: 0.239916, grad_norm: 0.141026, log interval: 6.77s
[train] step 210: loss/diff: 0.920681, loss/stop: 0.022531, lr: 0.000096, epoch: 0.251912, grad_norm: 0.122057, log interval: 6.67s
[train] step 220: loss/diff: 0.674520, loss/stop: 0.039253, lr: 0.000095, epoch: 0.263908, grad_norm: 0.137016, log interval: 6.83s
[train] step 230: loss/diff: 0.633115, loss/stop: 0.029318, lr: 0.000095, epoch: 0.275903, grad_norm: 0.145284, log interval: 7.04s
[train] step 240: loss/diff: 0.719160, loss/stop: 0.013388, lr: 0.000094, epoch: 0.287899, grad_norm: 0.105581, log interval: 7.73s
[train] step 250: loss/diff: 0.826808, loss/stop: 0.012225, lr: 0.000093, epoch: 0.299895, grad_norm: 0.113429, log interval: 7.70s
[train] step 260: loss/diff: 0.937182, loss/stop: 0.012039, lr: 0.000092, epoch: 0.311891, grad_norm: 0.104130, log interval: 7.69s
[train] step 270: loss/diff: 0.760258, loss/stop: 0.058279, lr: 0.000091, epoch: 0.323887, grad_norm: 0.091529, log interval: 7.71s
[train] step 280: loss/diff: 0.734037, loss/stop: 0.038356, lr: 0.000090, epoch: 0.335882, grad_norm: 0.126253, log interval: 7.81s
[train] step 290: loss/diff: 0.827122, loss/stop: 0.006401, lr: 0.000089, epoch: 0.347878, grad_norm: 0.113183, log interval: 7.66s
[train] step 300: loss/diff: 0.842096, loss/stop: 0.040482, lr: 0.000088, epoch: 0.359874, grad_norm: 0.104184, log interval: 7.70s
[train] step 310: loss/diff: 0.805587, loss/stop: 0.008785, lr: 0.000087, epoch: 0.371870, grad_norm: 0.092663, log interval: 7.82s
[train] step 320: loss/diff: 0.773593, loss/stop: 0.019734, lr: 0.000085, epoch: 0.383866, grad_norm: 0.173443, log interval: 7.71s
[train] step 330: loss/diff: 0.724521, loss/stop: 0.012348, lr: 0.000084, epoch: 0.395861, grad_norm: 0.119392, log interval: 7.80s
[train] step 340: loss/diff: 0.564336, loss/stop: 0.032383, lr: 0.000083, epoch: 0.407857, grad_norm: 0.106394, log interval: 7.85s
[train] step 350: loss/diff: 0.938383, loss/stop: 0.009020, lr: 0.000082, epoch: 0.419853, grad_norm: 0.147562, log interval: 7.64s
[train] step 360: loss/diff: 0.694623, loss/stop: 0.022587, lr: 0.000080, epoch: 0.431849, grad_norm: 0.127968, log interval: 8.30s
[train] step 370: loss/diff: 0.833955, loss/stop: 0.043546, lr: 0.000079, epoch: 0.443845, grad_norm: 0.157896, log interval: 7.85s
[train] step 380: loss/diff: 0.867673, loss/stop: 0.013192, lr: 0.000077, epoch: 0.455840, grad_norm: 0.096919, log interval: 7.74s
[train] step 390: loss/diff: 0.823804, loss/stop: 0.040740, lr: 0.000076, epoch: 0.467836, grad_norm: 0.140176, log interval: 7.83s
[train] step 400: loss/diff: 0.670941, loss/stop: 0.036070, lr: 0.000074, epoch: 0.479832, grad_norm: 0.105600, log interval: 7.82s
[train] step 410: loss/diff: 0.742085, loss/stop: 0.079350, lr: 0.000073, epoch: 0.491828, grad_norm: 0.126576, log interval: 7.84s
[train] step 420: loss/diff: 0.746032, loss/stop: 0.018437, lr: 0.000071, epoch: 0.503824, grad_norm: 0.141396, log interval: 7.71s
[train] step 430: loss/diff: 0.747551, loss/stop: 0.020259, lr: 0.000070, epoch: 0.515819, grad_norm: 0.119167, log interval: 7.75s
[train] step 440: loss/diff: 0.677420, loss/stop: 0.012173, lr: 0.000068, epoch: 0.527815, grad_norm: 0.095579, log interval: 7.84s
[train] step 450: loss/diff: 0.555260, loss/stop: 0.034110, lr: 0.000066, epoch: 0.539811, grad_norm: 0.125872, log interval: 7.73s
[train] step 460: loss/diff: 0.836345, loss/stop: 0.022795, lr: 0.000065, epoch: 0.551807, grad_norm: 0.132507, log interval: 7.75s
[train] step 470: loss/diff: 0.744545, loss/stop: 0.016354, lr: 0.000063, epoch: 0.563803, grad_norm: 0.098208, log interval: 7.73s
[train] step 480: loss/diff: 0.570259, loss/stop: 0.014147, lr: 0.000061, epoch: 0.575798, grad_norm: 0.111879, log interval: 7.77s
[train] step 490: loss/diff: 0.818003, loss/stop: 0.006987, lr: 0.000060, epoch: 0.587794, grad_norm: 0.090772, log interval: 7.71s
[train] step 500: loss/diff: 0.838311, loss/stop: 0.003871, lr: 0.000058, epoch: 0.599790, grad_norm: 0.144140, log interval: 7.37s
[val] step 500: loss/total: 0.781678, loss/diff: 0.750861, loss/stop: 0.030817, log interval: 2.43s
[Audio] Starting audio generation for 2 samples at step 500
[Audio] Loaded reference audio for sample 0: duration=14.35s
[Audio] Generating sample 0 with text: 'juNgu mavarip vixliri ministiri bir wAkillAr vOmik...'
13%|██████████████████▊ | 69/514 [00:17<01:51, 3.99it/s]
[Audio] Generated audio for sample 0: duration=11.20s
[Audio] Created mel spectrogram figure for sample 0
[Audio] Loaded reference audio for sample 1: duration=6.99s
[Audio] Generating sample 1 with text: 'radoketsniN mijAzi juxqun vot yUrAk ziyaliy baxqil...'
14%|██████████████████▉ | 33/244 [00:08<00:53, 3.94it/s]
[Audio] Generated audio for sample 1: duration=5.44s
[Audio] Created mel spectrogram figure for sample 1
[train] step 510: loss/diff: 0.423096, loss/stop: 0.093216, lr: 0.000056, epoch: 0.611786, grad_norm: 0.132801, log interval: 35.87s
[train] step 520: loss/diff: 0.814054, loss/stop: 0.061541, lr: 0.000055, epoch: 0.623782, grad_norm: 0.125990, log interval: 7.31s
[train] step 530: loss/diff: 0.706048, loss/stop: 0.016536, lr: 0.000053, epoch: 0.635777, grad_norm: 0.094305, log interval: 7.51s
[train] step 540: loss/diff: 0.827399, loss/stop: 0.021573, lr: 0.000051, epoch: 0.647773, grad_norm: 0.119886, log interval: 7.55s
[train] step 550: loss/diff: 0.753327, loss/stop: 0.035193, lr: 0.000049, epoch: 0.659769, grad_norm: 0.153691, log interval: 7.25s
[train] step 560: loss/diff: 0.936800, loss/stop: 0.012613, lr: 0.000048, epoch: 0.671765, grad_norm: 0.127499, log interval: 7.43s
[train] step 570: loss/diff: 0.457559, loss/stop: 0.058247, lr: 0.000046, epoch: 0.683761, grad_norm: 0.124116, log interval: 7.48s
[train] step 580: loss/diff: 0.752189, loss/stop: 0.064857, lr: 0.000044, epoch: 0.695756, grad_norm: 0.139487, log interval: 7.56s
[train] step 590: loss/diff: 0.661088, loss/stop: 0.028505, lr: 0.000042, epoch: 0.707752, grad_norm: 0.101631, log interval: 7.45s
[train] step 600: loss/diff: 0.937110, loss/stop: 0.083241, lr: 0.000041, epoch: 0.719748, grad_norm: 0.139406, log interval: 7.28s
[train] step 610: loss/diff: 0.667592, loss/stop: 0.044597, lr: 0.000039, epoch: 0.731744, grad_norm: 0.096127, log interval: 7.57s
[train] step 620: loss/diff: 0.736525, loss/stop: 0.042822, lr: 0.000037, epoch: 0.743740, grad_norm: 0.106670, log interval: 7.34s
[train] step 630: loss/diff: 0.732863, loss/stop: 0.017251, lr: 0.000036, epoch: 0.755735, grad_norm: 0.165649, log interval: 6.76s
[train] step 640: loss/diff: 0.898116, loss/stop: 0.014935, lr: 0.000034, epoch: 0.767731, grad_norm: 0.131377, log interval: 6.80s
[train] step 650: loss/diff: 0.675499, loss/stop: 0.020595, lr: 0.000032, epoch: 0.779727, grad_norm: 0.092474, log interval: 6.67s
[train] step 660: loss/diff: 0.734146, loss/stop: 0.034231, lr: 0.000031, epoch: 0.791723, grad_norm: 0.113555, log interval: 6.81s
[train] step 670: loss/diff: 0.673073, loss/stop: 0.027141, lr: 0.000029, epoch: 0.803719, grad_norm: 0.115028, log interval: 6.81s
[train] step 680: loss/diff: 0.907107, loss/stop: 0.044932, lr: 0.000027, epoch: 0.815714, grad_norm: 0.122213, log interval: 7.03s
[train] step 690: loss/diff: 0.735726, loss/stop: 0.037376, lr: 0.000026, epoch: 0.827710, grad_norm: 0.106283, log interval: 7.50s
[train] step 700: loss/diff: 0.813986, loss/stop: 0.022954, lr: 0.000024, epoch: 0.839706, grad_norm: 0.122803, log interval: 7.31s
[train] step 710: loss/diff: 0.779479, loss/stop: 0.019504, lr: 0.000023, epoch: 0.851702, grad_norm: 0.094712, log interval: 7.50s
[train] step 720: loss/diff: 0.777215, loss/stop: 0.021808, lr: 0.000021, epoch: 0.863698, grad_norm: 0.116064, log interval: 7.47s
[train] step 730: loss/diff: 0.636927, loss/stop: 0.016778, lr: 0.000020, epoch: 0.875694, grad_norm: 0.118061, log interval: 7.45s
[train] step 740: loss/diff: 0.682794, loss/stop: 0.014009, lr: 0.000019, epoch: 0.887689, grad_norm: 0.125199, log interval: 7.51s
[train] step 750: loss/diff: 0.774863, loss/stop: 0.026116, lr: 0.000017, epoch: 0.899685, grad_norm: 0.132493, log interval: 7.55s
[train] step 760: loss/diff: 0.973172, loss/stop: 0.030828, lr: 0.000016, epoch: 0.911681, grad_norm: 0.207516, log interval: 7.54s
[train] step 770: loss/diff: 0.703505, loss/stop: 0.013881, lr: 0.000015, epoch: 0.923677, grad_norm: 0.115478, log interval: 7.45s
[train] step 780: loss/diff: 0.624844, loss/stop: 0.032698, lr: 0.000014, epoch: 0.935673, grad_norm: 0.108095, log interval: 7.55s
[train] step 790: loss/diff: 0.760115, loss/stop: 0.018237, lr: 0.000012, epoch: 0.947668, grad_norm: 0.119490, log interval: 7.39s
[train] step 800: loss/diff: 0.704373, loss/stop: 0.008140, lr: 0.000011, epoch: 0.959664, grad_norm: 0.105500, log interval: 7.52s
[train] step 810: loss/diff: 0.840312, loss/stop: 0.088185, lr: 0.000010, epoch: 0.971660, grad_norm: 0.115157, log interval: 7.55s
[train] step 820: loss/diff: 0.970671, loss/stop: 0.045782, lr: 0.000009, epoch: 0.983656, grad_norm: 0.103814, log interval: 6.92s
[train] step 830: loss/diff: 0.784177, loss/stop: 0.062182, lr: 0.000008, epoch: 0.995652, grad_norm: 0.119346, log interval: 7.91s
[train] step 840: loss/diff: 0.641534, loss/stop: 0.016578, lr: 0.000007, epoch: 1.007647, grad_norm: 0.099631, log interval: 8.11s
[train] step 850: loss/diff: 0.853894, loss/stop: 0.017510, lr: 0.000006, epoch: 1.019643, grad_norm: 0.116363, log interval: 7.45s
[train] step 860: loss/diff: 0.809989, loss/stop: 0.007736, lr: 0.000006, epoch: 1.031639, grad_norm: 0.153444, log interval: 7.38s
[train] step 870: loss/diff: 0.606541, loss/stop: 0.013379, lr: 0.000005, epoch: 1.043635, grad_norm: 0.122874, log interval: 7.37s
[train] step 880: loss/diff: 0.656071, loss/stop: 0.016340, lr: 0.000004, epoch: 1.055631, grad_norm: 0.145245, log interval: 7.35s
[train] step 890: loss/diff: 0.628986, loss/stop: 0.018479, lr: 0.000003, epoch: 1.067626, grad_norm: 0.127412, log interval: 7.36s
[train] step 900: loss/diff: 0.811349, loss/stop: 0.086133, lr: 0.000003, epoch: 1.079622, grad_norm: 0.123153, log interval: 7.40s
[train] step 910: loss/diff: 0.835353, loss/stop: 0.035216, lr: 0.000002, epoch: 1.091618, grad_norm: 0.137986, log interval: 7.34s
[train] step 920: loss/diff: 0.723550, loss/stop: 0.035460, lr: 0.000002, epoch: 1.103614, grad_norm: 0.117702, log interval: 7.43s
[train] step 930: loss/diff: 0.524251, loss/stop: 0.040635, lr: 0.000001, epoch: 1.115610, grad_norm: 0.082085, log interval: 7.39s
[train] step 940: loss/diff: 0.761991, loss/stop: 0.017262, lr: 0.000001, epoch: 1.127605, grad_norm: 0.181389, log interval: 6.96s
[train] step 950: loss/diff: 0.613489, loss/stop: 0.018746, lr: 0.000001, epoch: 1.139601, grad_norm: 0.154407, log interval: 6.78s
[train] step 960: loss/diff: 0.735493, loss/stop: 0.012589, lr: 0.000000, epoch: 1.151597, grad_norm: 0.127348, log interval: 7.23s
[train] step 970: loss/diff: 0.559653, loss/stop: 0.050953, lr: 0.000000, epoch: 1.163593, grad_norm: 0.124750, log interval: 6.99s
[train] step 980: loss/diff: 0.874724, loss/stop: 0.016016, lr: 0.000000, epoch: 1.175589, grad_norm: 0.095726, log interval: 6.80s
[train] step 990: loss/diff: 0.700195, loss/stop: 0.024318, lr: 0.000000, epoch: 1.187584, grad_norm: 0.089079, log interval: 6.74s
[train] step 999: loss/diff: 0.763323, loss/stop: 0.046123, lr: 0.000000, epoch: 1.198381, grad_norm: 0.233540, log interval: 6.22s
[val] step 999: loss/total: 0.773499, loss/diff: 0.742889, loss/stop: 0.030610, log interval: 2.10s
[Audio] Starting audio generation for 2 samples at step 999
[Audio] Loaded reference audio for sample 0: duration=14.35s
[Audio] Generating sample 0 with text: 'juNgu mavarip vixliri ministiri bir wAkillAr vOmik...'
13%|██████████████████▊ | 69/514 [00:17<01:52, 3.97it/s]
[Audio] Generated audio for sample 0: duration=11.20s
[Audio] Created mel spectrogram figure for sample 0
[Audio] Loaded reference audio for sample 1: duration=6.99s
[Audio] Generating sample 1 with text: 'radoketsniN mijAzi juxqun vot yUrAk ziyaliy baxqil...'
14%|██████████████████▉ | 33/244 [00:08<00:53, 3.94it/s]
[Audio] Generated audio for sample 1: duration=5.44s
[Audio] Created mel spectrogram figure for sample 1
[rank0]:[W606 20:22:46.740967195 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
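A quick way to check a log like this one is to parse the `[train]` lines and compare the last logged step against `max_steps`. The following is a minimal sketch that assumes the line format shown above. The helper name `summarize` and the `window` parameter are hypothetical and are not part of the VoxCPM training scripts.

```python
import re

# Matches lines such as:
# [train] step 10: loss/diff: 0.830405, loss/stop: 0.070261, lr: ...
TRAIN_RE = re.compile(
    r"\[train\] step (\d+): loss/diff: ([\d.]+), loss/stop: ([\d.]+)"
)


def summarize(log_text, max_steps, window=10):
    """Report the last logged step and the mean losses at the start and end."""
    rows = [(int(s), float(d), float(p)) for s, d, p in TRAIN_RE.findall(log_text)]
    if not rows:
        raise ValueError("no [train] lines found")

    def mean(values):
        return sum(values) / len(values)

    head, tail = rows[:window], rows[-window:]
    last_step = rows[-1][0]
    return {
        "last_step": last_step,
        # Steps are 0-indexed in the log, so the final step is max_steps - 1.
        "reached_max_steps": last_step >= max_steps - 1,
        "diff_head": mean([r[1] for r in head]),
        "diff_tail": mean([r[1] for r in tail]),
        "stop_head": mean([r[2] for r in head]),
        "stop_tail": mean([r[2] for r in tail]),
    }


if __name__ == "__main__":
    sample = "\n".join([
        "[train] step 0: loss/diff: 0.795605, loss/stop: 0.094951, lr: 0.000004",
        "[train] step 500: loss/diff: 0.838311, loss/stop: 0.003871, lr: 0.000058",
        "[train] step 999: loss/diff: 0.763323, loss/stop: 0.046123, lr: 0.000000",
    ])
    print(summarize(sample, max_steps=1000, window=1))
```

In the log above, the last `[train]` line is step 999 with `max_steps: 1000`, and it is followed by a validation pass and audio generation at step 999. The final `ProcessGroupNCCL` message is a warning about `destroy_process_group()` not being called before the program exited, not an error raised during training.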

Did this training run succeed? Was it interrupted at any point? And how much VRAM does training actually require?
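One way to reason about the VRAM question: if the training script uses plain data parallelism (an assumption here; the log only shows that an NCCL process group was in use), every GPU holds a full copy of the weights, gradients, and optimizer state, so the limit that matters is 24 GB per card rather than 96 GB in total. The sketch below is a rough estimate of that per-GPU training state for full-parameter fine-tuning with an Adam-style optimizer. The parameter count is a placeholder and not VoxCPM2's actual size, the byte sizes assume fp32 weights and gradients plus two fp32 optimizer moments, and activation memory is left as a separate input.

```python
def training_state_gib(n_params, weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Approximate per-GPU memory for weights + gradients + optimizer state.

    Defaults assume fp32 weights, fp32 gradients, and two fp32 Adam moments.
    Activations, buffers, and allocator overhead are NOT included.
    """
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 2**30


def fits_per_gpu(n_params, gpu_gib, activation_gib=0.0, **byte_sizes):
    """Check the estimate against a single GPU's memory, not the cluster total."""
    needed = training_state_gib(n_params, **byte_sizes) + activation_gib
    return needed <= gpu_gib


if __name__ == "__main__":
    n_params = 2_000_000_000  # placeholder, NOT the real VoxCPM2 parameter count
    print(f"training state: {training_state_gib(n_params):.1f} GiB per GPU")
    print("fits in 24 GiB:", fits_per_gpu(n_params, 24))
    print("fits in 40 GiB:", fits_per_gpu(n_params, 40))
```

LoRA sidesteps most of this cost because gradients and optimizer state are kept only for the small adapter matrices, which would explain why the LoRA run fits on the 3090s while full fine-tuning does not.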
