Description
In order to solve an out-of-memory issue while training the ASR recipe (using Whisper Turbo and EuroLLM 9B on a custom dataset with a 40GB GPU), I decided to quantize the LLM.
The recipe has the option
train_config.quantization=false
which can be set to true. If I do so, I no longer get any out-of-memory error; instead, training runs, but the loss is NaN and the accuracy stays at 0.0:
```
Training Epoch: 1/3, step 24/25 completed (loss: nan, acc: 0.0): 100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:42<00:00, 1.71s/it]
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - Epoch 1: train_perplexity=nan, train_epoch_loss=nan, epoch time 43.23003945290111s
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - Max CUDA memory allocated was 18 GB
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - Max CUDA memory reserved was 25 GB
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - Peak active CUDA memory was 18 GB
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - Cuda Malloc retires : 0
[2025-04-28 12:51:10][slam_llm.utils.train_utils][INFO] - CPU Total Peak Memory consumed during the train (max): 4 GB
```
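For reference, here is a minimal sketch of what I assume the quantization flag does under the hood, namely loading the LLM in 8-bit with bitsandbytes through Transformers. The model id and the exact loading call are my own illustration, not the recipe's actual code:

```python
# Minimal sketch (not the actual SLAM-LLM code) of what I assume
# train_config.quantization=true amounts to: loading the LLM in 8-bit
# via bitsandbytes while leaving the rest of the pipeline unchanged.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "utter-project/EuroLLM-9B-Instruct"  # the LLM from my setup; adjust if needed

# 8-bit quantization config from bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

llm = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```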
Has anyone had a similar issue and found the problem?