[QEff. Finetune]: Correcting num_steps trained as per max_train_step and displaying non scaled loss value on console. #527
Conversation
… scaled loss value on console Signed-off-by: Swati Allabadi <[email protected]>
@@ -265,7 +265,7 @@ def train(
         )
         pbar.set_description(
-            f"Training Epoch: {epoch + 1}/{train_config.num_epochs}, step {step + 1}/{len(train_dataloader)} completed (loss: {loss.detach().float()})"
+            f"Training Epoch: {epoch + 1}/{train_config.num_epochs}, step {step + 1}/{len(train_dataloader)} completed (loss: {(loss * num_samples_in_cur_update).detach().float()})"
For batch_size=4, with the 3rd sample being a padded sample, the per-sample losses are L1, L2, L3, and L4. The loss for L3 is zeroed out because it is a padded sample. Here the loss variable is the average of all 4 losses; to make it the average of the 3 real values, we should multiply the loss by 4 and divide it by 3.
Let me know: is this the problem you are trying to solve? If so, how does your solution help?
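The arithmetic in the scenario above can be sketched in plain Python (the loss values and variable names here are purely illustrative, not from the actual QEfficient code):

```python
# Hypothetical per-sample losses for a batch of 4; the 3rd sample is a
# padded sample whose loss has been zeroed out.
per_sample_loss = [1.2, 0.8, 0.0, 1.0]
batch_size = len(per_sample_loss)
num_real_samples = 3  # samples that actually contribute

# Averaging over the whole batch includes the padded zero, which drags
# the value down:
avg_over_batch = sum(per_sample_loss) / batch_size  # 0.75

# Multiplying by batch_size and dividing by num_real_samples recovers
# the average over the 3 real samples only:
avg_over_real = avg_over_batch * batch_size / num_real_samples  # 1.0
```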
This is not the issue. The case you have described is already taken care of at line #200.
With gradient accumulation, the per-step loss is scaled down by gradient_accumulation_steps, and since the loss is printed on the console only after the step finishes, the scaled-down value was being displayed. The change above corrects that by scaling the value back up before printing.
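The scaling the author describes can be sketched as follows (a minimal illustration with made-up numbers; the variable names other than num_samples_in_cur_update are not from the actual code):

```python
# Minimal sketch of gradient-accumulation loss scaling.
gradient_accumulation_steps = 4

# Raw loss for one micro-batch (illustrative value):
raw_loss = 2.0

# For backward(), the loss is divided by gradient_accumulation_steps so
# that gradients summed over the accumulated micro-batches equal the
# gradient of the true average loss:
loss = raw_loss / gradient_accumulation_steps  # 0.5 is what backward() sees

# Printing `loss` directly therefore shows the scaled-down 0.5.
# Multiplying back by the number of micro-batches in the current update
# restores the human-readable value for the progress bar:
num_samples_in_cur_update = gradient_accumulation_steps
display_loss = loss * num_samples_in_cur_update  # 2.0
```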
Can you prefix the title with "[QEff. Finetune]:" so that we can filter on it whenever needed in the future?
Signed-off-by: Swati Allabadi <[email protected]>
@@ -326,7 +325,7 @@ def train(
         )
         if is_rank_zero():
-            tensorboard_updates.add_scalars("loss", {"eval": eval_epoch_loss}, total_train_steps)
+            tensorboard_updates.add_scalars("loss", {"eval": eval_epoch_loss}, total_train_steps - 1)
I think the better design is to use the total_train_steps variable properly. I suggest the changes below to make the design cleaner.
- Remove the update of total_train_steps at L130.
- At L152, at the start of each iteration over the dataloader, update total_train_steps: total_train_steps = len(train_dataloader) * (epoch + 1) + step. This will be valid throughout each step of the dataloader.
- Replace the condition at L157 with the condition at L162; both are doing the same thing.
- Remove the L158 update of total_train_steps.
- Remove the L164 update of total_train_steps.
- The condition at L166 should use >= instead of >.
- At L212, tensorboard should take total_train_steps, not 'total_train_steps - 1'.
- At L328, tensorboard should take total_train_steps, not 'total_train_steps - 1'.
With this there will be no manual +1 or -1 on total_train_steps, which makes it more maintainable and understandable.
Skipped 'Replace the condition at L157 with the condition at L162; both are doing the same thing.' as both are required. Made total_train_steps = len(train_dataloader) * epoch + step, because epoch + 1 would give an incorrect number. Accommodated the rest.
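The step-counter design settled on above can be sketched with a simplified loop (loop bounds and names besides total_train_steps are illustrative):

```python
# Simplified sketch: compute total_train_steps once at the top of each
# dataloader iteration, with no manual +1/-1 adjustments afterwards.
num_epochs = 2
steps_per_epoch = 3  # stands in for len(train_dataloader)

seen = []
for epoch in range(num_epochs):
    for step in range(steps_per_epoch):
        # epoch (not epoch + 1): on the first epoch the counter must
        # start at 0, which is why the suggested epoch + 1 was corrected.
        total_train_steps = steps_per_epoch * epoch + step
        seen.append(total_train_steps)

# The counter runs contiguously across epoch boundaries.
print(seen)  # [0, 1, 2, 3, 4, 5]
```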
LGTM, thanks for the quick fix. :)
…and displaying non scaled loss value on console. (quic#527) Signed-off-by: Swati Allabadi <[email protected]> Co-authored-by: Swati Allabadi <[email protected]>