Skip to content

Conversation

@nil0x9
Copy link
Contributor

@nil0x9 nil0x9 commented Dec 5, 2025

  1. use reduced tokens stats for exp_tgs;
  2. rename some variables (drop the reduced_ prefix, use total_ instead for clarity);
  3. drop maxvio stats in loss_log as it is already covered in internal metrics.

1. use reduced tokens stats for exp_tgs;
2. rename some variables (drop the reduced_ prefix, use total_ instead
for clarity);
3. drop maxvio stats in loss_log as it is already covered in internal
metrics.
self._reduced_consumed_tokens += reduced_step_consumed_tokens

self._exp_consumed_tokens += step_consumed_tokens
self._total_consumed_tokens += reduced_step_consumed_tokens
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里命名要注意保持和loss在代码中和日志中的一致。
比如 total_loss表示单卡上总loss(llm, balance, zloss), reduced_llm_loss表示各卡聚合后llm_loss。
所以建议 要么保留原来的 reduced_consumed_tokens命名,要么将 loss的命名也改成和这里一致。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里确实有歧义,在现在的 codebase 里,total_ 被用来表示时间维度上的总和 (e.g., total_[step|epoch|iter]),也有被用来表达空间上的总和(total_GPU_per_node),total_loss是表示几种 loss 的聚合, 显得有点混乱。感觉是应该有个好的 naming convention 来把这些统一的区分开。

这里的_reduced_consumed_tokens 其实同时兼有 时间维度上的累积,和空间维度的聚合两个含义,所以感觉有点奇怪,我想一想要怎么改善这个 PR。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants