Open
Description
Describe the bug
Using --fp8-param-gather and --reuse-grad-buf-for-mxfp8-param-ag without --overlap-grad-reduce and --overlap-param-gather causes NaN loss.
If all four flags (--fp8-param-gather, --reuse-grad-buf-for-mxfp8-param-ag, --overlap-grad-reduce, and --overlap-param-gather) are enabled, training works fine.
Steps/Code to reproduce bug
Enable --fp8-recipe=mxfp8 --fp8-param-gather --reuse-grad-buf-for-mxfp8-param-ag.
Do not pass --overlap-grad-reduce or --overlap-param-gather.
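For an A/B comparison, the two flag sets from the steps above can be written out as a small helper. This is only a sketch: the flag strings are taken from this report, while `BASE_FLAGS`, `OVERLAP_FLAGS`, and `repro_flags` are hypothetical names, and the training entry point and all other arguments are deliberately left out.

```python
from typing import List

# Flags taken from this report; they enable MXFP8 params with param gather.
BASE_FLAGS = [
    "--fp8-recipe=mxfp8",
    "--fp8-param-gather",
    "--reuse-grad-buf-for-mxfp8-param-ag",
]

# DP communication overlap flags; omitting them is the failing case.
OVERLAP_FLAGS = [
    "--overlap-grad-reduce",
    "--overlap-param-gather",
]


def repro_flags(with_overlap: bool) -> List[str]:
    """Return the flag list for the working (True) or failing (False) run."""
    return BASE_FLAGS + (OVERLAP_FLAGS if with_overlap else [])


if __name__ == "__main__":
    print("NaN loss reported with:", " ".join(repro_flags(with_overlap=False)))
    print("Reported working with: ", " ".join(repro_flags(with_overlap=True)))
```

Appending each list to an otherwise identical training command should isolate the difference to DP communication overlap.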
Expected behavior
Training should converge correctly without DP communication overlap.
Additional context
Created MRs that temporarily prevent using mxfp8 params when DP communication overlap is disabled.
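A guard of that kind might look roughly like the sketch below. It is not the content of the MRs: the function name `check_mxfp8_param_gather` is hypothetical, and the attribute names assume the usual argparse mapping from the flags in this report (dashes become underscores). Only the "both overlap flags off" and "both on" cases are reported here, so the sketch conservatively requires both.

```python
import argparse


def check_mxfp8_param_gather(args: argparse.Namespace) -> None:
    """Reject the flag combination reported to produce NaN loss.

    Attribute names are assumed from the CLI flags; missing attributes
    are treated as disabled.
    """
    uses_mxfp8_params = (
        getattr(args, "fp8_recipe", None) == "mxfp8"
        and getattr(args, "fp8_param_gather", False)
        and getattr(args, "reuse_grad_buf_for_mxfp8_param_ag", False)
    )
    has_dp_comm_overlap = (
        getattr(args, "overlap_grad_reduce", False)
        and getattr(args, "overlap_param_gather", False)
    )
    if uses_mxfp8_params and not has_dp_comm_overlap:
        raise ValueError(
            "--fp8-param-gather with --reuse-grad-buf-for-mxfp8-param-ag "
            "currently requires --overlap-grad-reduce and "
            "--overlap-param-gather (NaN loss otherwise)."
        )


if __name__ == "__main__":
    failing = argparse.Namespace(
        fp8_recipe="mxfp8",
        fp8_param_gather=True,
        reuse_grad_buf_for_mxfp8_param_ag=True,
        overlap_grad_reduce=False,
        overlap_param_gather=False,
    )
    try:
        check_mxfp8_param_gather(failing)
    except ValueError as err:
        print(f"rejected: {err}")
```

Failing fast at argument validation turns a silent NaN loss into an explicit error until the underlying bug is fixed.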