mxfp8 params without dp comm overlap will cause NaN #2272

@kunlunl

Describe the bug

Using --fp8-param-gather and --reuse-grad-buf-for-mxfp8-param-ag without --overlap-grad-reduce and --overlap-param-gather causes NaN loss.

If all four flags (--fp8-param-gather, --reuse-grad-buf-for-mxfp8-param-ag, --overlap-grad-reduce, and --overlap-param-gather) are enabled, training runs without NaN.

Steps/Code to reproduce bug

Enable --fp8-recipe=mxfp8 --fp8-param-gather --reuse-grad-buf-for-mxfp8-param-ag.

Do not pass --overlap-grad-reduce or --overlap-param-gather.

Expected behavior

Training should converge correctly without DP communication overlap.

Additional context

MRs have been created to temporarily prevent using mxfp8 params without DP communication overlap.
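
For illustration only, a temporary guard of this kind might look roughly like the sketch below. It is not the code from the MRs. The function name `check_mxfp8_param_gather` is hypothetical, and the attribute names are assumed to be the argparse-style forms of the flags mentioned in this issue. The sketch also assumes that both overlap flags must be set for the configuration to be safe, since that is the only combination reported as working here.

```python
from argparse import Namespace


def check_mxfp8_param_gather(args: Namespace) -> None:
    """Reject the flag combination reported in this issue to produce NaN loss.

    Hypothetical helper: attribute names mirror the CLI flags from the issue
    and are assumptions, not confirmed internals.
    """
    # The affected setup: mxfp8 recipe with fp8 param gather and grad-buffer reuse.
    uses_mxfp8_params = (
        getattr(args, "fp8_recipe", None) == "mxfp8"
        and getattr(args, "fp8_param_gather", False)
        and getattr(args, "reuse_grad_buf_for_mxfp8_param_ag", False)
    )
    # Assumption: only the case with BOTH overlap flags enabled is known to work.
    has_dp_overlap = (
        getattr(args, "overlap_grad_reduce", False)
        and getattr(args, "overlap_param_gather", False)
    )
    if uses_mxfp8_params and not has_dp_overlap:
        raise ValueError(
            "mxfp8 params with --fp8-param-gather and "
            "--reuse-grad-buf-for-mxfp8-param-ag currently require "
            "--overlap-grad-reduce and --overlap-param-gather"
        )
```

Failing early at argument validation turns a silent NaN during training into an immediate, explainable error until the underlying bug is fixed.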
