@youth123 commented Nov 19, 2025

PR Category

Train

PR Types

New Features

PR Description

  • Support loading and saving checkpoints in the NeMo zarr format
  • Support training with packed sequences
  • Fix wandb finalization failing to find the latest_checkpointed_iteration file
  • Fix LoRA not loading layernorm weights and not supporting the NeMo zarr format
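
For readers unfamiliar with packed sequences, the general idea can be sketched as follows. This is a minimal, framework-free illustration, not the code in this PR: the function name `pack_sequences` is hypothetical, and `cu_seqlens` (cumulative sequence lengths) is the name commonly used for the boundary offsets, assumed here rather than taken from the PR.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Concatenate variable-length token lists into one packed list.

    Returns the packed tokens and the cumulative sequence lengths that
    mark where each original sequence starts and ends.
    """
    packed = [tok for seq in seqs for tok in seq]
    # cu_seqlens[i] is the start offset of sequence i; the last entry is the total length.
    cu_seqlens = [0, *accumulate(len(seq) for seq in seqs)]
    return packed, cu_seqlens


packed, cu_seqlens = pack_sequences([[1, 2, 3], [4, 5], [6]])
```

Sequence `i` can then be recovered as `packed[cu_seqlens[i]:cu_seqlens[i + 1]]`, so attention can be restricted to each sequence without padding tokens.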

The checkpoint file format is as follows:
load zarr format:
-context
-weights
-module.decoder.xxx._extra_state
-module.decoder.xxx.weight
-optimizer.state.fp32_param.xxx.weight
-optimizer.state.fp32_param.xxx.weight.sync
common.pt
metadata.json
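
As a rough illustration of how the entry names above separate into model and optimizer state, here is a small sketch. It only inspects name prefixes taken from the listing (`module.` and `optimizer.`); the function name `group_entries` and the group labels are hypothetical and not part of this PR.

```python
from collections import defaultdict


def group_entries(names):
    """Group checkpoint entry names by their leading prefix."""
    groups = defaultdict(list)
    for name in names:
        if name.startswith("module."):
            groups["model"].append(name)
        elif name.startswith("optimizer."):
            groups["optimizer"].append(name)
        else:
            groups["other"].append(name)
    return dict(groups)


# Names copied from the listing; "xxx" is the placeholder used there.
groups = group_entries([
    "module.decoder.xxx._extra_state",
    "module.decoder.xxx.weight",
    "optimizer.state.fp32_param.xxx.weight",
    "common.pt",
])
```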

save zarr format:
-iter_xxx
-module.decoder.xxx._extra_state
-module.decoder.xxx.weight
-optimizer.state.fp32_param.xxx.weight
-optimizer.state.fp32_param.xxx.weight.sync
common.pt
metadata.json
latest_checkpointed_iteration.txt
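
Given the save layout above, a consumer such as a logging finalizer could locate the most recent checkpoint roughly as sketched below. This is an assumption-laden illustration, not the code in this PR: the function name `resolve_latest_checkpoint` is hypothetical, the tracker file is assumed to contain a plain integer, and the `iter_` suffix is matched numerically so that no particular zero-padding is assumed.

```python
from pathlib import Path


def resolve_latest_checkpoint(save_dir):
    """Return the iter_* directory named by the tracker file, or None."""
    save_dir = Path(save_dir)
    tracker = save_dir / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None  # nothing has been checkpointed yet
    text = tracker.read_text().strip()
    if not text.isdigit():
        return None  # assumption: only numeric iterations are handled here
    iteration = int(text)
    for child in sorted(save_dir.glob("iter_*")):
        suffix = child.name[len("iter_"):]
        # Compare numerically so the padding width of the directory name does not matter.
        if child.is_dir() and suffix.isdigit() and int(suffix) == iteration:
            return child
    return None
```

Returning `None` instead of raising lets the caller skip finalization cleanly when no checkpoint has been written.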

The comparison of NeMo and FlagScale under different distributed strategies is as follows:
LoRA:

  • dp2: [image]
  • tp2: [image]
  • pp2: [image]

Full:

  • dp2: [image]
  • tp2: [image]
  • pp2: [image]

@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
