Conversation

@SahilJain314
Contributor

No description provided.

SahilJain314 and others added 30 commits May 11, 2025 18:39
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
…{Dict, List, Tuple} to primitive dict, list tuple

Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Terry Kong <[email protected]>

wip

Signed-off-by: Terry Kong <[email protected]>

fix it

Signed-off-by: Terry Kong <[email protected]>

patthing fix

Signed-off-by: Terry Kong <[email protected]>

wip

Signed-off-by: Terry Kong <[email protected]>

doesn't look like i needed that

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

revert stuff

Signed-off-by: Terry Kong <[email protected]>

make it better

Signed-off-by: Terry Kong <[email protected]>

go

Signed-off-by: Terry Kong <[email protected]>

cleanup

Signed-off-by: Terry Kong <[email protected]>

mix it up

Signed-off-by: Terry Kong <[email protected]>

touch up

Signed-off-by: Terry Kong <[email protected]>

clean

Signed-off-by: Terry Kong <[email protected]>

better

Signed-off-by: Terry Kong <[email protected]>

clean up

Signed-off-by: Terry Kong <[email protected]>

add it in

Signed-off-by: Terry Kong <[email protected]>

mcore extra

Signed-off-by: Terry Kong <[email protected]>

instructions

Signed-off-by: Terry Kong <[email protected]>

works

Signed-off-by: Terry Kong <[email protected]>

revert to 3.10, 3.12 didn't seem necessary

Signed-off-by: Terry Kong <[email protected]>

ci has to recursively clone

Signed-off-by: Terry Kong <[email protected]>

bump build workflow

Signed-off-by: Terry Kong <[email protected]>

add megatron.core import

Signed-off-by: Terry Kong <[email protected]>

potential fix for unit test on CI

Signed-off-by: Terry Kong <[email protected]>

fix the test

Signed-off-by: Terry Kong <[email protected]>

this should fix test (it was a collision of namespace)

Signed-off-by: Terry Kong <[email protected]>

remove fp8 from test

Signed-off-by: Terry Kong <[email protected]>

add shallow

Signed-off-by: Terry Kong <[email protected]>

fix base build

Signed-off-by: Terry Kong <[email protected]>

fix instructions

Signed-off-by: Terry Kong <[email protected]>

fix the messed up indenting

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

try nesting

Signed-off-by: Terry Kong <[email protected]>

okay, got it to work

Signed-off-by: Terry Kong <[email protected]>

fix up the readme

Signed-off-by: Terry Kong <[email protected]>

ok

Signed-off-by: Terry Kong <[email protected]>

touchup

Signed-off-by: Terry Kong <[email protected]>

add the lock file back

Signed-off-by: Terry Kong <[email protected]>

got

Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
SahilJain314 and others added 21 commits May 28, 2025 14:17
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
…on 0.3.0rc0 (#277)

Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Yuki Huang <[email protected]>
Signed-off-by: Parth Chadha <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
Co-authored-by: Terry Kong <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Co-authored-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
Signed-off-by: Sahil Jain <[email protected]>
save_period: 10

policy:
  training_backend: "hf"
Contributor

do you think we can somehow reduce the number of flags?

  • policy.training_backend="hf" | "megatron"
  • policy.dtensor_cfg.enabled=true
  • policy.megatron_cfg.enabled=true

so that users only need to set one, and so that enabling one backend via ...enabled=true doesn't surprise them by also requiring another config value to be set?
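
One option (a minimal sketch with a hypothetical helper and an assumed hf-implies-dtensor mapping, not this PR's actual code) is to derive both enabled flags from the single training_backend value during config validation:

def resolve_backend_flags(policy_cfg: dict) -> dict:
    # Derive dtensor_cfg.enabled / megatron_cfg.enabled from policy.training_backend
    # so users only ever set the one knob.
    backend = policy_cfg.get("training_backend", "hf")
    if backend not in ("hf", "megatron"):
        raise ValueError(f"unknown training_backend: {backend!r}")
    policy_cfg.setdefault("dtensor_cfg", {})["enabled"] = backend == "hf"
    policy_cfg.setdefault("megatron_cfg", {})["enabled"] = backend == "megatron"
    return policy_cfg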

VLLM = "uv run --locked --extra vllm"

# Megatron-Core (and NeMo deps)
MCORE = "uv run --extra mcore --no-build-isolation"
Contributor

the one below is more reliable because of the reinstall, so this one can be deleted


def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
):
Contributor

Suggested change
):
) -> torch.Tensor:

# limitations under the License.


def import_model_from_hf_name(hf_model_name: str, output_path: str):
Contributor

how about the following?

Suggested change
def import_model_from_hf_name(hf_model_name: str, output_path: str):
def convert_hf_to_mcore_ckpt(hf_model_name: str, output_path: str):

    return global_layer_num


class SafeDict(dict):
Contributor

could you maybe add a comment that this is used for partial formatting?
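
For reference, the usual idiom a dict subclass like this serves (assuming that's what it's for here, since the body isn't in this hunk) is partial formatting via str.format_map, leaving unknown placeholders intact:

class SafeDict(dict):
    # Missing keys render back as "{key}" so str.format_map can fill a template
    # in several passes without raising KeyError.
    def __missing__(self, key):
        return "{" + key + "}"

template = "{model} trained for {steps} steps"
print(template.format_map(SafeDict(model="llama")))  # -> "llama trained for {steps} steps"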

torch.distributed.broadcast_object_list(
    target_keys, src=owner_pp_global_rank, group=pp_group
)
if "None" in target_keys[0]:
Contributor

should this be:

Suggested change
if "None" in target_keys[0]:
if None in target_keys[0]:

?
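
For what it's worth, the two checks do different things (tiny illustration with made-up values standing in for target_keys[0]):

keys = ["decoder.layers.0.weight", None]
print("None" in keys)  # False: looks for the string "None"
print(None in keys)    # True: finds the actual None object
# And if target_keys[0] is a string, "None" in it is a substring test:
print("None" in "NoneType")  # True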

    tensor_to_send = hf_mapping[target_key]
else:
    tensor_to_send = None
# Broadcast tensor metadata (shape and dtype) to allocate GPU buffer on receiving ranks.
Contributor

should this use your broadcast_tensor util?
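
For context, a broadcast_tensor-style helper along these lines would cover the metadata step too (a sketch under assumptions; not necessarily identical to the util defined earlier in this PR):

import torch
import torch.distributed as dist

def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
) -> torch.Tensor:
    # Broadcast (shape, dtype) first so non-source ranks can allocate a buffer,
    # then broadcast the data itself. Assumes a CUDA buffer on receiving ranks.
    meta = [(tuple(tensor.shape), tensor.dtype)] if dist.get_rank() == src_rank else [None]
    dist.broadcast_object_list(meta, src=src_rank, group=group)
    shape, dtype = meta[0]
    if dist.get_rank() != src_rank:
        tensor = torch.empty(shape, dtype=dtype, device=torch.cuda.current_device())
    dist.broadcast(tensor, src=src_rank, group=group)
    return tensor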

if megatron_checkpoint_home is not None:
    pretrained_path = f"{megatron_checkpoint_home}/{hf_model_name}"
else:
    pretrained_path = f"/opt/checkpoints/tron/{hf_model_name}"
Contributor

maybe this can be moved to the signature for readability?

also, is it possible to wire the checkpoint_dir up to this one? If someone doesn't mount /opt/checkpoints into the container, the container FS may balloon for large models and you'll run out of disk (especially for k8s + large models)

Contributor Author

Hmm I wonder how we should handle it. This dir should be analogous to HF_HOME. Maybe it should be env-var'd in the same way?

Contributor

Yeah, maybe an env var is best so it can be configured in case someone has a big model (like DeepSeek)

Contributor

Another option: maybe we can also store things in HF_HOME to avoid having too many env vars to set. One place to get it from is here: https://github.com/huggingface/huggingface_hub/blob/8809f4412b3ff975db4cb9e598e2e758669ceec9/src/huggingface_hub/constants.py#L130

presumably a user would mount that in, so maybe we can claim a dir alongside HF_HOME/hub?
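
Something like the following resolution order could work (all names below are hypothetical, just to make the idea concrete): an explicit env var wins, otherwise claim a directory next to the HF cache so an existing HF_HOME mount covers it.

import os

def default_megatron_checkpoint_home() -> str:
    # NRL_MEGATRON_CHECKPOINT_HOME is a hypothetical variable name, not something this PR defines.
    explicit = os.environ.get("NRL_MEGATRON_CHECKPOINT_HOME")
    if explicit:
        return explicit
    # Fall back to a sibling of HF_HOME/hub (default HF_HOME is ~/.cache/huggingface);
    # the subdirectory name is likewise an assumption.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    return os.path.join(hf_home, "nemo_rl", "megatron_checkpoints")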

model_cfg.parallel_output = True

checkpoint_config = CheckpointConfig(
    save_interval=100,
Contributor

should this be plumbed from master cfg?
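
i.e. something along these lines (sketch only; the checkpointing.save_period key path is an assumption about the schema, and this fragment lives in the same context as the hunk above):

checkpoint_config = CheckpointConfig(
    save_interval=master_cfg["checkpointing"]["save_period"],  # instead of a hard-coded 100
)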

        self.converter_type = self.cfg["megatron_cfg"]["converter_type"]
        self._held_gather_buffer = None

    def is_alive(self):
Contributor

what's this method for?

The logprob of input token i is specified at position i in the output logprobs tensor.
"""
no_grad = torch.no_grad()
no_grad.__enter__()
Contributor

any reason for the explicit ctx manager enter/exit?
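
If the explicit enter/exit isn't needed for some control-flow reason, the standard forms are (plain PyTorch, nothing PR-specific):

import torch

def compute_logprobs_scoped(model, batch):
    # Scoped form: grad tracking is re-enabled even if the body raises.
    with torch.no_grad():
        return model(batch)

@torch.no_grad()
def compute_logprobs_decorated(model, batch):
    # Decorator form: the whole function runs without grad tracking.
    return model(batch)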

- generation_lengths: Lengths of each response
"""
no_grad = torch.no_grad()
no_grad.__enter__()
Contributor

same question about ctx manager

)

# detokenize the prompts
# detokenized_prompts = [
Contributor

is this comment needed?

        self.model.eval()
        self.offload_before_refit()

    def prepare_for_training(self, *args, **kwargs):
Contributor

I wasn't able to find any references, but is there any reason this method (and its interface) accepts *args, **kwargs?

Signed-off-by: Sahil Jain <[email protected]>
github-actions bot added the documentation label on Jun 4, 2025

Labels

CI (Relating to CI), documentation (Improvements or additions to documentation)


9 participants