feat: Add Megatron-LM based training #439
Conversation
save_period: 10

policy:
  training_backend: "hf"
do you think we can somehow reduce the number of flags?
- policy.training_backend="hf" | "megatron"
- policy.dtensor_cfg.enabled=true
- policy.megatron_cfg.enabled=true

so that users only need to set one, and so that enabling one backend via ...enabled=true doesn't surprise them by also requiring another config value to be set.
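Something like the following could work (a hedged sketch of one option; the helper name and the exact key layout beyond the three flags above are assumptions, not part of this PR):

# Hypothetical helper: derive the per-backend "enabled" flags from a single
# policy.training_backend value so users only set one knob and the nested
# configs can't silently disagree.
def resolve_training_backend(policy_cfg: dict) -> str:
    backend = policy_cfg.get("training_backend", "hf")
    if backend not in ("hf", "megatron"):
        raise ValueError(f"unknown training_backend: {backend!r}")
    policy_cfg.setdefault("dtensor_cfg", {})["enabled"] = backend == "hf"
    policy_cfg.setdefault("megatron_cfg", {})["enabled"] = backend == "megatron"
    return backend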
VLLM = "uv run --locked --extra vllm"

# Megatron-Core (and NeMo deps)
MCORE = "uv run --extra mcore --no-build-isolation"
The one below is more reliable because of the reinstall, so this one can be deleted.
def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
):
Suggested change:
-    ):
+    ) -> torch.Tensor:
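A sketch of what the annotated helper could look like; the signature comes from the context above, but the body here is an assumption rather than the PR's actual implementation:

import torch
import torch.distributed as dist


def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
) -> torch.Tensor:
    """Broadcast `tensor` from global rank `src_rank` to every rank in `group`.

    Non-source ranks may pass None: they first receive shape/dtype metadata,
    allocate a matching buffer, then receive the data. (Illustrative only.)
    """
    is_src = dist.get_rank() == src_rank
    # 1) Broadcast metadata so receivers can allocate a buffer of the right shape/dtype.
    meta = [tuple(tensor.shape), tensor.dtype] if is_src else [None, None]
    dist.broadcast_object_list(meta, src=src_rank, group=group)
    shape, dtype = meta
    # 2) Broadcast the data into a freshly allocated buffer on non-source ranks.
    if not is_src:
        tensor = torch.empty(
            shape, dtype=dtype, device=torch.device("cuda", torch.cuda.current_device())
        )
    dist.broadcast(tensor, src=src_rank, group=group)
    return tensor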
# limitations under the License.


def import_model_from_hf_name(hf_model_name: str, output_path: str):
how about the following?
Suggested change:
-def import_model_from_hf_name(hf_model_name: str, output_path: str):
+def convert_hf_to_mcore_ckpt(hf_model_name: str, output_path: str):
    return global_layer_num


class SafeDict(dict):
could you maybe add a comment that this is used for partial formatting?
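For context, the usual shape of this pattern (a sketch, not the PR's exact class body), with the requested comment baked in:

class SafeDict(dict):
    """dict used for *partial* string formatting with str.format_map().

    Unknown placeholders are re-emitted as "{key}" instead of raising
    KeyError, so a template can be filled in several passes.
    """

    def __missing__(self, key):
        return "{" + key + "}"


# "{run}/{step}".format_map(SafeDict(run="exp1")) -> "exp1/{step}"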
torch.distributed.broadcast_object_list(
    target_keys, src=owner_pp_global_rank, group=pp_group
)
if "None" in target_keys[0]:
should this be:
| if "None" in target_keys[0]: | |
| if None in target_keys[0]: |
?
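For what it's worth, the two checks behave differently if the list holds real None placeholders rather than the string "None" (the contents below are hypothetical):

keys = ["decoder.layers.0.mlp.weight", None]  # hypothetical broadcast payload
"None" in keys  # False: no element equals the string "None"
None in keys    # True: the None placeholder is present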
    tensor_to_send = hf_mapping[target_key]
else:
    tensor_to_send = None
# Broadcast tensor metadata (shape and dtype) to allocate GPU buffer on receiving ranks.
should this use your broadcast_tensor util?
if megatron_checkpoint_home is not None:
    pretrained_path = f"{megatron_checkpoint_home}/{hf_model_name}"
else:
    pretrained_path = f"/opt/checkpoints/tron/{hf_model_name}"
maybe this can be moved to the signature for readability?
Also, is it possible to wire checkpoint_dir up to this? If someone doesn't mount /opt/checkpoints into the container, the container FS may balloon for large models and you'll run out of disk (especially on k8s with large models).
Hmm, I wonder how we should handle this. This dir should be analogous to HF_HOME. Maybe it should be configurable via an env var in the same way?
Yeah, an env var is probably best, so it can be configured when someone has a big model (like DeepSeek).
Another option: maybe we can also store things under HF_HOME to avoid having too many env vars to set. One place to get it is https://github.com/huggingface/huggingface_hub/blob/8809f4412b3ff975db4cb9e598e2e758669ceec9/src/huggingface_hub/constants.py#L130. Presumably a user would mount that in, so maybe we can claim a dir alongside HF_HOME/hub?
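A sketch of the resolution order being discussed; the env var name and the directory layout under HF_HOME are assumptions, not decisions made in this thread:

import os
from pathlib import Path


def resolve_megatron_checkpoint_home(explicit: str | None = None) -> Path:
    """Pick the Megatron checkpoint cache dir. Names are illustrative only."""
    if explicit is not None:  # value plumbed from the config / function signature
        return Path(explicit)
    env = os.environ.get("NRL_MEGATRON_CHECKPOINT_HOME")  # hypothetical env var
    if env is not None:
        return Path(env)
    # Fall back to a directory alongside HF_HOME/hub, which users typically mount in.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    if os.path.isdir(hf_home):
        return Path(hf_home) / "megatron_ckpts"
    return Path("/opt/checkpoints/tron")  # current hard-coded default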
model_cfg.parallel_output = True

checkpoint_config = CheckpointConfig(
    save_interval=100,
should this be plumbed from master cfg?
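For illustration (CheckpointConfig below is a stand-in dataclass, and the key path is assumed from the save_period field shown earlier, not the PR's actual wiring):

from dataclasses import dataclass


@dataclass
class CheckpointConfig:  # stand-in for the real class in the diff
    save_interval: int


master_cfg = {"checkpointing": {"save_period": 10}}  # mirrors the YAML context above
checkpoint_config = CheckpointConfig(
    save_interval=master_cfg["checkpointing"]["save_period"]
)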
        self.converter_type = self.cfg["megatron_cfg"]["converter_type"]
        self._held_gather_buffer = None

    def is_alive(self):
what's this method for?
    The logprob of input token i is specified at position i in the output logprobs tensor.
    """
    no_grad = torch.no_grad()
    no_grad.__enter__()
any reason for the explicit ctx manager enter/exit?
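For reference, the two idiomatic alternatives (function and argument names below are made up for illustration):

import torch


@torch.no_grad()  # option 1: decorate the whole method
def get_logprobs_decorated(model, batch):
    return model(batch)


def get_logprobs_scoped(model, batch):
    # option 2: scope only the forward pass; __exit__ runs on every path,
    # including early returns and exceptions.
    with torch.no_grad():
        return model(batch)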
    - generation_lengths: Lengths of each response
    """
    no_grad = torch.no_grad()
    no_grad.__enter__()
same question about ctx manager
)

# detokenize the prompts
# detokenized_prompts = [
is this comment needed?
        self.model.eval()
        self.offload_before_refit()

    def prepare_for_training(self, *args, **kwargs):
I wasn't able to find any references, but is there any reason this method (and its interface) accepts *args, **kwargs?
No description provided.