feat: Add Megatron-LM based training #439
Conversation
save_period: 10

policy:
  training_backend: "hf"
do you think we can somehow reduce the number of flags?
- policy.training_backend="hf" | "megatron"
- policy.dtensor_cfg.enabled=true
- policy.megatron_cfg.enabled=true

so that users only need to set one, and so that enabling one backend via ...enabled=true doesn't surprise them by also requiring another config value to be set.
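Something like the following could work (a hedged sketch of one option; the helper name and the exact key layout beyond the three flags above are assumptions, not part of this PR):

# Hypothetical helper: derive the per-backend "enabled" flags from a single
# policy.training_backend value so users only set one knob and the nested
# configs can't silently disagree.
def resolve_training_backend(policy_cfg: dict) -> str:
    backend = policy_cfg.get("training_backend", "hf")
    if backend not in ("hf", "megatron"):
        raise ValueError(f"unknown training_backend: {backend!r}")
    policy_cfg.setdefault("dtensor_cfg", {})["enabled"] = backend == "hf"
    policy_cfg.setdefault("megatron_cfg", {})["enabled"] = backend == "megatron"
    return backend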
VLLM = "uv run --locked --extra vllm"

# Megatron-Core (and NeMo deps)
MCORE = "uv run --extra mcore --no-build-isolation"
The one below is more reliable because of the reinstall, so this one can be deleted.
def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
):
Suggested change:
-    ):
+    ) -> torch.Tensor:
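A sketch of what the annotated helper could look like; the signature comes from the context above, but the body here is an assumption rather than the PR's actual implementation:

import torch
import torch.distributed as dist


def broadcast_tensor(
    tensor: torch.Tensor | None, src_rank: int, group: dist.ProcessGroup
) -> torch.Tensor:
    """Broadcast `tensor` from global rank `src_rank` to every rank in `group`.

    Non-source ranks may pass None: they first receive shape/dtype metadata,
    allocate a matching buffer, then receive the data. (Illustrative only.)
    """
    is_src = dist.get_rank() == src_rank
    # 1) Broadcast metadata so receivers can allocate a buffer of the right shape/dtype.
    meta = [tuple(tensor.shape), tensor.dtype] if is_src else [None, None]
    dist.broadcast_object_list(meta, src=src_rank, group=group)
    shape, dtype = meta
    # 2) Broadcast the data into a freshly allocated buffer on non-source ranks.
    if not is_src:
        tensor = torch.empty(
            shape, dtype=dtype, device=torch.device("cuda", torch.cuda.current_device())
        )
    dist.broadcast(tensor, src=src_rank, group=group)
    return tensor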
# limitations under the License.


def import_model_from_hf_name(hf_model_name: str, output_path: str):
how about the following?
Suggested change:
-def import_model_from_hf_name(hf_model_name: str, output_path: str):
+def convert_hf_to_mcore_ckpt(hf_model_name: str, output_path: str):
    return global_layer_num


class SafeDict(dict):
could you maybe add a comment that this is used for partial formatting?
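For context, the usual shape of this pattern (a sketch, not the PR's exact class body), with the requested comment baked in:

class SafeDict(dict):
    """dict used for *partial* string formatting with str.format_map().

    Unknown placeholders are re-emitted as "{key}" instead of raising
    KeyError, so a template can be filled in several passes.
    """

    def __missing__(self, key):
        return "{" + key + "}"


# "{run}/{step}".format_map(SafeDict(run="exp1")) -> "exp1/{step}"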
torch.distributed.broadcast_object_list(
    target_keys, src=owner_pp_global_rank, group=pp_group
)
if "None" in target_keys[0]:
should this be:
| if "None" in target_keys[0]: | |
| if None in target_keys[0]: |
?
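For what it's worth, the two checks behave differently if the list holds real None placeholders rather than the string "None" (the contents below are hypothetical):

keys = ["decoder.layers.0.mlp.weight", None]  # hypothetical broadcast payload
"None" in keys  # False: no element equals the string "None"
None in keys    # True: the None placeholder is present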
    tensor_to_send = hf_mapping[target_key]
else:
    tensor_to_send = None
# Broadcast tensor metadata (shape and dtype) to allocate GPU buffer on receiving ranks.
should this use your broadcast_tensor util?
if megatron_checkpoint_home is not None:
    pretrained_path = f"{megatron_checkpoint_home}/{hf_model_name}"
else:
    pretrained_path = f"/opt/checkpoints/tron/{hf_model_name}"
maybe this can be moved to the signature for readability?
Also, is it possible to wire checkpoint_dir up to this? If someone doesn't mount /opt/checkpoints into the container, the container FS may balloon for large models and you'll run out of disk (especially on k8s with large models).
Hmm, I wonder how we should handle this. This dir should be analogous to HF_HOME. Maybe it should be configurable via an env var in the same way?
Yeah, an env var is probably best, so it can be configured when someone has a big model (like DeepSeek).
Another option: maybe we can also store things under HF_HOME to avoid having too many env vars to set. One place to get it is https://github.com/huggingface/huggingface_hub/blob/8809f4412b3ff975db4cb9e598e2e758669ceec9/src/huggingface_hub/constants.py#L130. Presumably a user would mount that in, so maybe we can claim a dir alongside HF_HOME/hub?
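A sketch of the resolution order being discussed; the env var name and the directory layout under HF_HOME are assumptions, not decisions made in this thread:

import os
from pathlib import Path


def resolve_megatron_checkpoint_home(explicit: str | None = None) -> Path:
    """Pick the Megatron checkpoint cache dir. Names are illustrative only."""
    if explicit is not None:  # value plumbed from the config / function signature
        return Path(explicit)
    env = os.environ.get("NRL_MEGATRON_CHECKPOINT_HOME")  # hypothetical env var
    if env is not None:
        return Path(env)
    # Fall back to a directory alongside HF_HOME/hub, which users typically mount in.
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    if os.path.isdir(hf_home):
        return Path(hf_home) / "megatron_ckpts"
    return Path("/opt/checkpoints/tron")  # current hard-coded default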
model_cfg.parallel_output = True

checkpoint_config = CheckpointConfig(
    save_interval=100,
should this be plumbed from master cfg?
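For illustration (CheckpointConfig below is a stand-in dataclass, and the key path is assumed from the save_period field shown earlier, not the PR's actual wiring):

from dataclasses import dataclass


@dataclass
class CheckpointConfig:  # stand-in for the real class in the diff
    save_interval: int


master_cfg = {"checkpointing": {"save_period": 10}}  # mirrors the YAML context above
checkpoint_config = CheckpointConfig(
    save_interval=master_cfg["checkpointing"]["save_period"]
)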
        self.converter_type = self.cfg["megatron_cfg"]["converter_type"]
        self._held_gather_buffer = None

    def is_alive(self):
what's this method for?
    The logprob of input token i is specified at position i in the output logprobs tensor.
    """
    no_grad = torch.no_grad()
    no_grad.__enter__()
any reason for the explicit ctx manager enter/exit?
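For reference, the two idiomatic alternatives (function and argument names below are made up for illustration):

import torch


@torch.no_grad()  # option 1: decorate the whole method
def get_logprobs_decorated(model, batch):
    return model(batch)


def get_logprobs_scoped(model, batch):
    # option 2: scope only the forward pass; __exit__ runs on every path,
    # including early returns and exceptions.
    with torch.no_grad():
        return model(batch)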
    - generation_lengths: Lengths of each response
    """
    no_grad = torch.no_grad()
    no_grad.__enter__()
same question about ctx manager
)

# detokenize the prompts
# detokenized_prompts = [
is this comment needed?
        self.model.eval()
        self.offload_before_refit()

    def prepare_for_training(self, *args, **kwargs):
I wasn't able to find any references, but is there any reason this method (and its interface) accepts *args, **kwargs?
No description provided.