163 commits
04cb1b0
zarr soft deprecation (#2004)
dimapihtar Oct 28, 2025
6080057
Make `get_asyncio_loop` safe to use repeatedly (#1990)
tdene Oct 28, 2025
c5ac863
Update symmetric registration interface to sync-up with upstream pyto…
youngeunkwon0405 Oct 28, 2025
cd7314a
chore: Update codeowners (#2012)
ko3n1g Oct 28, 2025
75bf979
Deduplicate dynamic engine + coordinator. (#1981)
lmcafee-nvidia Oct 28, 2025
0b06db0
Safely access state dict args in load ckpt (#1957)
maanug-nv Oct 28, 2025
afa7361
Allow mixed-batch sampling in dynamic inference (#1927)
tdene Oct 29, 2025
69d23c4
Stop Nemo_CICD_Test from failing in forks (#2024)
tdene Oct 29, 2025
e640a89
Clean up dynamic inference step (#1992)
tdene Oct 29, 2025
e6e0769
ci: Auto-update copy-pr-bot vetters (#1850)
ko3n1g Oct 29, 2025
75c0721
Have datasets account for tokenizers which incorrectly define PAD (#2…
tdene Oct 29, 2025
c7a9003
ci: Enable integration tests (#2023)
ko3n1g Oct 29, 2025
f9a1fff
ci: Fix build-push-wheel workflow (#2022)
ko3n1g Oct 29, 2025
eb0a744
chore: Update tooling for interactive jobs (#2032)
ko3n1g Oct 29, 2025
d5a9645
revert(hotfix): ci: trustees_override (#2041)
ko3n1g Oct 30, 2025
9458be9
add missing warnings import in model parallel config (#2039)
yashaswikarnati Oct 30, 2025
bb21676
Reduce-scatter implementation with FP32 accumulation (#1967)
deepakn94 Oct 30, 2025
629af78
ci(fix): Workflows on `main` (#2045)
ko3n1g Oct 30, 2025
8b42b9e
build: Bump modelopt (#2046)
ko3n1g Oct 30, 2025
27be0ce
Remove TestCaptureFreezeGC unit test. (#1978)
lmcafee-nvidia Oct 30, 2025
852870c
ci: Add multi-approval action (#2051)
ko3n1g Oct 30, 2025
4c2e1c9
ci(hotfix): Repair codeowners file
ko3n1g Oct 30, 2025
a07e00b
ci(hotfix): Set docs allowed to fail
ko3n1g Oct 31, 2025
f559059
Ko3n1g/ci/test iteration time (#2067)
ko3n1g Oct 31, 2025
818e072
ci(hotfix): Remove performance for ckpt-resume
ko3n1g Oct 31, 2025
f248fcb
Allow inference test throughput to vary by 10% (#2070)
mathemakitten Oct 31, 2025
e715d2f
ci(hotfix): Inference test pipeline
ko3n1g Oct 31, 2025
aad8761
chore: Fix autoformatter (#2073)
ko3n1g Oct 31, 2025
e3ae351
ci(hotfix): Remove iteration-time from t5
ko3n1g Oct 31, 2025
87cbe76
ci(hotfix): disable inference test
ko3n1g Nov 1, 2025
d0d00b3
ci(hotfix): Disable inference test
ko3n1g Nov 2, 2025
88e3a8a
ci(hotfix): Bypass approvalbot in merge-queue (#2082)
ko3n1g Nov 2, 2025
53305bc
ci(hotfix): Enable merge-group for approval bot
ko3n1g Nov 2, 2025
7c16ca0
chore: Update local tooling (#2066)
ko3n1g Nov 2, 2025
dc7a0ca
Add extra RL files (#2077)
tdene Nov 2, 2025
5cfad7b
Prevent summary jobs from running in forks (#2083)
tdene Nov 2, 2025
ba21b69
ci: Fix test scope (#2091)
ko3n1g Nov 2, 2025
7ca2890
ci(hotfix): Remove publish workflows
ko3n1g Nov 3, 2025
a652e2c
Refactor the attention metadata into separate classes (#2001)
kanz-nv Nov 3, 2025
65cd27c
Guard against incorrectly using MoE prefill graphs (#2030)
tdene Nov 3, 2025
d3f1af4
Revert "Refactor the attention metadata into separate classes (#2001)"
ko3n1g Nov 3, 2025
5671e3a
Run mr-slim tests in lightweight-mode (#2106)
chtruong814 Nov 3, 2025
7487c53
Inference | Lazy compile UVM allocator. (#1977)
lmcafee-nvidia Nov 3, 2025
1307f87
chore: Reenable trustees (#2108)
ko3n1g Nov 3, 2025
282b74c
Revert "Inference | Lazy compile UVM allocator. (#1977)"
ko3n1g Nov 3, 2025
2cab46f
ci(fix): Changeset of copyright checker (#2110)
ko3n1g Nov 3, 2025
d4194b7
Ko3n1g/chore/update release settings (#2097)
ko3n1g Nov 3, 2025
5dee638
Remove unnecessary check on rotary_pos_cos (#2003)
santhnm2 Nov 4, 2025
aecce9e
(Reverted) Inference | Lazy compile UVM allocator. (#2125)
lmcafee-nvidia Nov 4, 2025
1586563
Refactor Attention Metadata to Separate Classes (#2112)
kanz-nv Nov 4, 2025
46e066b
Refactor model_provider to model_builder format for ModelOpt examples…
AAnoosheh Nov 5, 2025
26b2eb5
wandb Inference stats logging (#2026)
wdykas Nov 5, 2025
9be6d47
Make `PipelineParallelLayout` always return str from ` __repr__` (#2055)
ananthsub Nov 5, 2025
a32ff75
Add flash_attn_3 as first option for FA3 import (#2010)
santhnm2 Nov 5, 2025
f119a06
Add debugging hint for case when cudagraphs are created but no matchi…
mathemakitten Nov 5, 2025
eb48e81
ci: LTS container (#2133)
ko3n1g Nov 5, 2025
75f87c2
Revert "ci: LTS container (#2133)"
ko3n1g Nov 5, 2025
08c3771
Fix param init (#2033)
cuichenx Nov 5, 2025
f150f42
Hotfix to unit tests on hopper FA3 (#2143)
tdene Nov 5, 2025
10146c6
Add BytesIO to safe_globals (#2074)
tdene Nov 6, 2025
f167a85
add deprecation warning for legacy tokenizer system (#2145)
dimapihtar Nov 6, 2025
23a1dca
replay: ci: Bump LTS container (#2157)
ko3n1g Nov 6, 2025
0abff08
Hotfix to unit tests on hopper FA3 (bis) (#2179)
tdene Nov 7, 2025
0981e3c
Fix has_modelopt_state() for native Torch checkpoint format (#2160)
AAnoosheh Nov 7, 2025
c63b921
chore: Remove codeowners (#2175)
ko3n1g Nov 7, 2025
9aa14ed
Fix FP8 inference with sequence parallelism (#2009)
santhnm2 Nov 7, 2025
0f8fb9b
Replace ModelOpt generation server (#2147)
AAnoosheh Nov 7, 2025
e07c4a4
Add hybrid model support for dynamic inference engine (#1907)
santhnm2 Nov 7, 2025
82e846d
Async task and event loop safety in Megatron Core (#2025)
tdene Nov 10, 2025
c193bf5
Rename skip_prompt_log_probs (#2181)
tdene Nov 10, 2025
d6979d6
Dynamic inference context | UVM only. (#1983)
lmcafee-nvidia Nov 10, 2025
a59223d
Update copy-pr-bot.yaml [skip ci]
ko3n1g Nov 10, 2025
7055186
Revert "Dynamic inference context | UVM only. (#1983)"
ko3n1g Nov 10, 2025
75f7d50
ci: Run `auto-update-copy-pr-bot` only on forks (#2191)
ko3n1g Nov 10, 2025
2fef6bb
Inference throughput tests: refactor goldens to be in list format (#2…
mathemakitten Nov 10, 2025
1f6cde8
Enable TE custom quantization recipe (#2005)
negvet Nov 11, 2025
0acf6c2
Add MoE parameters to ModelOpt pruning example + conf fixes (#2205)
kevalmorabia97 Nov 11, 2025
49061f1
Add repr to pg collection class (#2089)
yashaswikarnati Nov 11, 2025
265af20
Move `data_samplers.py` from `legacy` to `training.datasets` & add `D…
asolergi-nv Nov 11, 2025
d82a6d8
Fix Megatron-FSDP checkpoint save failure (#2138)
shjwudp Nov 12, 2025
bcf2a59
Fix moe CODEOWNERS. (#2200)
jaredcasper Nov 12, 2025
08360ec
chore: Update LICENSE (#2219)
ko3n1g Nov 12, 2025
45b40bb
remove `megatron.training` dependency from `megatron.core` for FSDP c…
ananthsub Nov 12, 2025
909c746
Revert "remove `megatron.training` dependency from `megatron.core` fo…
ko3n1g Nov 12, 2025
7db8ae4
Tensorize dynamic inference mixed sampling (#2105)
tdene Nov 12, 2025
ac9221d
Revert "Tensorize dynamic inference mixed sampling (#2105)"
ko3n1g Nov 12, 2025
989d13e
Add unit test for inference DP coordinator (#2187)
tdene Nov 12, 2025
bb5a0fd
Inference linear layer (#1908)
sidsingh-nvidia Nov 12, 2025
34932c7
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Nov 13, 2025
b958982
ci(hotfix): Auto-update copy-pr-bot
github-actions[bot] Nov 13, 2025
dbc4a4f
chore: Prefer Nvidia email addresses for reminder bot (#2221)
ko3n1g Nov 13, 2025
aa4ec99
[Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter (…
shjwudp Nov 13, 2025
9d91916
Remove qwen symlink to fix for case-insensitive FS (#2235)
kevalmorabia97 Nov 13, 2025
7d3f4a0
Optimizer refactor: clean up public `get_megatron_optimizer` interfac…
deepakn94 Nov 13, 2025
9fd43fa
Fix CI for PR#1983 (#2245)
lmcafee-nvidia Nov 13, 2025
70f85eb
Enable kv cache in training for eagle (#1895)
yeyu-nvidia Nov 13, 2025
b7ef391
Fix aux-loss logging for hybrid models (#2197)
deepakn94 Nov 13, 2025
610a75e
Update flops calculation (for throughput) for hybrid MoEs (#2198)
deepakn94 Nov 13, 2025
2751749
Add MoE layer type to hybrid models (#2196)
deepakn94 Nov 13, 2025
9be7c7b
Tensorize dynamic inference mixed sampling (bis) (#2231)
tdene Nov 14, 2025
c4f83f0
Revert "Add MoE layer type to hybrid models (#2196)"
ko3n1g Nov 14, 2025
41eecc4
ci(hotfix): Checkout repo before install check
ko3n1g Nov 14, 2025
c4ba666
chore: Fix codeowners (#2264)
ko3n1g Nov 15, 2025
4696d42
Allow loading checkpoint from iteration 0 (#2199)
ananthsub Nov 17, 2025
a2d8519
ci: Skip install test in merge queue (#2281)
chtruong814 Nov 17, 2025
9a1c0d0
Add MoE layer type to hybrid models (#2259)
deepakn94 Nov 18, 2025
3df2009
Add the Hybrid-EP backend to the Flex Dispatcher (#2176)
Autumn1998 Nov 18, 2025
e8b9df1
[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding (#1985)
zhongbozhu Nov 18, 2025
a755887
Update ModelOpt example readmes and advanced usage (#2273)
kevalmorabia97 Nov 18, 2025
dcd3b39
Fix UVM compatibility with CUDA 13. (#2243)
lmcafee-nvidia Nov 18, 2025
5e3fa28
ci: Add flaky marker to LTS tests (#2290)
ko3n1g Nov 18, 2025
29eed5d
Dynamic engine suspend/resume via prefill. (#1982)
lmcafee-nvidia Nov 18, 2025
3b83c3f
Revert "Dynamic engine suspend/resume via prefill. (#1982)"
ko3n1g Nov 18, 2025
19d0422
fix: Pass the timeout argument for the EP group (#2268)
yanring Nov 19, 2025
efdc681
JIT for MoE router and preprocess (#1919)
yaox12 Nov 19, 2025
00884a8
Hotfix to CI, until the fix gets reviewed (#2298)
tdene Nov 19, 2025
f885d9c
Add functional test for DP coordinator throughput (#2189)
tdene Nov 19, 2025
70db86a
Add asyncio Queue like in Python 3.13 (#2224)
tdene Nov 19, 2025
744505e
Fixes for PR#1982 (#2303)
lmcafee-nvidia Nov 19, 2025
314a378
Fix PP KV cache allocation and enable multi-node PP inference (#2182)
santhnm2 Nov 19, 2025
21968ea
Revert active-buffer-size-gb arg name. (#2257)
lmcafee-nvidia Nov 19, 2025
712dff8
feat: check: api backwards compatibility (#2251)
pablo-garay Nov 19, 2025
6c8cdd5
Add MambaInferenceStateConfig dataclass (#2265)
santhnm2 Nov 19, 2025
dc473f9
Fix typo in inference example (#2311)
santhnm2 Nov 20, 2025
7dec856
feat: initialization of API backward compatibility verification (#2310)
pablo-garay Nov 20, 2025
e4b7259
Fix Mamba TP and remove confusing legacy initialization (#2202)
jaredcasper Nov 20, 2025
8463257
Refactor KD to use ModelOpt plugins file (#2305)
AAnoosheh Nov 20, 2025
7e18da2
Revert "Refactor KD to use ModelOpt plugins file (#2305)"
ko3n1g Nov 20, 2025
8e830a1
Fix dynamic context syntax and remove redundant tensors (#2336)
kanz-nv Nov 20, 2025
ce31c5a
Working
ArEsKay3 Nov 19, 2025
44596ff
Add RL unit tests
tdene Nov 20, 2025
142da67
OPTIONAL COMMIT: Compatibility with main
tdene Nov 20, 2025
831c142
OPTIONAL COMMIT: Do not depend on zmq_control
tdene Nov 20, 2025
c1726bc
OPTIONAL COMMIT: Debugging prints
tdene Nov 20, 2025
475d7fa
Improve asyncio exception handling (#2300)
tdene Nov 20, 2025
39529e2
Merge remote-tracking branch 'gh/main' into tde/rl_4_out_of_4
tdene Nov 20, 2025
749396d
Revert "OPTIONAL COMMIT: Debugging prints"
tdene Nov 20, 2025
c07148a
asyncio loop debug
tdene Nov 20, 2025
a2460c6
Address reviewer comments
tdene Nov 20, 2025
5ab6392
ci: Upload to testpypi only on main (#2342)
ko3n1g Nov 21, 2025
0634924
implement graph config (#2203)
kanz-nv Nov 21, 2025
ddc55cd
Revert "implement graph config (#2203)"
ko3n1g Nov 21, 2025
f7fb5ec
feat: required check adjustment (#2350)
pablo-garay Nov 21, 2025
536af6c
Address reviewer comments
tdene Nov 21, 2025
70cb117
Add copyright
tdene Nov 21, 2025
0c4fadc
Merge remote-tracking branch 'gh/main' into tde/rl_4_out_of_4
tdene Nov 21, 2025
2e1e183
Address reviewer comments
tdene Nov 21, 2025
f426230
Change default baseline commit for api compat check
pablo-garay Nov 21, 2025
721eff6
Merge branch 'main' into tde/rl_4_out_of_4
tdene Nov 21, 2025
ec4fb25
Change priority of loop returned by get_asyncio_loop to prioritize re…
Nov 21, 2025
5784ab7
Merge branch 'tde/rl_4_out_of_4' of github.com:tdene/Megatron-LM into…
Nov 21, 2025
f07cb14
fix: load iteration 0 for release checkpoints (#2351)
ananthsub Nov 21, 2025
81a87e2
Break apart dynamic inference step into 2 methods (#2192)
tdene Nov 21, 2025
c90160d
Bugfix for Mamba with Chunked-Prefill (#2293)
sidsingh-nvidia Nov 21, 2025
0f78d05
mocking mpu so get_expert_data_parallel_world_size is defined for git…
jalbericiola Nov 21, 2025
c9d2c8f
Explicitly zero out padding token activations for dynamic inference (…
santhnm2 Nov 21, 2025
63d4e7d
Refactor KD to use ModelOpt plugins file (v2) (#2355)
AAnoosheh Nov 21, 2025
29a810e
Prevent unnecessarily overwriting the default Hugging Face chat templ…
santhnm2 Nov 21, 2025
7994405
add FIM dataset support (#2291)
dimapihtar Nov 21, 2025
a6d7ac1
Merge remote-tracking branch 'gh/main' into tde/rl_4_out_of_4
tdene Nov 21, 2025
e29f89c
Merge
tdene Nov 21, 2025
e35495d
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Nov 22, 2025
fa549a2
Merge branch 'main' into tde/rl_4_out_of_4
tdene Nov 22, 2025
47 changes: 25 additions & 22 deletions .github/CODEOWNERS
@@ -1,47 +1,50 @@
megatron/core @NVIDIA/core-adlr @NVIDIA/core-nemo
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/models/gpt/ @NVIDIA/gpt
megatron/core/models/gpt/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/gpt

megatron/core/models/multimodal/ @NVIDIA/multi-modal
megatron/core/models/multimodal/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/multi-modal

megatron/core/models/mamba/ @NVIDIA/hybrid-mamba
megatron/core/models/mamba/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba
megatron/core/ssm/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba

megatron/core/dist_checkpointing/ @NVIDIA/dist-checkpointing
megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/optimizer/distrib_optimizer/ @NVIDIA/dist-optimizer
megatron/core/distributed/fsdp/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/inference/modelopt_support @NVIDIA/quantization-and-inference
megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

# megatron/core/datasets/ @NVIDIA/datasets
megatron/core/dist_checkpointing/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-checkpointing

megatron/core/pipeline_parallel/ @NVIDIA/pipeline-parallelism
megatron/core/optimizer/distrib_optimizer/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-optimizer

megatron/core/inference/modelopt_support @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/quantization-and-inference

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/pipeline_parallel/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/pipeline-parallelism

megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-devtech
megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech

# megatron/core/inference/ @NVIDIA/inference
megatron/core/inference/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/inference

megatron/core/parallel_state.py @NVIDIA/core-nemo
megatron/core/parallel_state.py @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/post_training/ @NVIDIA/post-training
megatron/post_training @NVIDIA/post-training
megatron/core/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

megatron/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

.gitlab/ @NVIDIA/ci
.github/ @NVIDIA/ci
.gitlab-ci.yml @NVIDIA/ci
docker/ @NVIDIA/ci
tests/unit_tests/run_ci_test.sh @NVIDIA/ci
tests/test_utils/python_scripts/
tests/functional_tests/python_test_utils/ @NVIDIA/ci
tests/functional_tests/shell_test_utils/ @NVIDIA/ci
megatron/core/transformer/transformer_block.py @NVIDIA/ci
megatron/core/transformer/transformer_layer.py @NVIDIA/ci
tests/functional_tests/test_cases/ @NVIDIA/ci
tests/functional_tests/recipes/ @NVIDIA/ci
tests/unit_tests/ @NVIDIA/ci
tests/test_utils/recipes/ @NVIDIA/ci
tests/unit_tests/run_ci_test.sh @NVIDIA/ci

megatron/rl/ @NVIDIA/reinforcement-learning
examples/rl/ @NVIDIA/reinforcement-learning
test/unit_tests/test_rl_utils.py @NVIDIA/reinforcement-learning
train_rl.py @NVIDIA/reinforcement-learning
train_rl.py @NVIDIA/reinforcement-learning
23 changes: 17 additions & 6 deletions .github/actions/action.yml
@@ -78,7 +78,7 @@ runs:
export PYTHONPATH=$(pwd)
export NEMORUN_HOME=$(pwd)
pip install --no-cache-dir uv
uv sync --only-group test
uv sync --only-group test
uv run python tests/test_utils/python_scripts/launch_nemo_run_workload.py \
--scope unit-tests \
--model unit-tests \
@@ -90,7 +90,7 @@ runs:

RUN_TEST_EOF
)
echo "$cmd" | tee "job.sh"
echo "$cmd" | tee "job.sh"
echo "::endgroup::"

- name: Get PR info
@@ -125,23 +125,34 @@ runs:
#!/bin/bash
set -euxo pipefail

if [ "${{ steps.has-run-tests-label.outputs.main }}" == "true" ]; then
ARGS=(
--scope mr-github
--enable-lightweight-mode
)
else
ARGS=(
--scope mr-slim
--enable-lightweight-mode
)
fi

export PYTHONPATH=$(pwd)
export NEMORUN_HOME=$(pwd)
pip install --no-cache-dir uv
uv sync --only-group test
uv sync --only-group test
uv run python tests/test_utils/python_scripts/launch_nemo_run_workload.py \
--scope mr \
${ARGS[@]} \
--model ${{ inputs.model }} \
--test-case ${{ inputs.test_case }} \
--environment dev \
--platform dgx_h100 \
--container-image ${{ inputs.container-image }} \
--data-dir /mnt/datadrive/TestData/megatron-lm/artifacts \
--enable-lightweight-mode

RUN_TEST_EOF
)
echo "$cmd" | tee "job.sh"
echo "$cmd" | tee "job.sh"
echo "::endgroup::"

- name: Set timeout
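The hunk above replaces the fixed `--scope mr` flag with an `ARGS` array chosen by a label check. A minimal standalone sketch of that pattern follows. The `build_args` helper and the `HAS_RUN_TESTS_LABEL` variable are hypothetical stand-ins for the step output used in the action, and the quoted `"${ARGS[@]}"` expansion is a suggestion rather than what the diff contains.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: fills the global ARGS array from a "true"/"false" flag,
# mirroring the if/else in the action.
build_args() {
  local has_label=$1
  if [ "$has_label" == "true" ]; then
    ARGS=(--scope mr-github --enable-lightweight-mode)
  else
    ARGS=(--scope mr-slim --enable-lightweight-mode)
  fi
}

# HAS_RUN_TESTS_LABEL is an assumed variable; it defaults to "false" when unset.
build_args "${HAS_RUN_TESTS_LABEL:-false}"

# Quoting "${ARGS[@]}" keeps every array element a separate word.
printf '%s\n' "${ARGS[@]}"
```

With the unquoted `${ARGS[@]}` form used in the action, elements containing spaces or glob characters would be split or expanded; the flags shown here contain neither, so both forms behave the same for this input.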
1 change: 1 addition & 0 deletions .github/copy-pr-bot.yaml
@@ -1,3 +1,4 @@
enabled: true
auto_sync_draft: false
auto_sync_ready: true
trustees_override: ["AAnoosheh", "ArEsKay3", "Autumn1998", "BestJuly", "BoxiangW", "ChenhanYu", "FDecaYed", "HaochenYuan", "ISEEKYAN", "JRD971000", "QiZhangNV", "ShriyaRishab", "Victarry", "Wohox", "ZhiyuLi-Nvidia", "aklife97", "ananthsub", "asolergi-nv", "buptzyb", "chtruong814", "cspades", "cuichenx", "deepakn94", "dimapihtar", "duncanriach", "erhoo82", "ericharper", "fanshiqing", "gautham-kollu", "guyueh1", "hxbai", "jaredcasper", "jiemingz", "jkamalu", "jon-barker", "kanz-nv", "kevalmorabia97", "ko3n1g", "kunlunl", "kvareddy", "layalir", "lhb8125", "lmcafee-nvidia", "maanug-nv", "mathemakitten", "matthieule", "mehraakash", "mkhona-nvidia", "pablo-garay", "parthmannan", "pthombre", "rogerwaleffe", "sanandaraj5597", "santhnm2", "sbak5", "shanmugamr1992", "shifangx", "shjwudp", "sidsingh-nvidia", "skyw", "tdene", "theothermike", "thomasdhc", "trintamaki", "tylerpoon", "wdykas", "xiaoyao0115", "xuwchen", "yanring", "yaox12", "yaoyu-33", "yashaswikarnati", "yeyu-nvidia", "yobibyte", "youngeunkwon0405", "yuzhongw-nvidia", "zhongbozhu"]
157 changes: 157 additions & 0 deletions .github/workflows/_build_test_publish_wheel.yml
@@ -0,0 +1,157 @@
on:
  workflow_call:
    secrets:
      TWINE_USERNAME:
        required: true
      TWINE_PASSWORD:
        required: true

jobs:
  build-and-test-wheels:
    strategy:
      fail-fast: false
      matrix:
        include:
          - PACKAGE: megatron-core
            PLATFORM: arm64
            IMAGE: quay.io/pypa/manylinux_2_28_aarch64
          - PACKAGE: megatron-core
            PLATFORM: amd64
            IMAGE: quay.io/pypa/manylinux_2_28_x86_64
          - PACKAGE: megatron-fsdp
            IMAGE: quay.io/pypa/manylinux_2_28_x86_64
            PLATFORM: amd64
    runs-on: ${{ matrix.PLATFORM == 'amd64' && 'ubuntu-22.04' || 'ubuntu-22.04-arm' }}
    env:
      PACKAGE: ${{ matrix.PACKAGE }}
      IMAGE: ${{ matrix.IMAGE }}
      PLATFORM: ${{ matrix.PLATFORM }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Build wheel
        id: build-wheel
        run: |
          set -x

          PUBLISH_DRYRUN=yes

          if [ "$PACKAGE" = "megatron-core" ]; then
            ROOTDIR="megatron/core"
            BUILD_DIR="."
          elif [ "$PACKAGE" = "megatron-fsdp" ]; then
            ROOTDIR="megatron/core/distributed/fsdp/src/megatron_fsdp"
            BUILD_DIR="megatron/core/distributed/fsdp/src"
          else
            echo Unknown package: $PACKAGE
            exit 1
          fi

          if [ "$PUBLISH_DRYRUN" = "yes" ]; then
            PRE_RELEASE=$(sed -n "s/.*PRE_RELEASE = '\(.*\)'/\1/p" $ROOTDIR/package_info.py)
            sed -i "/^PRE_RELEASE/c\PRE_RELEASE = '${PRE_RELEASE}.dev$((RANDOM % 900000 + 100000))'" $ROOTDIR/package_info.py
          fi

          pushd $BUILD_DIR
          rm LICENSE || true
          docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE bash -c '\
            for python_version in cp310 cp311 cp312 cp313; do \
              /opt/python/${python_version}-${python_version}/bin/pip install --upgrade "setuptools>=80.0.0" build; \
            done && \
            for python_version in cp310 cp311 cp312 cp313; do \
              /opt/python/${python_version}-${python_version}/bin/python -m build; \
            done \
          '

          PLATFORM_WHEELS=$(find dist -name "*.whl" -not -name "*-none-any.whl")
          if [ -n "$PLATFORM_WHEELS" ]; then
            echo "Found platform wheels to repair: $PLATFORM_WHEELS"
            docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE auditwheel repair $PLATFORM_WHEELS
            docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE rm -rf dist/*.whl
            docker run --rm -v $(pwd):/workspace -w /workspace $IMAGE cp -a wheelhouse/* dist/
          fi
          popd

          pushd $ROOTDIR
          EXPECTED_RELEASE_NUMBER=$(python -c "import package_info; print(package_info.__version__)")
          popd

          echo "expected-release-number=$EXPECTED_RELEASE_NUMBER" | tee -a "${GITHUB_OUTPUT}"

          if [ "$PACKAGE" = "megatron-fsdp" ]; then
            mkdir -p dist/
            cp -a megatron/core/distributed/fsdp/src/dist/* dist/
          fi

          ls -al dist/

      - name: Test wheels
        run: |
          ls -al dist/

          if [ "$PACKAGE" = "megatron-core" ]; then
            ROOTPATH="megatron.core"
            WHEEL_PREFIX="megatron_core"
          elif [ "$PACKAGE" = "megatron-fsdp" ]; then
            ROOTPATH="megatron_fsdp"
            WHEEL_PREFIX="megatron_fsdp"
          else
            echo Unknown package: $PACKAGE
            exit 1
          fi

          if [ "$PACKAGE" = "megatron-core" ]; then
            if [[ "$PLATFORM" == "arm64" ]]; then
              for file in dist/$WHEEL_PREFIX*cp310*aarch64.whl; do
                pip install --no-cache-dir "$file"
              done
            else
              for file in dist/$WHEEL_PREFIX*cp310*x86_64.whl; do
                pip install --no-cache-dir "$file"
              done
            fi
          else
            pip install --no-cache-dir dist/$WHEEL_PREFIX*.whl
          fi

          sudo rm -rf megatron/

          RELEASE_NUMBER=$(python -c "import $ROOTPATH; print($ROOTPATH.__version__)")
          test "${{ steps.build-wheel.outputs.expected-release-number }}" == "$RELEASE_NUMBER"

      - name: Upload wheels
        uses: actions/upload-artifact@v4
        with:
          name: wheels-${{ matrix.PACKAGE }}-${{ matrix.PLATFORM }}
          path: dist/

  publish-wheels:
    needs: [build-and-test-wheels]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'main' || 'public' }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - PACKAGE: megatron_core
          - PACKAGE: megatron_fsdp
    env:
      PACKAGE: ${{ matrix.PACKAGE }}
    steps:
      - name: Download wheels
        uses: actions/download-artifact@v4
        with:
          path: dist/
          merge-multiple: true

      - name: Publish wheels
        env:
          TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
          TWINE_REPOSITORY: ${{ (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/r')) && 'pypi' || 'testpypi' }}
        run: |
          ls -al dist/$PACKAGE*
          pip install twine
          twine upload -r $TWINE_REPOSITORY -u $TWINE_USERNAME -p $TWINE_PASSWORD dist/$PACKAGE*
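The dry-run branch of the build step appends a `.devNNNNNN` suffix built from `$((RANDOM % 900000 + 100000))`. The sketch below isolates that expression so its range can be checked. `dev_suffix` and the `rc0` base value are hypothetical; the workflow reads the real `PRE_RELEASE` from `package_info.py`. Because Bash's `RANDOM` only yields values from 0 to 32767, the modulo has no effect and the suffix always falls between 100000 and 132767.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper mirroring the $((RANDOM % 900000 + 100000)) expression.
dev_suffix() {
  echo "dev$((RANDOM % 900000 + 100000))"
}

# Hypothetical base value used only for illustration.
PRE_RELEASE="rc0"
echo "${PRE_RELEASE}.$(dev_suffix)"
```

The suffix is always six digits, but only about 32,768 distinct values are possible, so repeated dry-run uploads could occasionally collide on the same version string.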
63 changes: 63 additions & 0 deletions .github/workflows/auto-update-copy-pr-bot.yml
@@ -0,0 +1,63 @@
name: Auto Update Copy PR Bot

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *"

jobs:
  auto-update-copy-pr-bot:
    runs-on: ubuntu-latest
    environment: nemo-ci
    if: github.repository == 'NVIDIA/Megatron-LM'
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Fetch list of members in mcore-reviewers team
        shell: bash -euxo pipefail {0}
        env:
          GH_TOKEN: ${{ secrets.PAT }}
        run: |
          #!/bin/bash

          get_members() {
            local org=$1 team=$2 seen_file=$3

            gh api "/orgs/$org/teams/$team/members" --paginate --jq '.[].login' >> "$seen_file"

            gh api "/orgs/$org/teams/$team/teams" --paginate --jq '.[].slug' | while read -r child; do
              get_members "$org" "$child" "$seen_file"
            done

            cat "$seen_file"
          }

          tmp=$(mktemp)
          echo "" > final.txt
          get_members "NVIDIA" "mcore-engineers" "$tmp" | sort -u >> final.txt && rm "$tmp"

          tmp=$(mktemp)
          get_members "NVIDIA" "mcore-reviewers" "$tmp" | sort -u >> final.txt && rm "$tmp"

          cat final.txt | jq -sR 'split("\n") | map(select(. != "")) | flatten | unique'

          export TRUSTEES=$(cat final.txt | jq -csR 'split("\n") | map(select(. != "")) | flatten | unique')
          yq '.trustees_override = env(TRUSTEES)' .github/copy-pr-bot.yaml | yq -o yaml > .github/copy-pr-bot.yaml.new

          mv .github/copy-pr-bot.yaml.new .github/copy-pr-bot.yaml

      - name: Commit changes
        env:
          GH_TOKEN: ${{ secrets.PAT }}
        run: |
          git remote set-url origin https://x-access-token:${GH_TOKEN}@github.com/NVIDIA/Megatron-LM.git
          git config --global user.name "GitHub Actions"
          git config --global user.email "github-actions[bot]@users.noreply.github.com"
          git add .github/copy-pr-bot.yaml
          if git diff --cached --exit-code --quiet; then
            echo "No changes to commit. Exiting gracefully."
            exit 0
          fi
          git commit -m "Update copy-pr-bot.yaml [skip ci]"
          git push -u origin main
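The `get_members` function in this workflow walks a team and its nested child teams, then relies on `sort -u` to drop duplicates. The sketch below shows the same recursive walk without network access. The `MEMBERS` and `CHILDREN` tables, the team names, and the logins are all hypothetical stand-ins for the `gh api` responses, and the sketch prints members directly instead of accumulating them in a temporary file as the workflow does.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical in-memory team data standing in for the `gh api` calls.
declare -A MEMBERS=( [parent]="alice bob" [child]="bob carol" )
declare -A CHILDREN=( [parent]="child" [child]="" )

# Print every member of a team and of its nested child teams, one per line.
# Duplicates are expected here and are removed by the caller with `sort -u`.
collect_members() {
  local team=$1 child
  # Unquoted on purpose: split the space-separated member list into words.
  printf '%s\n' ${MEMBERS[$team]}
  for child in ${CHILDREN[$team]}; do
    collect_members "$child"
  done
}

collect_members parent | sort -u
```

Deduplicating once at the end keeps the recursive function simple, which matches the way the workflow pipes `get_members` into `sort -u` before writing `final.txt`.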