Commit 94fed79

Merge branch 'main' into fix-simple-train-loop
2 parents fe1fffa + 4c2768a commit 94fed79

56 files changed, +10305 -1375 lines changed

.github/pull_request_template.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# What does this PR do ?
2+
<!-- Add a one line overview of what this PR aims to accomplish. -->
3+
4+
:warning: For major changes (either in lines of code or in impact), please first share and discuss a design doc with the team.
5+
6+
## Contribution process
7+
8+
```mermaid
9+
flowchart LR
10+
A[Pre-checks] --> B[PR Tests]
11+
subgraph Code Review/Approval
12+
C1[Expert Review] --> C2[Final Review]
13+
end
14+
B --> C1
15+
C2 --> D[Merge]
16+
```
17+
18+
### Pre-checks
19+
20+
- [ ] I want this PR in a versioned release and have added the appropriate Milestone (e.g., `Core 0.8`)
21+
- [ ] I have added relevant unit tests
22+
- [ ] I have added relevant functional tests
23+
- [ ] I have added proper typing to my code (see the [typing guidelines](https://docs.python.org/3/library/typing.html))
24+
- [ ] I have added relevant documentation
25+
- [ ] I have run [autoformat.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/autoformat.sh) on my PR
26+
27+
### Code review
28+
29+
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.
30+
31+
<details>
32+
<summary>For MRs into `main` branch</summary>
33+
34+
#### (Step 1): Add PR label `Expert Review`
35+
36+
#### (Step 2): Collect the expert reviewers' reviews
37+
38+
1. Attach the `Expert Review` label when your PR is ready for review.
39+
2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.
40+
41+
:warning: Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
42+
Final Review might get declined if these requirements are not fulfilled.
43+
44+
#### (Step 3): Final Review
45+
46+
1. Add `Final Review` label
47+
2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.
48+
49+
#### (Optional Step 4): Cherry-pick into release branch
50+
51+
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.
52+
53+
</details>
54+
55+
<details>
56+
<summary>For MRs into `dev` branch</summary>
57+
The proposed review process for `dev` branch is under active discussion.
58+
59+
MRs are mergeable after one approval by either `[email protected]` or `[email protected]`.
60+
</details>
61+
62+
### Merging your PR
63+
64+
Any member of [`core-adlr`](https://github.com/orgs/teams/NVIDIA/core-adlr) or [`core-nemo`](https://github.com/orgs/teams/NVIDIA/core-nemo) will be able to merge your PR.
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
1+
name: Auto-assign Milestone to PR
2+
3+
on:
4+
push:
5+
branches:
6+
- "pull-request/[0-9]+"
7+
8+
permissions:
9+
contents: read
10+
pull-requests: write
11+
issues: write
12+
13+
jobs:
14+
assign-milestone:
15+
runs-on: ubuntu-latest
16+
environment: nemo-ci
17+
steps:
18+
- name: Get PR info
19+
id: get-pr-info
20+
if: startsWith(github.ref, 'refs/heads/pull-request/')
21+
uses: nv-gha-runners/get-pr-info@main
22+
23+
- name: Check if PR has milestone
24+
id: check_milestone
25+
env:
26+
GH_TOKEN: ${{ secrets.PAT }}
27+
run: |
28+
MILESTONE=$(gh pr view ${{ fromJSON(steps.get-pr-info.outputs.pr-info || '{}').number }} \
29+
--repo ${{ github.repository }} \
30+
--json milestone \
31+
--jq '.milestone.title')
32+
33+
if [ "$MILESTONE" = "null" ] || [ -z "$MILESTONE" ]; then
34+
echo "has_milestone=false" >> $GITHUB_OUTPUT
35+
else
36+
echo "has_milestone=true" >> $GITHUB_OUTPUT
37+
echo "PR already has milestone: $MILESTONE"
38+
fi
39+
40+
- name: Get most recent open milestone
41+
if: steps.check_milestone.outputs.has_milestone == 'false'
42+
id: get_milestone
43+
env:
44+
GH_TOKEN: ${{ secrets.PAT }}
45+
run: |
46+
# Get the most recent open milestone (sorted by due date, descending)
47+
MILESTONE_NUMBER=$(gh api \
48+
"repos/${{ github.repository }}/milestones?state=open&sort=due_on&direction=desc" \
49+
--jq '.[0].number')
50+
51+
MILESTONE_TITLE=$(gh api \
52+
"repos/${{ github.repository }}/milestones?state=open&sort=due_on&direction=desc" \
53+
--jq '.[0].title')
54+
55+
if [ -z "$MILESTONE_NUMBER" ] || [ "$MILESTONE_NUMBER" = "null" ]; then
56+
echo "No open milestones found"
57+
echo "milestone_found=false" >> $GITHUB_OUTPUT
58+
else
59+
echo "milestone_found=true" >> $GITHUB_OUTPUT
60+
echo "milestone_number=$MILESTONE_NUMBER" >> $GITHUB_OUTPUT
61+
echo "milestone_title=$MILESTONE_TITLE" >> $GITHUB_OUTPUT
62+
echo "Found milestone: $MILESTONE_TITLE (number: $MILESTONE_NUMBER)"
63+
fi
64+
65+
- name: Assign milestone to PR
66+
if: steps.check_milestone.outputs.has_milestone == 'false' && steps.get_milestone.outputs.milestone_found == 'true'
67+
env:
68+
GH_TOKEN: ${{ secrets.PAT }}
69+
run: |
70+
gh pr edit ${{ fromJSON(steps.get-pr-info.outputs.pr-info || '{}').number }} \
71+
--repo ${{ github.repository }} \
72+
--milestone "${{ steps.get_milestone.outputs.milestone_title }}"
73+
74+
echo "✅ Assigned milestone '${{ steps.get_milestone.outputs.milestone_title }}' to PR #${{ fromJSON(steps.get-pr-info.outputs.pr-info || '{}').number }}"
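
The three steps of this workflow reduce to a small decision: keep an existing milestone, assign the first open one from the sorted list, or do nothing when no open milestone exists. The following is a minimal Python sketch of that decision only, not the workflow's implementation. The function names (`needs_milestone`, `pick_milestone`, `plan`) and the returned strings are hypothetical, and the milestone list is assumed to arrive already sorted, as the `gh api` query requests.

```python
from typing import Dict, List, Optional


def needs_milestone(current_title: Optional[str]) -> bool:
    """Mirror the shell check: an empty or literal "null" title means no milestone."""
    return current_title is None or current_title in ("", "null")


def pick_milestone(open_milestones: List[Dict]) -> Optional[Dict]:
    """Return the first entry of the already-sorted list, or None if it is empty."""
    return open_milestones[0] if open_milestones else None


def plan(current_title: Optional[str], open_milestones: List[Dict]) -> str:
    """Describe what the workflow would do (hypothetical labels, for illustration)."""
    if not needs_milestone(current_title):
        return "keep"
    chosen = pick_milestone(open_milestones)
    if chosen is None:
        return "none-found"
    return f"assign:{chosen['title']}"
```

For example, `plan("null", [{"number": 3, "title": "Core 0.9"}])` yields an assignment of `Core 0.9`, while a PR that already carries a milestone is left untouched.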

.github/workflows/cherry-pick-release-commit.yml

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,9 @@ on:
2020

2121
jobs:
2222
cherry-pick:
23-
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected]
23+
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected]
24+
with:
25+
target-branches-pattern: 'core_(*dev_)?r[0-9]+\.[0-9]+\.[0-9]+'
2426
secrets:
2527
PAT: ${{ secrets.PAT }}
2628
SLACK_WEBHOOK_ADMIN: ${{ secrets.SLACK_WEBHOOK_ADMIN }}
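
To illustrate which branch names a pattern like `target-branches-pattern` is presumably meant to select, here is a hedged Python sketch. It assumes the intent is an optional `dev_` segment and writes that as `(dev_)?`; it also assumes whole-name matching. Both are assumptions for illustration, since the diff does not show how the reusable workflow interprets the pattern. The helper name `is_release_branch` is hypothetical.

```python
import re

# Assumption: the optional segment is "dev_", expressed here as (dev_)?.
# Assumption: the whole branch name must match (fullmatch).
RELEASE_BRANCH = re.compile(r"core_(dev_)?r[0-9]+\.[0-9]+\.[0-9]+")


def is_release_branch(name: str) -> bool:
    """Return True if the branch name looks like core_r1.2.3 or core_dev_r1.2.3."""
    return RELEASE_BRANCH.fullmatch(name) is not None
```

Under these assumptions, `core_r0.14.0` and `core_dev_r0.14.0` qualify as cherry-pick targets, while `main` and `dev` do not.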

.github/workflows/cicd-approve-test-queue.yml

Lines changed: 6 additions & 2 deletions
@@ -25,7 +25,7 @@ jobs:
2525
environment: main
2626
strategy:
2727
matrix:
28-
branch: [main, dev]
28+
branch: [main, dev, others]
2929
steps:
3030
- name: Checkout repository
3131
uses: actions/checkout@v4
@@ -44,6 +44,7 @@ jobs:
4444
env:
4545
GITHUB_TOKEN: ${{ secrets.PAT }}
4646
MAX_CONCURRENCY: ${{ vars.MAX_CONCURRENCY || 1 }}
47+
PYTHONUNBUFFERED: 1
4748
shell: python
4849
run: |
4950
import os
@@ -99,7 +100,10 @@ jobs:
99100
return False
100101
101102
base_branch = pr_info.get("base", {}).get("ref")
102-
if base_branch == target_branch:
103+
if (
104+
(base_branch == target_branch) or
105+
(base_branch != "main" and base_branch != "dev" and target_branch == "others")
106+
):
103107
print(f"PR #{pr_number} targets {target_branch}")
104108
return True
105109
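
The new condition adds a catch-all `others` queue for PRs whose base branch is neither `main` nor `dev`. A standalone sketch of the same predicate follows, so the matrix behavior can be checked in isolation. The function name `matches_queue` is hypothetical; the logic follows the condition in the hunk above.

```python
from typing import Optional


def matches_queue(base_branch: Optional[str], target_branch: str) -> bool:
    """Return True if a PR with this base branch belongs to the given matrix entry."""
    if base_branch == target_branch:
        return True
    # The "others" entry collects every base branch that is not main or dev.
    return target_branch == "others" and base_branch not in ("main", "dev")
```

With the matrix `[main, dev, others]`, a PR into `core_r0.14.0` is picked up only by the `others` entry, and a PR into `main` only by `main`.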

.github/workflows/cicd-main.yml

Lines changed: 13 additions & 4 deletions
@@ -14,7 +14,7 @@
1414
name: CICD Megatron-LM
1515
on:
1616
schedule:
17-
- cron: "0 */2 * * *"
17+
- cron: 0 0 * * *
1818
push:
1919
branches:
2020
- dev
@@ -23,6 +23,7 @@ on:
2323
- "deploy-release/*"
2424
merge_group:
2525
types: [checks_requested]
26+
workflow_dispatch:
2627

2728
concurrency:
2829
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-${{ github.event.label.name || 'main' }}-${{ github.event_name }}
@@ -148,7 +149,7 @@ jobs:
148149
149150
pre-flight:
150151
needs: [is-not-external-contributor]
151-
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected].5
152+
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected].10
152153

153154
linting:
154155
runs-on: ubuntu-latest
@@ -318,7 +319,7 @@ jobs:
318319
- name: Parse unit tests
319320
id: parse-unit-tests
320321
run: |
321-
cat tests/test_utils/recipes/unit-tests.yaml | yq -o json '[.products[].test_case[] | { "bucket": .}]' | jq -c > unit-tests.json
322+
cat tests/test_utils/recipes/unit-tests.yaml | yq -o json '[.products[].test_case[] | { "bucket": .}] | sort_by(.model, .test_case)' | jq -c > unit-tests.json
322323
echo "unit-tests=$(cat unit-tests.json)" | tee -a $GITHUB_OUTPUT
323324
324325
cicd-unit-tests-latest:
@@ -366,6 +367,14 @@ jobs:
366367
- cicd-wait-in-queue
367368
- cicd-container-build
368369
- cicd-unit-tests-latest
370+
if: |
371+
(
372+
success()
373+
|| needs.pre-flight.outputs.is_ci_workload == 'true'
374+
|| needs.pre-flight.outputs.force_run_all == 'true'
375+
)
376+
&& needs.pre-flight.outputs.is_merge_group == 'false'
377+
&& !cancelled()
369378
outputs:
370379
integration-tests: ${{ steps.main.outputs.integration-tests }}
371380
steps:
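
The added `if:` expression combines three conditions: at least one of "upstream jobs succeeded", `is_ci_workload`, or `force_run_all` must hold; the run must not be a merge group; and the run must not be cancelled. As a rough Python sketch of that boolean structure (the function name and parameters are hypothetical, and the pre-flight outputs are modeled as strings because the expression compares them against `'true'` and `'false'`):

```python
def should_run(
    upstream_succeeded: bool,
    is_ci_workload: str,
    force_run_all: str,
    is_merge_group: str,
    cancelled: bool,
) -> bool:
    """Evaluate the job condition with pre-flight outputs passed as strings."""
    any_trigger = (
        upstream_succeeded
        or is_ci_workload == "true"
        or force_run_all == "true"
    )
    return any_trigger and is_merge_group == "false" and not cancelled
```

One consequence worth noting: because the comparison is against the literal string `false`, an empty `is_merge_group` output makes the condition false in this sketch.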
@@ -490,7 +499,7 @@ jobs:
490499
env:
491500
GH_TOKEN: ${{ github.token }}
492501
RUN_ID: ${{ github.run_id }}
493-
SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' || needs.pre-flight.outputs.is_merge_group == 'true' }}
502+
SKIPPING_IS_ALLOWED: ${{ needs.pre-flight.outputs.docs_only == 'true' || needs.pre-flight.outputs.is_deployment_workflow == 'true' || needs.pre-flight.outputs.is_merge_group == 'true' || needs.pre-flight.outputs.is_ci_workload == 'true' }}
494503
run: |
495504
FAILED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion == "failure")] | length') || echo 0
496505
SKIPPED_JOBS=$(gh run view $GITHUB_RUN_ID --json jobs --jq '[.jobs[] | select(.status == "completed" and .conclusion == "skipped")] | length') || echo 0
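
The two `jq` filters above count completed jobs by conclusion. A minimal Python equivalent is sketched below, assuming the payload has the shape implied by those filters (a top-level `jobs` array whose entries carry `status` and `conclusion`). The function name `count_jobs` is hypothetical.

```python
import json


def count_jobs(payload: str, conclusion: str) -> int:
    """Count completed jobs with the given conclusion in a `gh run view --json jobs` payload."""
    jobs = json.loads(payload).get("jobs", [])
    return sum(
        1
        for job in jobs
        if job.get("status") == "completed" and job.get("conclusion") == conclusion
    )
```

Unlike the shell form, where `|| echo 0` prints `0` instead of assigning it to the variable when the command fails, this sketch always returns an integer for a well-formed payload.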

.github/workflows/community-bot.yml

Lines changed: 2 additions & 1 deletion
@@ -21,6 +21,7 @@ on:
2121

2222
jobs:
2323
community-bot:
24-
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_community_bot.yml@v0.49.1
24+
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_community_bot.yml@v0.65.10
2525
secrets:
2626
GH_TOKEN: ${{ secrets.PAT }}
27+
environment: main

.github/workflows/copyright-check.yml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ jobs:
3131
if: |
3232
!(needs.pre-flight.outputs.docs_only == 'true'
3333
|| needs.pre-flight.outputs.is_deployment_workflow == 'true')
34-
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected].9
34+
uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/[email protected].11
3535

3636
copyright-check-summary:
3737
needs: [pre-flight, copyright-check]

.gitlab-ci.yml

Lines changed: 6 additions & 2 deletions
@@ -18,6 +18,8 @@ workflow:
1818
- if: $CI_PROJECT_NAMESPACE != "ADLR" || ($CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_PROJECT_PATH != "ADLR/megatron-lm")
1919
when: never
2020

21+
- if: $CI_PIPELINE_SOURCE == "schedule" && ($CI_COMMIT_BRANCH == 'ci-approve-dev' || $CI_COMMIT_BRANCH == 'ci-approve-main')
22+
2123
# ci-branches only for schedule
2224
- if: $CI_COMMIT_BRANCH =~ /ci-/ && $CI_PIPELINE_SOURCE != "schedule"
2325
when: never
@@ -31,15 +33,15 @@ workflow:
3133
- if: $CI_PIPELINE_SOURCE == "web"
3234

3335
# For push to main
34-
- if: $CI_PIPELINE_SOURCE == 'push' && ($CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "dev")
36+
- if: $CI_PIPELINE_SOURCE == 'push' && ($CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "dev" || $CI_COMMIT_BRANCH =~ /^core_/)
3537
variables:
3638
UNIT_TEST: "no"
3739
INTEGRATION_TEST: "no"
3840
FUNCTIONAL_TEST: "yes"
3941
FUNCTIONAL_TEST_SCOPE: mr
4042
FUNCTIONAL_TEST_REPEAT: 5
4143
FUNCTIONAL_TEST_RECORD_CHECKPOINTS: "no"
42-
FUNCTIONAL_TEST_TIME_LIMIT: 2700
44+
FUNCTIONAL_TEST_TIME_LIMIT: 3600
4345
CLUSTER_A100: ""
4446
CLUSTER_H100: ""
4547
PUBLISH: "no"
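
The widened push rule now also fires for branches whose name starts with `core_`. A short Python sketch of the rule's predicate, for checking which branches get the functional-test variables, is given below. The function name `push_rule_applies` is hypothetical; the regex `^core_` is taken from the rule itself.

```python
import re


def push_rule_applies(pipeline_source: str, commit_branch: str) -> bool:
    """Return True if the push rule for main, dev, or core_* branches matches."""
    if pipeline_source != "push":
        return False
    return (
        commit_branch in ("main", "dev")
        or re.search(r"^core_", commit_branch) is not None
    )
```

For instance, a push to `core_r0.14.0` now matches, while a merge request event on `main` still does not.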
@@ -154,6 +156,8 @@ default:
154156
when: runner_system_failure
155157

156158
variables:
159+
BUILD:
160+
value: "yes"
157161
UNIT_TEST:
158162
value: "yes"
159163
options:

.gitlab/stages/00.pre.yml

Lines changed: 2 additions & 73 deletions
@@ -8,6 +8,7 @@ include:
88
when: always
99
- if: $CI_MERGE_REQUEST_EVENT_TYPE == 'merged_result'
1010
when: always
11+
1112
- when: never
1213
stage: .pre
1314

@@ -20,29 +21,6 @@ include:
2021
- echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
2122
- echo "$CI_REGISTRY_PASSWORD" | docker login $CI_REGISTRY -u $CI_REGISTRY_USER --password-stdin
2223

23-
pre:mirror_to_github:
24-
rules:
25-
- if: '($CI_COMMIT_BRANCH == "main" || $CI_COMMIT_BRANCH == "dev") && $CI_PIPELINE_SOURCE == "push"'
26-
allow_failure: true
27-
- when: never
28-
tags:
29-
- arch/amd64
30-
- env/prod
31-
- origin/jet-fleet
32-
- owner/jet-core
33-
- purpose/utility
34-
- team/megatron
35-
stage: .pre
36-
image: python:3.10
37-
variables:
38-
GIT_STRATEGY: "clone"
39-
script:
40-
- git checkout $CI_COMMIT_BRANCH
41-
- git remote add github https://ko3n1g:[email protected]/NVIDIA/Megatron-LM.git || true
42-
- git push -u github $CI_COMMIT_BRANCH
43-
retry:
44-
max: 2
45-
4624
pre:create_ci_branches:
4725
rules:
4826
- if: '$CI_COMMIT_BRANCH == "main" && $CI_PIPELINE_SOURCE == "push"'
@@ -60,6 +38,7 @@ pre:create_ci_branches:
6038
- branch: ci-upgrade-dependencies
6139
- branch: ci-approve-main
6240
- branch: ci-approve-dev
41+
- branch: ci-sync-branches
6342
tags:
6443
- arch/amd64
6544
- env/prod
@@ -348,53 +327,3 @@ pre:check_status_of_main:
348327
- if: $CI_MERGE_REQUEST_EVENT_TYPE == 'merge_train'
349328
when: always
350329
- when: never
351-
352-
pre:approve_merge_gate:
353-
extends: [.pre_rules]
354-
image: maniator/gh
355-
tags:
356-
- arch/amd64
357-
- env/prod
358-
- origin/jet-fleet
359-
- owner/jet-core
360-
- purpose/utility
361-
- team/megatron
362-
script:
363-
- |
364-
set -eoux pipefail
365-
EXIT_CODE=0
366-
python tests/test_utils/python_scripts/check_status_of_main.py --target-branch "$CI_COMMIT_BRANCH" --once || EXIT_CODE=$?
367-
368-
export GH_TOKEN=$GH_TOKEN
369-
export REPO=NVIDIA/Megatron-LM
370-
export TARGET_BRANCH="$CI_COMMIT_BRANCH"
371-
372-
if [[ $EXIT_CODE -eq 0 ]]; then
373-
STATUS="approved"
374-
COMMENT="Main is healthy. Submitting PR."
375-
else
376-
STATUS="rejected"
377-
COMMENT="Main is not healthy. An automation engineer is investigating. No need to take any action."
378-
fi
379-
380-
gh api "repos/$REPO/actions/runs?status=waiting" --jq '.workflow_runs[].id' \
381-
| while read run_id; do
382-
HEAD_BRANCH=$(gh api "repos/$REPO/actions/runs/$run_id" --jq '.head_branch')
383-
PR_NUMBER="${HEAD_BRANCH##*/}"
384-
if [ -n "$PR_NUMBER" ]; then
385-
PR_BASE=$(gh api "repos/$REPO/pulls/$PR_NUMBER" --jq '.base.ref')
386-
if [ "$PR_BASE" = "$TARGET_BRANCH" ]; then
387-
gh api \
388-
--method POST "repos/$REPO/actions/runs/$run_id/pending_deployments" \
389-
-F "environment_ids[]=$(gh api "repos/$REPO/environments" --jq '.environments[] | select(.name=="merge-gate") | .id')" \
390-
-f state="$STATUS" \
391-
-f comment="$COMMENT";
392-
fi
393-
fi
394-
done
395-
retry:
396-
max: 2
397-
rules:
398-
- if: $CI_PIPELINE_SOURCE == "schedule" && ($CI_COMMIT_BRANCH == 'ci-approve-dev' || $CI_COMMIT_BRANCH == 'ci-approve-main')
399-
when: always
400-
- when: never
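
For reference, the removed `pre:approve_merge_gate` job did two small things before calling the GitHub API: it mapped the health-check exit code to an approval state, and it derived a PR number from the head branch name with `${HEAD_BRANCH##*/}`. The sketch below restates both in Python. The function names are hypothetical, and the branch format `pull-request/<number>` in the usage note is an assumption based on the branch filter used elsewhere in this commit.

```python
from typing import Tuple


def gate_decision(exit_code: int) -> Tuple[str, str]:
    """Map the health-check exit code to the (state, comment) pair used by the job."""
    if exit_code == 0:
        return ("approved", "Main is healthy. Submitting PR.")
    return (
        "rejected",
        "Main is not healthy. An automation engineer is investigating. "
        "No need to take any action.",
    )


def pr_number_from_branch(head_branch: str) -> str:
    """Keep only the text after the last slash, like ${HEAD_BRANCH##*/} in the shell."""
    return head_branch.rsplit("/", 1)[-1]
```

A head branch such as `pull-request/123` yields `123`; a name without a slash is returned unchanged, matching the shell expansion.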
