
Conversation

@maanug-nv maanug-nv commented Nov 19, 2025

What does this PR do ?

  1. Create dataclass for settings related to the training loop.
  2. Create an extensible utility to generate argparse arguments from dataclasses

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For PRs into the `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For PRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

PRs are mergeable after one approval by either [email protected] or [email protected].

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

This reverts commit 1e160b4ce1884baa571939771d522bd9ede44c3f.
Signed-off-by: Maanu Grover <[email protected]>
@ko3n1g ko3n1g added this to the Core 0.16 milestone Nov 19, 2025
@maanug-nv maanug-nv added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Nov 19, 2025
@yanring yanring requested a review from Wohox November 21, 2025 02:53
Comment on lines +19 to +28
rampup_batch_size: Optional[list[int]] = field(default=None, metadata={"argparse_meta": {"nargs": 3}})
"""Batch size ramp up with the following values: <start batch size>, <batch size increment>,
<ramp-up samples>
For example:
rampup-batch-size = [16, 8, 300000]
global-batch-size 1024
will start with global batch size 16 and over (1024 - 16) / 8 = 126 intervals will increase
the batch size linearly to 1024. In each interval we will use approximately
300000 / 126 = 2380 samples.
"""
Can you document the argparse_meta metadata schema in more detail in megatron/training/argument_utils.py so it can be used as a reference? Otherwise other developers would have to look at example usages from these configs.

class TrainingConfig:
"""Configuration settings related to the training loop."""

micro_batch_size: Optional[int] = None

nit: might as well use the new style of typehints for the new code

Suggested change
micro_batch_size: Optional[int] = None
micro_batch_size: int | None = None

Comment on lines +64 to +65
exit_signal: int = int(signal.SIGTERM)
"""Signal for the signal handler to detect."""

exit-signal in the existing argparse is a str, and the conversion to the enum happens here:

class DistributedSignalHandler:
def __init__(self, sig: str = 'SIGTERM'):
self.sig = SIGNAL_MAP.get(sig, signal.SIGTERM)
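The reviewer's point is that the field should stay a str to match the existing CLI, with the name-to-signal conversion done by the handler. A self-contained sketch of that behavior (the real `SIGNAL_MAP` definition is not shown in this diff; building it from `signal.Signals` is an assumption):

```python
import signal

# Assumed reconstruction of SIGNAL_MAP: signal name -> signal enum member.
SIGNAL_MAP = {s.name: s for s in signal.Signals}


class DistributedSignalHandler:
    def __init__(self, sig: str = "SIGTERM"):
        # Unknown names fall back to SIGTERM, as in the snippet above.
        self.sig = SIGNAL_MAP.get(sig, signal.SIGTERM)
```

With this shape, `exit_signal` in the dataclass would be typed `str` (e.g. `"SIGTERM"`) rather than `int`, preserving backward compatibility with the existing `--exit-signal` argument.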

assert args.numbers == [10, 20, 30]


class TestArgumentGroupFactoryLiteral:

could you also add a test for a field typed with union to test the fallback on the argparse meta?



Development

Successfully merging this pull request may close these issues:

  • Dataclass Migration Effort
  • Dataclass: TrainingConfig and ValidationConfig

4 participants