-
Notifications
You must be signed in to change notification settings - Fork 3.3k
[training migration] add training config dataclass and arg generation utility #2306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
This reverts commit 1e160b4ce1884baa571939771d522bd9ede44c3f.
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
Signed-off-by: Maanu Grover <[email protected]>
| rampup_batch_size: Optional[list[int]] = field(default=None, metadata={"argparse_meta": {"nargs": 3}}) | ||
| """Batch size ramp up with the following values: <start batch size>, <batch size increment>, | ||
| <ramp-up samples> | ||
| For example: | ||
| rampup-batch-size = [16, 8, 300000] | ||
| global-batch-size 1024 | ||
| will start with global batch size 16 and over (1024 - 16) / 8 = 126 intervals will increase | ||
| the batch size linearly to 1024. In each interval we will use approximately | ||
| 300000 / 126 = 2380 samples. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document the argparse_meta metadata schema in more detail in megatron/training/argument_utils.py so it can be used as a reference? otherwise other developers would have to look example usages from these configs
| class TrainingConfig: | ||
| """Configuration settings related to the training loop.""" | ||
|
|
||
| micro_batch_size: Optional[int] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: might as well use the new style of typehints for the new code
| micro_batch_size: Optional[int] = None | |
| micro_batch_size: int | None = None |
| exit_signal: int = int(signal.SIGTERM) | ||
| """Signal for the signal handler to detect.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exit-signal in the existing argparse is a str, and the conversion to the enum happens here:
Megatron-LM/megatron/training/dist_signal_handler.py
Lines 57 to 59 in f426230
| class DistributedSignalHandler: | |
| def __init__(self, sig: str = 'SIGTERM'): | |
| self.sig = SIGNAL_MAP.get(sig, signal.SIGTERM) |
| assert args.numbers == [10, 20, 30] | ||
|
|
||
|
|
||
| class TestArgumentGroupFactoryLiteral: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you also add a test for a field typed with union to test the fallback on the argparse meta?
What does this PR do ?
Contribution process
flowchart LR A[Pre-checks] --> B[PR Tests] subgraph Code Review/Approval C1[Expert Review] --> C2[Final Review] end B --> C1 C2 --> D[Merge]Pre-checks
Core 0.8)Code review
The following process is enforced via the CODEOWNERS file for changes into
megatron/core. For changes outside ofmegatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.For MRs into `main` branch
(Step 1): Add PR label
Expert Review(Step 2): Collect the expert reviewers reviews
Expert Reviewlabel when your PR is ready for review.Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Reviewlabel(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into
core_r*release branches, after this PR has been merged, selectCherry-pickto open a new PR into the release branch.For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
[email protected]or[email protected].Merging your PR
Any member of core-adlr and
core-nemowill be able to merge your PR.