
Conversation

Shagun-G
Contributor

Summary:
The refactor creates separate functions for the key steps within the train_step function, to allow easy changes in future work. It creates three new functions:

  1. gradient_computation: computes gradients via accumulation
  2. run_optimizer_step: runs all optimizer steps
  3. compute_global_loss: calculates the global loss for logging

The diff also creates a separate function, should_continue_training(), so that additional termination criteria can be added in the future without changing the train function.
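A minimal, self-contained sketch of the proposed structure (the four function names come from this PR; the bodies below are simplified stand-ins for illustration, not the actual torchtitan implementation):

```python
# Illustrative sketch of the refactor: train_step delegates to
# gradient_computation / run_optimizer_step / compute_global_loss, and
# train() consults should_continue_training(). The scalar "model" and
# squared-error loss here are hypothetical stand-ins.

class Trainer:
    def __init__(self, max_steps=10):
        self.step = 0
        self.max_steps = max_steps
        self.param = 5.0   # single scalar standing in for model parameters
        self.lr = 0.1

    def gradient_computation(self, batches):
        # Accumulate gradients over micro-batches (here: d/dp of (p - x)^2).
        grad = 0.0
        for x in batches:
            grad += 2.0 * (self.param - x)
        return grad / len(batches)

    def run_optimizer_step(self, grad):
        # Plain SGD update standing in for "runs all optimizer steps".
        self.param -= self.lr * grad

    def compute_global_loss(self, batches):
        # Average loss across batches, used for logging.
        return sum((self.param - x) ** 2 for x in batches) / len(batches)

    def should_continue_training(self):
        # Kept separate so other budgets (tokens, wall-clock, ...) can be
        # added later without touching train().
        return self.step < self.max_steps

    def train_step(self, batches):
        grad = self.gradient_computation(batches)
        self.run_optimizer_step(grad)
        self.step += 1
        return self.compute_global_loss(batches)

    def train(self, batches):
        loss = None
        while self.should_continue_training():
            loss = self.train_step(batches)
        return loss

trainer = Trainer(max_steps=50)
trainer.train([1.0, 3.0])        # targets with mean 2.0
print(round(trainer.param, 3))   # parameter converges toward 2.0
```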

Rollback Plan:

Differential Revision: D81185673

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 28, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81185673

Contributor

@fegin fegin left a comment


@Shagun-G Can you provide more context to justify the PR? I currently cannot see a strong benefit in creating three further sub-functions. This makes the train logic less straightforward.

@Shagun-G
Contributor Author

@fegin Thank you for taking a look at the PR. The goal of this PR is to divide the main steps of training within the train_step function into smaller components. This avoids code duplication during further development on training, by allowing the reuse of key components such as gradient computation and the optimizer step: anyone extending one of these components only needs to change its corresponding function instead of the entire train step. Examples include:

  1. To perform computations using gradients of the old parameters before taking the optimizer step, one can duplicate train_step but still call the gradient_computation and run_optimizer_step functions, and so avoid duplicating these key components.

  2. If one wants to expand the gradient computation logic, one can change only the gradient_computation function without interfering with the rest of the train cycle.

  3. The should_continue_training function allows adding termination criteria based on different budgets, such as the number of tokens, without having to duplicate the train() function.

I am open to suggestions and making changes, but the main goal was to make it easy to develop on top of torchtitan in terms of training algorithms, by inheriting code instead of duplicating it.
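Example 3 above can be sketched as follows (a hypothetical, self-contained illustration: BaseTrainer and the token-budget subclass are invented here to show the inheritance pattern, and do not reproduce actual torchtitan code):

```python
# Swapping the termination criterion via subclassing, while train() is
# inherited unchanged. All names and bodies are illustrative.

class BaseTrainer:
    def __init__(self):
        self.step = 0

    def should_continue_training(self):
        return self.step < 100          # default budget: step count

    def train_step(self):
        self.step += 1                  # stand-in for the real train step

    def train(self):
        while self.should_continue_training():
            self.train_step()

class TokenBudgetTrainer(BaseTrainer):
    """Stops on a token budget instead of a step count."""

    def __init__(self, token_budget, tokens_per_step):
        super().__init__()
        self.tokens_seen = 0
        self.token_budget = token_budget
        self.tokens_per_step = tokens_per_step

    def should_continue_training(self):
        # Only the criterion changes; train() and train_step() are reused.
        return self.tokens_seen < self.token_budget

    def train_step(self):
        super().train_step()
        self.tokens_seen += self.tokens_per_step

t = TokenBudgetTrainer(token_budget=1000, tokens_per_step=256)
t.train()
print(t.step, t.tokens_seen)   # 4 1024
```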

Contributor

@tianyu-l tianyu-l left a comment


Thank you for the PR and sharing interesting research work!

I agree that breaking the trainer into smaller pieces can enable more flexible experiments. However, there is a tradeoff between flexibility and readability / succinctness. In torchtitan we do bias toward simplicity, especially when such changes would benefit only a small group of people.

I do think that for the frontier work you mentioned, it might be worth having a separate train script, especially as your #1652 has landed. Note that even in torchtitan we host multiple train.py scripts, e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py

torchtitan is evolving, and we might find such changes to be necessary in the future, but right now we won't accept this PR.

@Shagun-G
Copy link
Contributor Author

@tianyu-l Thank you for the comments. I agree on the tradeoff and will be happy to revisit this in the future.
