
Conversation

Shagun-G
Contributor

Summary:
The refactor creates separate functions for the key steps within the train_step function, to allow easy changes in future work. It creates three new functions:

  1. gradient_computation: computes gradients via accumulation
  2. run_optimizer_step: runs all optimizer steps
  3. compute_global_loss: calculates the global loss for logging

The diff also creates a separate function, should_continue_training(), so that additional termination criteria can be added in the future without changing the train function.
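A minimal, self-contained sketch of the proposed structure (the four function names come from this PR; the bodies below are simplified stand-ins for illustration, not the actual torchtitan implementation):

```python
# Illustrative sketch of the refactor: train_step delegates to
# gradient_computation / run_optimizer_step / compute_global_loss, and
# train() consults should_continue_training(). The scalar "model" and
# squared-error loss here are hypothetical stand-ins.

class Trainer:
    def __init__(self, max_steps=10):
        self.step = 0
        self.max_steps = max_steps
        self.param = 5.0   # single scalar standing in for model parameters
        self.lr = 0.1

    def gradient_computation(self, batches):
        # Accumulate gradients over micro-batches (here: d/dp of (p - x)^2).
        grad = 0.0
        for x in batches:
            grad += 2.0 * (self.param - x)
        return grad / len(batches)

    def run_optimizer_step(self, grad):
        # Plain SGD update standing in for "runs all optimizer steps".
        self.param -= self.lr * grad

    def compute_global_loss(self, batches):
        # Average loss across batches, used for logging.
        return sum((self.param - x) ** 2 for x in batches) / len(batches)

    def should_continue_training(self):
        # Kept separate so other budgets (tokens, wall-clock, ...) can be
        # added later without touching train().
        return self.step < self.max_steps

    def train_step(self, batches):
        grad = self.gradient_computation(batches)
        self.run_optimizer_step(grad)
        self.step += 1
        return self.compute_global_loss(batches)

    def train(self, batches):
        loss = None
        while self.should_continue_training():
            loss = self.train_step(batches)
        return loss

trainer = Trainer(max_steps=50)
trainer.train([1.0, 3.0])        # targets with mean 2.0
print(round(trainer.param, 3))   # parameter converges toward 2.0
```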

Rollback Plan:

Differential Revision: D81185673

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 28, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81185673

Contributor

@fegin fegin left a comment


@Shagun-G Can you provide more context to justify the PR? I currently cannot see a strong benefit in creating three further sub-functions. This makes the train logic less straightforward.

@Shagun-G
Contributor Author

@fegin Thank you for taking a look at the PR. The goal of this PR is to divide the main steps of training within the train_step function into smaller components. This avoids code duplication during further development on training, by allowing the reuse of key components such as gradient computation and the optimizer step: anyone extending one of these components only needs to change its corresponding function instead of the entire train step. Examples include:

  1. To perform computations using gradients of the old parameters before taking the optimizer step, one can duplicate train_step but still call the gradient_computation and run_optimizer_step functions, and so avoid duplicating these key components.

  2. If one wants to expand the gradient computation logic, one can change only the gradient_computation function without interfering with the rest of the train cycle.

  3. The should_continue_training function allows adding termination criteria based on different budgets, such as the number of tokens, without having to duplicate the train() function.

I am open to suggestions and making changes, but the main goal was to make it easy to develop on top of torchtitan in terms of training algorithms, by inheriting code instead of duplicating it.
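Example 3 above can be sketched as follows (a hypothetical, self-contained illustration: BaseTrainer and the token-budget subclass are invented here to show the inheritance pattern, and do not reproduce actual torchtitan code):

```python
# Swapping the termination criterion via subclassing, while train() is
# inherited unchanged. All names and bodies are illustrative.

class BaseTrainer:
    def __init__(self):
        self.step = 0

    def should_continue_training(self):
        return self.step < 100          # default budget: step count

    def train_step(self):
        self.step += 1                  # stand-in for the real train step

    def train(self):
        while self.should_continue_training():
            self.train_step()

class TokenBudgetTrainer(BaseTrainer):
    """Stops on a token budget instead of a step count."""

    def __init__(self, token_budget, tokens_per_step):
        super().__init__()
        self.tokens_seen = 0
        self.token_budget = token_budget
        self.tokens_per_step = tokens_per_step

    def should_continue_training(self):
        # Only the criterion changes; train() and train_step() are reused.
        return self.tokens_seen < self.token_budget

    def train_step(self):
        super().train_step()
        self.tokens_seen += self.tokens_per_step

t = TokenBudgetTrainer(token_budget=1000, tokens_per_step=256)
t.train()
print(t.step, t.tokens_seen)   # 4 1024
```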

Contributor

@tianyu-l tianyu-l left a comment


Thank you for the PR and sharing interesting research work!

I agree that breaking the trainer into smaller pieces can enable more flexible experiments. However, there is a tradeoff between flexibility and readability / succinctness. In torchtitan we do bias toward simplicity, especially when such changes would benefit only a small group of people.

I do think that for the frontier work you mentioned, it might be worth having a separate train script, especially as your #1652 has landed. Note that even in torchtitan we host multiple train.py scripts, e.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py

torchtitan is evolving, and we might find such changes to be necessary in the future, but right now we won't accept this PR.

@Shagun-G
Copy link
Contributor Author

@tianyu-l Thank you for the comments. I agree on the tradeoff and will be happy to revisit this in the future.
