
Conversation


@Enkidu93 Enkidu93 commented Sep 9, 2025

Addresses sillsdev/serval#708 by:

  • Implementing the auto_grad_acc=True functionality from silnlp
  • Only making ClearML progress-reporting web API calls at the 'percent' frequency

Using fewer gradient accumulation steps when possible yields significant speed-ups, but memory constraints only occasionally allow it.
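The auto_grad_acc idea can be sketched as an out-of-memory-driven search (a hedged illustration, not the silnlp implementation; `MemoryError` stands in for `torch.cuda.OutOfMemoryError`, and all names are hypothetical):

```python
# Sketch of an auto_grad_acc-style search: start with the fewest gradient
# accumulation steps (largest per-device batch) and fall back to more
# accumulation whenever a trial step runs out of memory.

def find_grad_accum_steps(try_step, effective_batch_size, max_accum=32):
    """Return the smallest accumulation count whose per-device batch fits."""
    accum = 1
    while accum <= max_accum:
        per_device = effective_batch_size // accum
        try:
            try_step(per_device)  # run one forward/backward at this batch size
            return accum
        except MemoryError:       # stand-in for torch.cuda.OutOfMemoryError
            accum *= 2
    raise RuntimeError("no feasible gradient accumulation setting found")

# Toy capacity model: pretend only batches of <= 4 examples fit in memory.
def fake_step(per_device_batch):
    if per_device_batch > 4:
        raise MemoryError

print(find_grad_accum_steps(fake_step, effective_batch_size=16))  # → 4
```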

Only updating progress every (max_steps/100) steps yields a substantial speed-up: 1.75 hrs versus 3.25 hrs (doing no progress updates at all is 1.5 hrs). This seemed like a happy medium - if you think a different compromise is ideal, let me know.
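The throttling described above can be sketched as a step callback (hypothetical names; the real reporting goes through a ClearML progress callback in the trainer):

```python
# Sketch of percent-frequency progress reporting: fire the expensive
# update (a web API call in the real trainer) only once per ~1% of steps.
def make_percent_reporter(max_steps, update_progress):
    interval = max(max_steps // 100, 1)
    def on_step_end(step):
        if step % interval == 0 or step == max_steps:
            update_progress(100 * step / max_steps)
    return on_step_end

calls = []
report = make_percent_reporter(max_steps=1000, update_progress=calls.append)
for step in range(1, 1001):
    report(step)
print(len(calls))  # → 100 (instead of 1000 per-step calls)
```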

Overall, in a training run where auto_grad_acc doesn't do anything other than arrive at the default values:

| Type | Train time (hours) |
| --- | --- |
| Machine.py (Main) | 3.25 |
| Machine.py (This PR) | 1.75 |
| SILNLP | 1.25 |

Numbers are all approximate. They vary by about 5 minutes across multiple runs. Test job was training on a complete NT.

(As you can see, SILNLP is still slightly faster, but this may be related to the number of examples because of differences in key terms processing. See sillsdev/serval#751.)


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit September 9, 2025 21:53
@Enkidu93 (Collaborator, Author) commented:

(I'm still working on updating transformers to be consistent with silnlp, plus potential speed-ups.)

@codecov-commenter commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 72.22222% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.96%. Comparing base (0e707ce) to head (06845e5).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tion/huggingface/hugging_face_nmt_model_trainer.py | 72.22% | 10 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
- Coverage   91.11%   90.96%   -0.16%     
==========================================
  Files         334      334              
  Lines       21742    21431     -311     
==========================================
- Hits        19810    19494     -316     
- Misses       1932     1937       +5     


@Enkidu93 (Collaborator, Author) commented:

I've updated transformers, so this is ready for review. There's no appreciable difference with the update. I also updated the settings.yaml in machine.py with parameters I had updated in Serval, including: 1) those relevant to the auto_grad_acc-equivalent functionality and 2) setting tf32 to true (which likely won't make a big difference, but I thought updating it was appropriate even just for consistency with silnlp).

@ddaspit (Contributor) left a comment:

It would be good to use the recent update that allows you to disable attention outputs in HuggingFaceNmtModelFactory. This should allow SDPA to be used, which will improve performance. All that should be necessary is to pass output_attentions=False to the constructor for HuggingFaceNmtEngine.
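The performance rationale can be sketched with a toy dispatch (an illustration of the transformers behavior being described, not actual machine.py or transformers code): SDPA kernels never materialize the attention-weight tensors, so requesting them forces the slower eager path.

```python
# Toy dispatch illustrating why output_attentions=False matters (assumption:
# this mirrors the transformers behavior; it is not the library's code).
# SDPA kernels do not materialize attention weights, so asking for them
# forces the slower "eager" attention implementation.
def select_attention_impl(output_attentions: bool) -> str:
    return "eager" if output_attentions else "sdpa"

print(select_attention_impl(False))  # → sdpa
```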

Reviewable status: 0 of 5 files reviewed, all discussions resolved

@Enkidu93 (Collaborator, Author) commented:

> It would be good to use the recent update that allows you to disable attention outputs in HuggingFaceNmtModelFactory. This should allow SDPA to be used, which will improve performance. All that should be necessary is to pass output_attentions=False to the constructor for HuggingFaceNmtEngine.

Are you referring to the update Peter made? I.e., should I rebase, re-run, and check the train time? I thought that only affected inferencing? Does SDPA also need to be enabled separately for training?

@ddaspit (Contributor) left a comment:

It does only affect inferencing, but it would still speed up the overall build.

@ddaspit reviewed 2 of 4 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


pyproject.toml line 32 at r4 (raw file):

```toml
extraPaths = ["tests"]
reportMissingModuleSource = false
reportMissingImports = false
```

It is usually better to disable this on a specific line.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 270 at r4 (raw file):

```python
# as the first generated token. We ask the user to explicitly provide this as --forced_bos_token argument.
forced_bos_token_id = tokenizer.convert_tokens_to_ids(self._tgt_lang)
# model.config.forced_bos_token_id = forced_bos_token_id
```

Why was this commented out?

@Enkidu93 (Collaborator, Author) left a comment:

OK, makes sense. Done.

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


pyproject.toml line 32 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is usually better to disable this on a specific line.

Sorry - this was a temporary change I meant to revert; I think I made it while trying to get the debugger (which mysteriously died) to work.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 270 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this commented out?

There was a warning saying not to set it in the config, only in generation_config. But I think that was with different library versions while I was trying to update dependencies. Undone.
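The distinction can be sketched with stand-in objects (SimpleNamespace mocks here; in transformers these would be the real `model.config` and `model.generation_config`, and the token id is illustrative):

```python
# Sketch of the two places forced_bos_token_id can live. Newer transformers
# versions warn that generation parameters belong in generation_config
# rather than the model config.
from types import SimpleNamespace

model = SimpleNamespace(
    config=SimpleNamespace(forced_bos_token_id=None),
    generation_config=SimpleNamespace(forced_bos_token_id=None),
)

bos_id = 250_004  # illustrative target-language token id (e.g. NLLB-style)
model.config.forced_bos_token_id = bos_id             # legacy location
model.generation_config.forced_bos_token_id = bos_id  # preferred location
```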

@Enkidu93 (Collaborator, Author) left a comment:

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


pyproject.toml line 32 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sorry - this was a temporary change I meant to revert when I was trying to get the debugger (which mysteriously died) to work, I think.

Never mind - it was for the macOS CI build 💡. This is what happens when you're working on too many things at once haha. I went ahead and added it in-line.
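For reference, a per-line pyright suppression (illustrative module and diagnostic here, not the PR's exact line) looks like:

```python
# Per-line suppression: pyright honors an inline "pyright: ignore" comment
# scoped to a specific diagnostic, so pyproject.toml can stay strict.
import json  # pyright: ignore[reportMissingImports]

print(json.dumps({"suppressed": "inline"}))
```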

@ddaspit (Contributor) left a comment:

:lgtm:

@ddaspit reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 merged commit 3a79c67 into main Sep 15, 2025
14 checks passed
@Enkidu93 Enkidu93 deleted the faster_training branch September 15, 2025 18:33