
Conversation


@Enkidu93 Enkidu93 commented Sep 9, 2025

Addresses sillsdev/serval#708 by:

  • Implementing the auto_grad_acc=True functionality from silnlp
  • Only making ClearML progress-reporting web API calls at the 'percent' frequency

Using fewer gradient accumulation steps when possible yields significant speed-ups, but memory constraints only occasionally allow it.
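The auto_grad_acc idea can be sketched as an out-of-memory-driven search (a hedged illustration, not the silnlp implementation; `MemoryError` stands in for `torch.cuda.OutOfMemoryError`, and all names are hypothetical):

```python
# Sketch of an auto_grad_acc-style search: start with the fewest gradient
# accumulation steps (largest per-device batch) and fall back to more
# accumulation whenever a trial step runs out of memory.

def find_grad_accum_steps(try_step, effective_batch_size, max_accum=32):
    """Return the smallest accumulation count whose per-device batch fits."""
    accum = 1
    while accum <= max_accum:
        per_device = effective_batch_size // accum
        try:
            try_step(per_device)  # run one forward/backward at this batch size
            return accum
        except MemoryError:       # stand-in for torch.cuda.OutOfMemoryError
            accum *= 2
    raise RuntimeError("no feasible gradient accumulation setting found")

# Toy capacity model: pretend only batches of <= 4 examples fit in memory.
def fake_step(per_device_batch):
    if per_device_batch > 4:
        raise MemoryError

print(find_grad_accum_steps(fake_step, effective_batch_size=16))  # → 4
```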

Only updating progress every (max_steps/100) steps yields a substantial speed-up: 1.75 hrs versus 3.25 hrs (doing no progress updates at all is 1.5 hrs). This seemed like a happy medium - if you think a different compromise is ideal, let me know.
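The throttling described above can be sketched as a step callback (hypothetical names; the real reporting goes through a ClearML progress callback in the trainer):

```python
# Sketch of percent-frequency progress reporting: fire the expensive
# update (a web API call in the real trainer) only once per ~1% of steps.
def make_percent_reporter(max_steps, update_progress):
    interval = max(max_steps // 100, 1)
    def on_step_end(step):
        if step % interval == 0 or step == max_steps:
            update_progress(100 * step / max_steps)
    return on_step_end

calls = []
report = make_percent_reporter(max_steps=1000, update_progress=calls.append)
for step in range(1, 1001):
    report(step)
print(len(calls))  # → 100 (instead of 1000 per-step calls)
```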

Overall, in a training run where auto_grad_acc doesn't do anything other than arrive at the default values:

| Type | Train time (hours) |
| --- | --- |
| Machine.py (Main) | 3.25 |
| Machine.py (This PR) | 1.75 |
| SILNLP | 1.25 |

Numbers are all approximate. They vary by about 5 minutes across multiple runs. Test job was training on a complete NT.

(As you can see, SILNLP is still slightly faster, but this may be related to the number of examples because of differences in key terms processing. See sillsdev/serval#751.)


This change is Reviewable

@Enkidu93 Enkidu93 requested a review from ddaspit September 9, 2025 21:53
@Enkidu93 (Collaborator, Author) commented:

(I'm still working on updating transformers to be consistent with silnlp, plus potential speed-ups.)

@codecov-commenter commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 72.22222% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.96%. Comparing base (0e707ce) to head (06845e5).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...tion/huggingface/hugging_face_nmt_model_trainer.py | 72.22% | 10 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
- Coverage   91.11%   90.96%   -0.16%     
==========================================
  Files         334      334              
  Lines       21742    21431     -311     
==========================================
- Hits        19810    19494     -316     
- Misses       1932     1937       +5     


@Enkidu93 (Collaborator, Author) commented:

I've updated transformers, so this is ready for review. There's no appreciable difference with the update. I also updated the settings.yaml in machine.py with parameters I had updated in Serval, including: 1) those relevant to the auto_grad_acc-equivalent functionality and 2) setting tf32 to true (which likely won't make a big difference, but I thought updating it was appropriate even just for consistency with silnlp).

@ddaspit (Contributor) left a comment:

It would be good to use the recent update that allows you to disable attention outputs in HuggingFaceNmtModelFactory. This should allow SDPA to be used, which will improve performance. All that should be necessary is to pass output_attentions=False to the constructor for HuggingFaceNmtEngine.
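The performance rationale can be sketched with a toy dispatch (an illustration of the transformers behavior being described, not actual machine.py or transformers code): SDPA kernels never materialize the attention-weight tensors, so requesting them forces the slower eager path.

```python
# Toy dispatch illustrating why output_attentions=False matters (assumption:
# this mirrors the transformers behavior; it is not the library's code).
# SDPA kernels do not materialize attention weights, so asking for them
# forces the slower "eager" attention implementation.
def select_attention_impl(output_attentions: bool) -> str:
    return "eager" if output_attentions else "sdpa"

print(select_attention_impl(False))  # → sdpa
```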

Reviewable status: 0 of 5 files reviewed, all discussions resolved

@Enkidu93 (Collaborator, Author) commented:

> It would be good to use the recent update that allows you to disable attention outputs in HuggingFaceNmtModelFactory. This should allow SDPA to be used, which will improve performance. All that should be necessary is to pass output_attentions=False to the constructor for HuggingFaceNmtEngine.

Are you referring to the update Peter made? I.e., should I rebase, re-run, and check the train time? I thought that only affected inferencing? Does SDPA also need to be enabled separately for training?

@ddaspit (Contributor) left a comment:

It does only affect inferencing, but it would still speed up the overall build.

@ddaspit reviewed 2 of 4 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


pyproject.toml line 32 at r4 (raw file):

```toml
extraPaths = ["tests"]
reportMissingModuleSource = false
reportMissingImports = false
```

It is usually better to disable this on a specific line.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 270 at r4 (raw file):

```python
# as the first generated token. We ask the user to explicitly provide this as --forced_bos_token argument.
forced_bos_token_id = tokenizer.convert_tokens_to_ids(self._tgt_lang)
# model.config.forced_bos_token_id = forced_bos_token_id
```

Why was this commented out?

@Enkidu93 (Collaborator, Author) left a comment:

OK, makes sense. Done.

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


pyproject.toml line 32 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is usually better to disable this on a specific line.

Sorry - this was a temporary change I meant to revert; I think I made it while trying to get the debugger (which mysteriously died) to work.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 270 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this commented out?

There was a warning saying not to set it in the config, only in generation_config. But I think that was with different library versions while I was trying to update dependencies. Undone.
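The distinction can be sketched with stand-in objects (SimpleNamespace mocks here; in transformers these would be the real `model.config` and `model.generation_config`, and the token id is illustrative):

```python
# Sketch of the two places forced_bos_token_id can live. Newer transformers
# versions warn that generation parameters belong in generation_config
# rather than the model config.
from types import SimpleNamespace

model = SimpleNamespace(
    config=SimpleNamespace(forced_bos_token_id=None),
    generation_config=SimpleNamespace(forced_bos_token_id=None),
)

bos_id = 250_004  # illustrative target-language token id (e.g. NLLB-style)
model.config.forced_bos_token_id = bos_id             # legacy location
model.generation_config.forced_bos_token_id = bos_id  # preferred location
```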

@Enkidu93 (Collaborator, Author) left a comment:

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @ddaspit)


pyproject.toml line 32 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sorry - this was a temporary change I meant to revert when I was trying to get the debugger (which mysteriously died) to work, I think.

Never mind - it was for the macOS CI build 💡. This is what happens when you're working on too many things at once haha. I went ahead and added it in-line.
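For reference, a per-line pyright suppression (illustrative module and diagnostic here, not the PR's exact line) looks like:

```python
# Per-line suppression: pyright honors an inline "pyright: ignore" comment
# scoped to a specific diagnostic, so pyproject.toml can stay strict.
import json  # pyright: ignore[reportMissingImports]

print(json.dumps({"suppressed": "inline"}))
```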

@ddaspit (Contributor) left a comment:

:lgtm:

@ddaspit reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93 Enkidu93 merged commit 3a79c67 into main Sep 15, 2025
14 checks passed
@Enkidu93 Enkidu93 deleted the faster_training branch September 15, 2025 18:33