feat: add NaN capture and replay for fine-tuning by gabrielfruet · Pull Request #826 · lightly-ai/lightly-train

gabrielfruet · 2026-06-29T16:18:16Z

What has changed and why?

New NaNCapture debug tool for fine-tuning (train_* task APIs): when a
NaN/Inf is detected in parameter gradients (scanned after gradient
accumulation, before clip_gradients/optimizer.step), it captures
reproducible state and halts training.
The capture is a single self-contained file,
out_dir/debug/nan_capture/rank{R}/nan_capture.pt, holding the model state
dict, the TrainModel class path + init kwargs (for reconstruction), the
step's microbatches, and torch/CUDA RNG state. The standard
checkpoints/last.ckpt is not touched, so resume_interrupted is
unaffected.
Adds load_nan_capture(dir) + NaNCaptureState.replay() for zero-setup
replay: reconstructs the model, restores RNG, and re-runs the triggering
forward+backward (mirrors the training loop; stops before the optimizer step)
to reproduce the NaN deterministically in a notebook/REPL.
Wired into _commands/train_task.py alongside the existing underflow/overflow
monitor from feat: integrate HF DebugUnderflowOverflow into fine-tuning #814. Enable with debug_args={"nancapture": {"enabled": True}}.
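For readers unfamiliar with where the scan sits, here is a minimal sketch of the check described above: accumulate gradients over the step's microbatches, scan parameter gradients for NaN/Inf, and only then clip and step. It is not the code from this PR. The helper name `find_nonfinite_grads`, the toy `nn.Linear` model, and the hand-made bad microbatch are assumptions made for illustration.

```python
from typing import List

import torch
from torch import nn


def find_nonfinite_grads(model: nn.Module) -> List[str]:
    """Return the names of parameters whose gradient contains NaN or Inf.

    Hypothetical helper for illustration; not the API added by this PR.
    """
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Two microbatches; the second one is deliberately broken (all +inf).
microbatches = [torch.randn(3, 4), torch.full((3, 4), float("inf"))]

# Gradient accumulation over the microbatches of one optimizer step.
optimizer.zero_grad()
for x in microbatches:
    loss = model(x).sum() / len(microbatches)
    loss.backward()

# Scan after accumulation, before clipping and before optimizer.step().
bad_params = find_nonfinite_grads(model)
if bad_params:
    # A real tool would save the reproducible state here and halt training.
    print("non-finite gradients in:", bad_params)
else:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Scanning before the optimizer step means the weights are still the ones that produced the bad gradients, which is what makes a later replay meaningful.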

Replay is debug-only — the training loop carries no replay flag.
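The capture-and-replay idea can be sketched in plain PyTorch as follows. This is an illustration under stated assumptions, not the implementation of `load_nan_capture` or `NaNCaptureState.replay()`: the payload keys, the `forward_backward` and `build_model` helpers, and the toy model with dropout are all hypothetical, and only the CPU RNG state is handled here (the PR also stores CUDA RNG state).

```python
import copy
import io

import torch
from torch import nn


def build_model() -> nn.Module:
    # Dropout makes the forward pass depend on RNG state.
    return nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))


def forward_backward(model: nn.Module, batch: torch.Tensor) -> list:
    """Run forward + backward and return cloned gradients (no optimizer step)."""
    model.zero_grad()
    model(batch).sum().backward()
    return [p.grad.clone() for p in model.parameters()]


torch.manual_seed(0)
model = build_model()
model.train()
batch = torch.randn(5, 4)

# Capture weights, input, and RNG state *before* the step runs,
# then serialize everything into one self-contained payload.
payload = {
    "state_dict": copy.deepcopy(model.state_dict()),
    "batch": batch.clone(),
    "rng_state": torch.get_rng_state(),
}
buffer = io.BytesIO()
torch.save(payload, buffer)

original_grads = forward_backward(model, batch)

# Replay: rebuild the model, restore weights and RNG, re-run the step.
buffer.seek(0)
loaded = torch.load(buffer)
replayed_model = build_model()
replayed_model.train()
replayed_model.load_state_dict(loaded["state_dict"])
torch.set_rng_state(loaded["rng_state"])  # restore after model construction
replayed_grads = forward_backward(replayed_model, loaded["batch"])

reproduced = all(
    torch.equal(a, b) for a, b in zip(original_grads, replayed_grads)
)
print("gradients reproduced exactly:", reproduced)
```

Restoring the RNG state after the model is rebuilt matters, because constructing the model consumes random numbers for weight initialization.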

Closes TRN-2256.

How has it been tested?

pytest tests/_debug/test_nan_capture.py → 11/11 pass (config, monitor grad
scan + capture payload, buffer clone/detach/reset, replay roundtrip +
reproduction).
make format → clean (no unintended changes).
mypy on the 4 changed files → clean.

Did you update CHANGELOG.md?

Yes

Did you update the documentation?

Not needed (debugging API; docs deferred, matching feat: integrate HF DebugUnderflowOverflow into fine-tuning #814)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d330952ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gabrielfruet · 2026-06-29T20:39:50Z

/review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71b3fb4715

liopeer

LGTM! Addressing the comments is optional.

gabrielfruet marked this pull request as draft June 29, 2026 16:20

chatgpt-codex-connector Bot reviewed Jun 29, 2026

Comment thread src/lightly_train/_commands/train_task.py Outdated

Comment thread src/lightly_train/_commands/train_task.py Outdated

gabrielfruet marked this pull request as ready for review June 29, 2026 20:39

chatgpt-codex-connector Bot reviewed Jun 29, 2026

Comment thread src/lightly_train/_debug/nan_capture.py Outdated

Comment thread src/lightly_train/_commands/train_task.py Outdated

gabrielfruet force-pushed the gabriel-trn-2255-implement-debugunderflowoverflow-from-huggingface branch from e20e8aa to e18239b Compare June 30, 2026 17:31

Base automatically changed from gabriel-trn-2255-implement-debugunderflowoverflow-from-huggingface to main July 1, 2026 12:39

gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch 2 times, most recently from 532d2ba to 50b9143 Compare July 1, 2026 13:54

feat: add NaN capture and replay for fine-tuning

220a02a

gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch from 50b9143 to 220a02a Compare July 1, 2026 14:59

liopeer approved these changes Jul 2, 2026

Comment thread src/lightly_train/_commands/train_task.py Outdated

Comment thread src/lightly_train/_debug/nan_capture.py

gabrielfruet added 2 commits July 2, 2026 16:33

style: use 3.8-compatible comma form for debug monitors

7e5319e

docs: cite Chaim Rand NaN capture-and-replay reference

90ca5d7

gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch from 40f6516 to 90ca5d7 Compare July 2, 2026 19:33

Merge branch 'main' into gabriel-trn-2256-implement-nancapture-for-re…

4c75391

…producing-bad-batch

gabrielfruet enabled auto-merge (squash) July 3, 2026 13:16

gabrielfruet merged commit 09cdf4c into main Jul 3, 2026
13 checks passed

gabrielfruet deleted the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch July 3, 2026 13:42

gabrielfruet merged 4 commits into main from gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch
