feat: add NaN capture and replay for fine-tuning #826

Merged: gabrielfruet merged 4 commits into main from gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch on Jul 3, 2026

@gabrielfruet

Contributor

What has changed and why?

  • New NaNCapture debug tool for fine-tuning (train_* task APIs): when a
    NaN/Inf is detected in parameter gradients (scanned after gradient
    accumulation, before clip_gradients/optimizer.step), it captures
    reproducible state and halts training.
  • Capture is a single self-contained file
    out_dir/debug/nan_capture/rank{R}/nan_capture.pt holding the model state
    dict, the TrainModel class path + init kwargs (for reconstruction), the
    step's microbatches, and torch/CUDA RNG state. The standard
    checkpoints/last.ckpt is not touched, so resume_interrupted is
    unaffected.
  • Adds load_nan_capture(dir) + NaNCaptureState.replay() for zero-setup
    replay: reconstructs the model, restores RNG, and re-runs the triggering
    forward+backward (mirrors the training loop; stops before the optimizer step)
    to reproduce the NaN deterministically in a notebook/REPL.
  • Wired into _commands/train_task.py alongside the existing underflow/overflow
    monitor from #814 (feat: integrate HF DebugUnderflowOverflow into fine-tuning). Enable with debug_args={"nancapture": {"enabled": True}}.
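
For readers unfamiliar with the scan placement described in the first bullet, here is a minimal sketch in plain PyTorch. It is not the NaNCapture implementation from this PR: `find_nonfinite_grads` and `train_step` are hypothetical names, and the loss and model are placeholders. It only illustrates checking parameter gradients after accumulation and before clipping and the optimizer step.

```python
from typing import List, Tuple

import torch
from torch import nn


def find_nonfinite_grads(model: nn.Module) -> List[str]:
    """Return names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


def train_step(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    microbatches: List[Tuple[torch.Tensor, torch.Tensor]],
    max_norm: float = 1.0,
) -> List[str]:
    """Accumulate gradients, scan them, and only clip and step if they are finite.

    Returns the offending parameter names, or an empty list if the step ran.
    """
    optimizer.zero_grad()
    for x, y in microbatches:
        loss = nn.functional.mse_loss(model(x), y) / len(microbatches)
        loss.backward()  # gradients accumulate across microbatches
    bad = find_nonfinite_grads(model)
    if bad:
        return bad  # a real tool would capture state and halt here
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return []


# Hypothetical demo: one clean step, then a step whose input contains Inf.
torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
good = [(torch.randn(3, 4), torch.randn(3, 2)) for _ in range(2)]
clean_result = train_step(model, optimizer, good)

bad_x = torch.randn(3, 4)
bad_x[0, 0] = float("inf")
weights_before = model.weight.detach().clone()
bad_result = train_step(model, optimizer, [(bad_x, torch.randn(3, 2))])
```

Scanning before `clip_grad_norm_` matters because a non-finite gradient makes the total norm non-finite, so clipping would spread NaN into every gradient and hide which parameters were affected first.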

Replay is debug-only — the training loop carries no replay flag.
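
The capture and replay roundtrip can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the PR's `load_nan_capture` / `NaNCaptureState.replay()` code: `save_capture` and `replay_capture` are hypothetical helpers, only the CPU RNG state is stored (the PR also stores CUDA RNG state), and the caller supplies a freshly built model instead of reconstructing it from a class path and init kwargs.

```python
import os
import tempfile
from typing import List

import torch
from torch import nn


def save_capture(
    path: str,
    model: nn.Module,
    microbatches: List[torch.Tensor],
    rng_state: torch.Tensor,
) -> None:
    """Write weights, the step's microbatches, and RNG state to a single file."""
    payload = {
        "state_dict": {k: v.detach().cpu().clone() for k, v in model.state_dict().items()},
        "microbatches": [mb.detach().cpu().clone() for mb in microbatches],
        "cpu_rng_state": rng_state,  # must be taken before the step's forward pass
    }
    torch.save(payload, path)


def replay_capture(path: str, model: nn.Module) -> List[str]:
    """Restore captured state and redo forward+backward; no optimizer step."""
    payload = torch.load(path, map_location="cpu")
    model.load_state_dict(payload["state_dict"])
    torch.set_rng_state(payload["cpu_rng_state"])  # same dropout masks as in training
    model.zero_grad()
    batches = payload["microbatches"]
    for mb in batches:
        loss = model(mb).pow(2).mean() / len(batches)  # placeholder loss
        loss.backward()
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]


def build_model() -> nn.Module:
    # Hypothetical stand-in for reconstructing the model from stored init kwargs.
    return nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.5), nn.Linear(4, 1))


torch.manual_seed(0)
model = build_model()
x = torch.randn(2, 4)
x[0, 0] = float("inf")  # the "bad batch"
rng_state = torch.get_rng_state()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nan_capture.pt")
    save_capture(path, model, [x], rng_state)
    fresh = build_model()  # different random init, overwritten on replay
    bad_params = replay_capture(path, fresh)
```

Because the file holds everything the step needs, the replay runs against a fresh process with no dataloader or trainer setup, which is the property the PR description calls zero-setup replay.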

Closes TRN-2256.

How has it been tested?

  • pytest tests/_debug/test_nan_capture.py → 11/11 pass (config, monitor grad
    scan + capture payload, buffer clone/detach/reset, replay roundtrip +
    reproduction).
  • make format → clean (no unintended changes).
  • mypy on the 4 changed files → clean.

Did you update CHANGELOG.md?

  • Yes

Did you update the documentation?

@gabrielfruet gabrielfruet marked this pull request as draft June 29, 2026 16:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d330952ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/lightly_train/_commands/train_task.py Outdated
Comment thread src/lightly_train/_commands/train_task.py Outdated
@gabrielfruet

Contributor Author

/review

@gabrielfruet gabrielfruet marked this pull request as ready for review June 29, 2026 20:39

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71b3fb4715

Comment thread src/lightly_train/_debug/nan_capture.py Outdated
Comment thread src/lightly_train/_commands/train_task.py Outdated
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2255-implement-debugunderflowoverflow-from-huggingface branch from e20e8aa to e18239b Compare June 30, 2026 17:31
Base automatically changed from gabriel-trn-2255-implement-debugunderflowoverflow-from-huggingface to main July 1, 2026 12:39
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch 2 times, most recently from 532d2ba to 50b9143 Compare July 1, 2026 13:54
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch from 50b9143 to 220a02a Compare July 1, 2026 14:59

@liopeer liopeer left a comment

Contributor

LGTM! Addressing the comments is optional.

Comment thread src/lightly_train/_commands/train_task.py Outdated
Comment thread src/lightly_train/_debug/nan_capture.py
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch from 40f6516 to 90ca5d7 Compare July 2, 2026 19:33
@gabrielfruet gabrielfruet enabled auto-merge (squash) July 3, 2026 13:16
@gabrielfruet gabrielfruet merged commit 09cdf4c into main Jul 3, 2026
13 checks passed
@gabrielfruet gabrielfruet deleted the gabriel-trn-2256-implement-nancapture-for-reproducing-bad-batch branch July 3, 2026 13:42
2 participants