docs: add model instability debugging page #813

Open
gabrielfruet wants to merge 5 commits into main from gabriel-trn-2254-model-instability-debugging-docs

@gabrielfruet

Contributor

What has changed and why?

Adds a new documentation page for Model Instability Debugging (docs/source/debugging/model_instability.md), linked from the main docs nav and from the gradient_norm entry on the Train Settings page.

This is a follow-up to #811 and is intended to be the home for instability-debugging tooling as new methods land. For now it documents gradient norm logging:

  • What training instability looks like (loss spikes, NaN/inf collapse, plateaus).
  • How to view gradient_norm (console grad_norm, TensorBoard, MLflow, W&B).
  • How to interpret the trend (stable / exploding / vanishing) rather than absolute thresholds.
  • Common next actions (LR, precision, resume_interrupted, normalization).
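The "trend rather than absolute thresholds" point above can be illustrated with a minimal sketch. It compares the median of the most recent logged gradient norms against the median of the earliest ones. The function name `classify_grad_norm_trend`, the `window` size, and the `ratio` cutoff are illustrative assumptions and are not taken from the documentation page; sensible values depend on the model and the logging interval.

```python
import math
from statistics import median


def classify_grad_norm_trend(norms, window=50, ratio=10.0):
    """Label a logged gradient-norm history by its trend.

    Hypothetical helper: `window` and `ratio` are illustrative defaults,
    not recommended thresholds.
    """
    # Any NaN/inf value means the run has already collapsed.
    if any(not math.isfinite(n) for n in norms):
        return "non-finite"
    # Two non-overlapping windows are needed for a comparison.
    if len(norms) < 2 * window:
        return "insufficient-data"

    early = median(norms[:window])
    late = median(norms[-window:])
    # Guard against division by zero when early norms are exactly 0.
    growth = late / max(early, 1e-12)

    if growth >= ratio:
        return "exploding"
    if growth <= 1.0 / ratio:
        return "vanishing"
    return "stable"


if __name__ == "__main__":
    history = [1.0 * 1.05 ** step for step in range(200)]
    print(classify_grad_norm_trend(history))
```

Using medians instead of single values keeps one isolated spike from changing the label, which matches the intent of looking at the trend rather than individual readings.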

The page is structured so future debugging tools can be appended under the same section.

How has it been tested?

  • mdformat --check on all three changed files → exit 0.
  • sphinx-build -b html --fail-on-warning --keep-going source build/local → exit 0, no warnings.
    Built with PYTHONPATH=../src ../.venv/bin/sphinx-build to work around the worktree venv's editable-source gotcha.
  • Verified both cross-references (../settings/train_settings.md, ../faq.md) resolve.

Did you update CHANGELOG.md?

  • Yes
  • Not needed (documentation-only change)

Did you update the documentation?

  • Yes
  • Not needed (internal change without effects for user)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09843e56bf

Comment on lines +73 to +74
- Resume from the last good checkpoint using
[`resume_interrupted`](../settings/train_settings.md) after changing settings.

P2: Don't resume interrupted runs after changing settings

The mitigation flow has the user lower the LR or change precision and then use resume_interrupted after changing settings. The train settings page documents resume_interrupted as crash recovery only and says training arguments must not be modified. The training loop restores the optimizer and scheduler state from out/checkpoints/last.ckpt, so changed hyperparameters such as the LR or its schedule may be overwritten or mixed with the old run state. Users following this advice may not actually test the intended fix and may keep resuming the unstable run. This should point to the “load checkpoint for a new run” flow instead.
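The concern can be illustrated with a toy model of checkpoint restore. This is a minimal sketch under stated assumptions, not the project's training loop: `ToyOptimizer`, `toy_resume_interrupted`, and `toy_new_run_from_checkpoint` are hypothetical names, and the checkpoint is a plain dict standing in for out/checkpoints/last.ckpt.

```python
import copy


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer whose state includes the LR."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]
        self.step_count = 0

    def state_dict(self):
        return {
            "param_groups": copy.deepcopy(self.param_groups),
            "step_count": self.step_count,
        }

    def load_state_dict(self, state):
        # Restoring state replaces the hyperparameters set in __init__.
        self.param_groups = copy.deepcopy(state["param_groups"])
        self.step_count = state["step_count"]


def toy_resume_interrupted(new_lr, checkpoint):
    """Crash-recovery style resume: optimizer state comes from the checkpoint."""
    optimizer = ToyOptimizer(lr=new_lr)
    optimizer.load_state_dict(checkpoint["optimizer"])
    return optimizer


def toy_new_run_from_checkpoint(new_lr, checkpoint):
    """New-run style load: only weights would be reused, optimizer is fresh."""
    return ToyOptimizer(lr=new_lr)


if __name__ == "__main__":
    unstable = ToyOptimizer(lr=1e-3)
    unstable.step_count = 1000
    checkpoint = {"optimizer": unstable.state_dict()}

    resumed = toy_resume_interrupted(new_lr=1e-4, checkpoint=checkpoint)
    fresh = toy_new_run_from_checkpoint(new_lr=1e-4, checkpoint=checkpoint)
    print(resumed.param_groups[0]["lr"], fresh.param_groups[0]["lr"])
```

In this sketch the resumed optimizer ends up with the old learning rate even though a lower one was requested, while the fresh optimizer keeps the new value. Whether the real training loop behaves exactly this way depends on its restore logic.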

@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 09843e5 to 1dfbf74 Compare June 25, 2026 17:39
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2254-gradient-norm-logging branch from feccfc0 to 230c7b2 Compare June 25, 2026 18:01
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 273b7dc to 2ba11f8 Compare June 25, 2026 19:07
@gabrielfruet

Contributor Author

/review

Base automatically changed from gabriel-trn-2254-gradient-norm-logging to main June 26, 2026 13:20
@gabrielfruet gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 2ba11f8 to 8a42873 Compare June 29, 2026 11:44
Comment thread docs/source/debugging/model_instability.md
gabrielfruet and others added 2 commits June 29, 2026 12:08
Corrected the description of the gradient norm calculation method in the documentation.

@mrpositron mrpositron left a comment

Contributor

LGTM!

@gabrielfruet gabrielfruet enabled auto-merge (squash) June 29, 2026 19:58