docs: add model instability debugging page by gabrielfruet · Pull Request #813 · lightly-ai/lightly-train

gabrielfruet · 2026-06-25T16:45:25Z

What has changed and why?

Adds a new documentation page for Model Instability Debugging (docs/source/debugging/model_instability.md), linked from the main docs nav and from the gradient_norm entry on the Train Settings page.

This is a follow-up to #811 and is intended to be the home for instability-debugging tooling as new methods land. For now it documents gradient norm logging:

- What training instability looks like (loss spikes, NaN/inf collapse, plateaus).
- How to view `gradient_norm` (console `grad_norm`, TensorBoard, MLflow, W&B).
- How to interpret the trend (stable / exploding / vanishing) rather than absolute thresholds.
- Common next actions (LR, precision, `resume_interrupted`, normalization).
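
To illustrate what interpreting the trend rather than absolute thresholds can mean, here is a minimal sketch that labels a series of logged gradient norms by comparing the run's late values against its own early values. The function name `classify_grad_norm_trend` and the `window` and `ratio` defaults are hypothetical choices for this example; they are not part of lightly-train or of the new docs page.

```python
import math
from statistics import median


def classify_grad_norm_trend(norms, window=5, ratio=10.0):
    """Label logged gradient norms as stable, exploding, or vanishing.

    `window` and `ratio` are illustrative defaults, not values taken
    from lightly-train.
    """
    # A NaN or inf value signals a collapse regardless of the trend.
    if any(not math.isfinite(n) for n in norms):
        return "non-finite"
    # Two non-overlapping windows are needed to compare early and late values.
    if len(norms) < 2 * window:
        return "insufficient-data"

    # Medians make the comparison robust to a single outlier step.
    early = median(norms[:window])
    late = median(norms[-window:])
    if early == 0.0:
        return "vanishing" if late == 0.0 else "exploding"

    # Compare against the run's own baseline instead of a fixed threshold.
    change = late / early
    if change >= ratio:
        return "exploding"
    if change <= 1.0 / ratio:
        return "vanishing"
    return "stable"
```

Because the label depends only on the ratio between late and early values, the same check applies to models whose typical gradient norms differ by orders of magnitude.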

The page is structured so future debugging tools can be appended under the same section.

How has it been tested?

- `mdformat --check` on all three changed files → exit 0.
- `sphinx-build -b html --fail-on-warning --keep-going source build/local` → exit 0, no warnings.
- Built with `PYTHONPATH=../src ../.venv/bin/sphinx-build` (worktree venv editable-source gotcha).
- Verified both cross-references (`../settings/train_settings.md`, `../faq.md`) resolve.

Did you update CHANGELOG.md?

Yes
Not needed (documentation-only change)

Did you update the documentation?

Yes
Not needed (internal change without effects for user)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09843e56bf

chatgpt-codex-connector · 2026-06-25T16:47:24Z

+  - Resume from the last good checkpoint using
+    [`resume_interrupted`](../settings/train_settings.md) after changing settings.


Don't resume interrupted runs after changing settings

In the mitigation flow where the user lowers LR or changes precision, this tells them to use resume_interrupted after changing settings. The train settings page documents resume_interrupted as crash recovery only and says training arguments must not be modified; the training loop restores the optimizer and scheduler state from out/checkpoints/last.ckpt, so changed hyperparameters like LR/schedules may be overwritten or mixed with the old run state. Users following this advice may not actually test the intended fix and can keep resuming the unstable run; this should point to the “load checkpoint for a new run” flow instead.
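
The concern can be sketched with a toy example. `ToyOptimizer` is a hypothetical stand-in written for this illustration and is not lightly-train code; it only assumes, as the comment above describes, that resuming restores the saved optimizer state on top of whatever the new configuration requested.

```python
class ToyOptimizer:
    """Hypothetical stand-in for an optimizer that keeps its LR in its state."""

    def __init__(self, lr):
        self.lr = lr

    def state_dict(self):
        return {"lr": self.lr}

    def load_state_dict(self, state):
        # Restoring state replaces the current hyperparameters with the saved ones.
        self.lr = state["lr"]


# The unstable run saved its checkpoint with the original learning rate.
checkpoint = ToyOptimizer(lr=1e-3).state_dict()

# "Resuming" after lowering the LR: the restored state overrides the new value.
resumed = ToyOptimizer(lr=1e-4)
resumed.load_state_dict(checkpoint)

# Starting a new run instead: no optimizer state is restored, so the lowered
# LR is the one that takes effect.
fresh = ToyOptimizer(lr=1e-4)

print(resumed.lr, fresh.lr)
```

In this toy setup the resumed optimizer ends up with the old learning rate, while the fresh one keeps the lowered value, which is why the suggestion points at the new-run flow.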

gabrielfruet · 2026-06-26T00:58:32Z

/review

gabrielfruet · 2026-06-26T00:59:32Z

/review

Corrected the description of the gradient norm calculation method in the documentation.

mrpositron

LGTM!

chatgpt-codex-connector Bot reviewed Jun 25, 2026

gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 09843e5 to 1dfbf74 Compare June 25, 2026 17:39

gabrielfruet force-pushed the gabriel-trn-2254-gradient-norm-logging branch from feccfc0 to 230c7b2 Compare June 25, 2026 18:01

gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 273b7dc to 2ba11f8 Compare June 25, 2026 19:07

Base automatically changed from gabriel-trn-2254-gradient-norm-logging to main June 26, 2026 13:20

gabrielfruet added 2 commits June 29, 2026 08:36

docs: add model instability debugging page

40fb7b3

formatting and fixes

8a42873

gabrielfruet force-pushed the gabriel-trn-2254-model-instability-debugging-docs branch from 2ba11f8 to 8a42873 Compare June 29, 2026 11:44

mrpositron reviewed Jun 29, 2026

Comment thread docs/source/debugging/model_instability.md

gabrielfruet and others added 2 commits June 29, 2026 12:08

Fix gradient norm calculation description

59740ec

Corrected the description of the gradient norm calculation method in the documentation.

format

3593d75

mrpositron approved these changes Jun 29, 2026

Merge branch 'main' into gabriel-trn-2254-model-instability-debugging…

4ccc91f

…-docs

gabrielfruet enabled auto-merge (squash) June 29, 2026 19:58

gabrielfruet wants to merge 5 commits into main from gabriel-trn-2254-model-instability-debugging-docs

2 participants
