docs: add model instability debugging page #813
Open
gabrielfruet wants to merge 5 commits into main from gabriel-trn-2254-model-instability-debugging-docs
40fb7b3 docs: add model instability debugging page (gabrielfruet)
8a42873 formatting and fixes (gabrielfruet)
59740ec Fix gradient norm calculation description (gabrielfruet)
3593d75 format (gabrielfruet)
4ccc91f Merge branch 'main' into gabriel-trn-2254-model-instability-debugging… (gabrielfruet)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,82 @@ | ||
| (model-instability-debugging)= | ||
|
|
||
| # Model Instability Debugging | ||
|
|
||
| Training instabilities — such as exploding or vanishing gradients, sudden loss spikes, | ||
| or numerical collapse to `NaN`/`inf` — can derail a run silently or abruptly. This page | ||
| collects the tools LightlyTrain provides to detect and diagnose these issues. | ||
|
|
||
| :::{note} | ||
| This section is growing. More debugging tools will be documented here as they | ||
| are added. The first available tool is gradient norm logging. | ||
| ::: | ||
|
|
||
| ## What Instability Looks Like | ||
|
|
||
| Common symptoms of an unstable run: | ||
|
|
||
| - The training loss spikes sharply or collapses to `NaN`/`inf`. | ||
| - The loss plateaus at a high value and never improves. | ||
| - The model stops learning partway through training (validation metrics flatten or | ||
| regress). | ||
| - Training crashes with a numerical error during the forward or backward pass. | ||
|
|
||
| Not all of these indicate instability: a high plateau can also be caused by a learning | ||
| rate that is too low or by a data issue. Use the tools below to distinguish between them. | ||
|
|
||
| ## Gradient Norm Logging | ||
|
|
||
| The total gradient norm is the single most useful signal for spotting exploding and | ||
| vanishing gradients. LightlyTrain logs it during training under one metric: | ||
|
|
||
| - `gradient_norm`: Total gradient norm computed after backpropagation, before the | ||
| optimizer step. If gradient clipping is enabled (`gradient_clip_val > 0`), this is the | ||
| norm measured before clipping is applied; otherwise it is computed directly as the L2 | ||
| norm of the gradients. It is also shown in the console progress line as `grad_norm`. | ||
|
|
||
| It is written to all configured loggers (`metrics.jsonl`, TensorBoard, MLflow, Weights & | ||
| Biases) at the cadence set by | ||
| [`log_every_num_steps`](../settings/train_settings.md#log_every_num_steps). | ||
|
|
||
| ### How to View the Gradient Norm | ||
|
|
||
| - **Console:** The progress line shows `grad_norm` for each logged training step. | ||
|
|
||
| - **TensorBoard:** Plot `gradient_norm` over training steps: | ||
|
|
||
| ```bash | ||
| tensorboard --logdir out/my_experiment | ||
| ``` | ||
|
|
||
| - **MLflow / Weights & Biases:** The `gradient_norm` metric is available under the same | ||
| key. See [](../settings/train_settings.md) for how to enable these loggers. | ||
|
|
||
| ### How to Interpret the Trend | ||
|
|
||
| Interpret the gradient norm as a trend over steps, not as an isolated value. Its | ||
| absolute scale depends on the model, dataset, and batch size, so there is no universal | ||
| "good" value. What matters is the shape: | ||
|
|
||
| - **Stable:** The norm fluctuates within a steady band across training. | ||
| - **Exploding gradients:** The norm grows rapidly, often by several orders of magnitude, | ||
| and may precede a loss spike or a `NaN` collapse. | ||
| - **Vanishing gradients:** The norm shrinks toward zero and stays there, often | ||
| accompanying a loss that no longer decreases. | ||
|
|
||
| A short-lived spike during warmup or learning-rate scheduling is usually normal. A | ||
| persistent upward or downward drift is the signal to act on. | ||
|
|
||
| ### Common Next Actions | ||
|
|
||
| - **Exploding gradients:** | ||
| - Lower the learning rate with [`model_args.lr`](../settings/train_settings.md). | ||
| - Switch to a more stable precision, e.g. `precision="bf16-mixed"` or | ||
| `precision="32-true"` (see [](../settings/train_settings.md)). | ||
| - **Vanishing gradients:** | ||
| - Increase the learning rate, especially for small models (~10M parameters or fewer). | ||
| - Check that the input normalization in `transform_args` matches your data | ||
| distribution. | ||
| - **NaN/inf collapse:** Re-run from the latest checkpoint. If it reproduces, switch to | ||
| `precision="32-true"` to isolate whether the instability is caused by | ||
| reduced-precision arithmetic. | ||
|
|
||
| See the FAQ entry on [improving model performance](../faq.md) for broader guidance on | ||
| stable training. | ||
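Two ideas from the page can be sketched in plain Python: how a total L2 gradient norm is formed from per-parameter gradients, and how a logged `gradient_norm` series might be labelled as stable, exploding, or vanishing. This is a minimal illustration, not LightlyTrain's implementation. The function names `total_l2_norm` and `classify_trend` and the `window` and `factor` thresholds are hypothetical, and the norms are passed in as a plain list because the layout of `metrics.jsonl` is not shown here.

```python
import math
from statistics import median


def total_l2_norm(per_parameter_grads):
    """Total L2 norm over all gradient entries: sqrt of the sum of squares."""
    return math.sqrt(sum(g * g for grads in per_parameter_grads for g in grads))


def classify_trend(norms, window=5, factor=100.0):
    """Label a gradient-norm series by comparing early and late medians.

    `window` and `factor` are illustrative thresholds (assumptions), not
    LightlyTrain defaults. Medians are used so that a single short-lived
    spike does not change the label.
    """
    if any(math.isnan(n) or math.isinf(n) for n in norms):
        return "non-finite"
    if len(norms) < 2 * window:
        return "too short"
    early = median(norms[:window])
    late = median(norms[-window:])
    if late == 0.0:
        return "vanishing"
    if late > factor * early:
        return "exploding"
    if late * factor < early:
        return "vanishing"
    return "stable"


if __name__ == "__main__":
    print(total_l2_norm([[3.0], [4.0]]))
    print(classify_trend([1.0] * 5 + [1e3] * 5))
```

Comparing medians rather than individual values follows the page's advice to read the norm as a trend: an isolated warmup spike leaves the label at "stable", while a sustained drift of two orders of magnitude in either direction changes it.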