docs: add model instability debugging page #813
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09843e56bf
- Resume from the last good checkpoint using
  [`resume_interrupted`](../settings/train_settings.md) after changing settings.
Don't resume interrupted runs after changing settings
In the mitigation flow, where the user lowers the LR or changes precision, this tells them to use `resume_interrupted` after changing settings. The train settings page documents `resume_interrupted` as crash recovery only and says training arguments must not be modified. The training loop restores the optimizer and scheduler state from `out/checkpoints/last.ckpt`, so changed hyperparameters such as LR or schedules may be overwritten or mixed with the old run state. Users following this advice may not actually test the intended fix and may keep resuming the unstable run. This should point to the “load checkpoint for a new run” flow instead.
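The concern in this comment can be illustrated with a toy sketch. It does not use the project's API: `ToyOptimizer`, `resume_same_run`, and `start_new_run_from_weights` are hypothetical names, and the checkpoint is a plain dictionary. It only shows the general pattern that restoring saved optimizer state can replace a newly chosen learning rate, while loading weights into a fresh run keeps it.

```python
import copy


class ToyOptimizer:
    """Hypothetical stand-in for a framework optimizer (not a real library class)."""

    def __init__(self, lr):
        self.lr = lr
        self.step_count = 0

    def state_dict(self):
        return {"lr": self.lr, "step_count": self.step_count}

    def load_state_dict(self, state):
        # Restoring state replaces the values passed to the constructor.
        self.lr = state["lr"]
        self.step_count = state["step_count"]


def resume_same_run(checkpoint, new_lr):
    """Crash-recovery style resume: the saved optimizer state wins."""
    opt = ToyOptimizer(lr=new_lr)
    opt.load_state_dict(copy.deepcopy(checkpoint["optimizer"]))
    return opt


def start_new_run_from_weights(checkpoint, new_lr):
    """New-run style load: reuse only the weights, start the optimizer fresh."""
    weights = copy.deepcopy(checkpoint["weights"])
    opt = ToyOptimizer(lr=new_lr)
    return weights, opt


# An unstable run saved with lr=1e-3 after 500 steps (illustrative values).
old_opt = ToyOptimizer(lr=1e-3)
old_opt.step_count = 500
checkpoint = {"weights": [0.1, -0.2], "optimizer": old_opt.state_dict()}

resumed = resume_same_run(checkpoint, new_lr=1e-4)
weights, fresh = start_new_run_from_weights(checkpoint, new_lr=1e-4)

print(resumed.lr)  # 0.001, the lowered LR was discarded by the restore
print(fresh.lr)    # 0.0001, the lowered LR is actually in effect
```

In the toy model the resumed optimizer also keeps its old step count, which is why a resumed scheduler would continue the old schedule rather than start the new one.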
09843e5 to 1dfbf74
feccfc0 to 230c7b2
273b7dc to 2ba11f8
/review
2ba11f8 to 8a42873
Corrected the description of the gradient norm calculation method in the documentation.
What has changed and why?
Adds a new documentation page for Model Instability Debugging (`docs/source/debugging/model_instability.md`), linked from the main docs nav and from the `gradient_norm` entry on the Train Settings page.

This is a follow-up to #811 and is intended to be the home for instability-debugging tooling as new methods land. For now it documents gradient norm logging:

- … `gradient_norm` (console `grad_norm`, TensorBoard, MLflow, W&B).
- … `resume_interrupted`, normalization).

The page is structured so future debugging tools can be appended under the same section.
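The PR text does not say which norm `gradient_norm` logs. As an illustration only, and assuming the common convention of a global L2 norm over all parameter gradients, the quantity can be sketched in plain Python. `global_grad_norm` is a hypothetical helper, and gradients are represented as nested lists rather than framework tensors.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every parameter as one flat vector.

    `grads` is a list of per-parameter gradients, each a flat list of floats.
    This is an assumed definition, not necessarily the one the docs page uses.
    """
    return math.sqrt(sum(g * g for param_grad in grads for g in param_grad))


# Two parameters with one gradient value each (illustrative values).
grads = [[3.0], [4.0]]

total = global_grad_norm(grads)
per_param = [global_grad_norm([g]) for g in grads]

print(total)      # 5.0
print(per_param)  # [3.0, 4.0]
```

Under this definition the global norm is not the sum of the per-parameter norms (5.0 here, not 7.0), which matters when comparing logged values against a clipping threshold.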
How has it been tested?
- `mdformat --check` on all three changed files → exit 0.
- `sphinx-build -b html --fail-on-warning --keep-going source build/local` → exit 0, no warnings.
  Built with `PYTHONPATH=../src ../.venv/bin/sphinx-build` (worktree venv editable-source gotcha).
- … `../settings/train_settings.md`, `../faq.md`) resolve.

Did you update CHANGELOG.md?
Did you update the documentation?