TrainCheck: Training with Confidence

TrainCheck is a lightweight tool for proactively catching silent errors in deep learning training runs. It detects correctness issues, such as code bugs and faulty hardware, early and pinpoints their root cause.

TrainCheck has detected silent errors in a wide range of real-world training scenarios, from large-scale LLM pretraining (such as BLOOM-176B) to small-scale tutorial runs by deep learning beginners.

📌 For a list of successful cases, see: TODO

What It Does

TrainCheck uses training invariants, which are semantic rules that describe expected behavior during training, to detect bugs as they happen. These invariants can be extracted from any correct run, including those produced by official examples and tutorials. There is no need to curate inputs or write manual assertions.

TrainCheck performs three core functions:

Instruments your training code
Inserts lightweight tracing into existing scripts (such as pytorch/examples or transformers) with minimal code changes.
Learns invariants from correct runs
Discovers expected relationships across APIs, tensors, and training steps to build a model of normal behavior.
Checks new or modified runs
Validates behavior against the learned invariants and flags silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur.

This picture illustrates the TrainCheck workflow:

Under the hood, TrainCheck decomposes into three CLI tools:

Instrumentor (traincheck-collect) Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
Inference Engine (traincheck-infer) Consumes one or more trace logs from successful runs to infer training invariants.
Checker (traincheck-check) Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.

🔥 Try TrainCheck

Work through 5‑Minute Experience with TrainCheck. You’ll learn how to:

Instrument a training script and collect a trace
Automatically infer invariants
Uncover silent bugs in the training script

Documentation

Installation Guide
Usage Guide: Scenarios and Limitations
TrainCheck Technical Doc
TrainCheck Dev RoadMap

Status

TrainCheck is under active development. Please join our 💬 Discord server or file a GitHub issue for support. We welcome feedback and contributions from early adopters.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to TrainCheck for how to get involved.

License

TrainCheck is licensed under the Apache License 2.0.

Citation

If TrainCheck is relevant to your work, please cite our paper:

@inproceedings{TrainCheckOSDI2025,
  author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
  title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
  series = {OSDI '25},
  month = {July},
  year = {2025},
  address = {Boston, MA, USA},
  publisher = {USENIX Association},
}

Artifact Evaluation

🕵️‍♀️ OSDI AE members, please see TrainCheck AE Guide.

Name		Name	Last commit message	Last commit date
Latest commit History 1,214 Commits
.github		.github
docs		docs
tests		tests
traincheck		traincheck
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
ROADMAP.md		ROADMAP.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

TrainCheck: Training with Confidence

What It Does

🔥 Try TrainCheck

Documentation

Status

Contributing

License

Citation

Artifact Evaluation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 9

Uh oh!

Languages

License

OrderLab/TrainCheck

Folders and files

Latest commit

History

Repository files navigation

TrainCheck: Training with Confidence

What It Does

🔥 Try TrainCheck

Documentation

Status

Contributing

License

Citation

Artifact Evaluation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 9

Uh oh!

Languages

Packages