Commit 35c076d

Merge pull request #1 from Essoz/traincheck-osdi25
[Project] TrainCheck
2 parents d0b6f1e + e54f149 commit 35c076d

8 files changed

+51
-2
lines changed

715 KB

assets/img/team/yuxuan.jpg

160 KB

index.html

Lines changed: 22 additions & 0 deletions
@@ -85,6 +85,28 @@ <h2 class="mb-3">Recent Projects</h2>
 <!-- single course -->
 <div class="col-lg-12">
 <div class="owl-theme owl-carousel active_course">
+<div class="single_recent_project">
+<div class="recent_project_head">
+<img class="img-fluid" src="/assets/img/project/traincheck_logo.png" alt="TrainCheck" />
+</div>
+<div class="recent_project_content">
+<h4 class="mb-3">
+<a href="#">Catching Silent Errors in Deep Learning Training</a>
+</h4>
+<p>
+Silent errors in deep learning training can waste
+thousands of GPU hours and produce low-quality models. We
+introduce TrainCheck, a proactive checking framework that learns
+semantic invariants from correct training runs and enforces them
+at runtime to catch failures early, before they silently
+accumulate cost and damage model reliability.
+</p>
+<div class="recent_project_meta d-flex justify-content-lg-between align-items-lg-center flex-lg-row flex-column mt-4">
+<a class="button button-light" href="paper/traincheck-osdi25-preprint.pdf" target="_blank">Read More</a>
+</div>
+</div>
+</div>
+
 <div class="single_recent_project">
 <div class="recent_project_head">
 <img class="img-fluid" src="/assets/img/project/watchdog.jpg" alt="" />

news.html

Lines changed: 8 additions & 0 deletions
@@ -5,6 +5,14 @@
 <section class="section-margin">
 <div class="container">
 <ul class="newslist">
+<li>
+<span class="newsicon"><i class="flaticon-document"></i></span><span class="newsdate">Mar 2025</span>
+<a href="https://github.com/OrderLab/TrainCheck"> TrainCheck</a> is accepted to appear at <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>
+<details>
+<summary>[...]</summary>
+Training deep learning (DL) models is a complex task involving multiple steps and various libraries, making DL training pipelines prone to silent bugs that lead to suboptimal or incorrect models. These issues are challenging to detect and diagnose. TrainCheck is the first framework that takes a proactive checking approach to systematically address silent issues. TrainCheck automatically infers invariants tailored for DL training. It uses these invariants to enhance a training task and proactively detect silent issues while providing debugging help.
+</details>
+</li>
 <li>
 <span class="newsicon"><i class="flaticon-distance"></i></span><span class="newsdate">May 2024</span>
 <span class="text-danger">Yigong will join Boston University as an Assistant Professor!</span>

paper/traincheck-osdi25-preprint.pdf

622 KB
Binary file not shown.

paper/traincheck-osdi25.bib

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+@inproceedings{TrainCheckOSDI2025,
+author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
+title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
+booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
+series = {OSDI '25},
+month = {July},
+year = {2025},
+address = {Boston, MA, USA},
+publisher = {USENIX Association},
+}

pubs.html

Lines changed: 3 additions & 2 deletions
@@ -7,9 +7,10 @@
 <h2 id="publications">2025</h2>
 <ul class="publications">
 <li>
-<a target="_blank" href="#">Training with Confidence: Catching Silent DL Training Bugs with Automated Proactive Checks</a><br>
+<a target="_blank" href="paper/traincheck-osdi25-preprint.pdf">Training with Confidence: Catching Silent DL Training Bugs with Automated Proactive Checks</a><br>
 <span class="authorlist"><i><a href="https://essoz.github.io" class="nodec">Yuxuan Jiang</a>, </i><i>Ziming Zhou, </i><i>Boyu Xu, </i><i>Beijie Liu, </i><i>Runhui Xu, </i><i><a href="https://web.eecs.umich.edu/~ryanph" class="nodec">Peng Huang</a><br></i></span>
-<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
+<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="paper/traincheck-osdi25.bib">BibTeX</a>
+&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
 </li>
 <li>
 <a target="_blank" href="#">Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems</a><br>

software.html

Lines changed: 8 additions & 0 deletions
@@ -10,6 +10,14 @@ <h3 class="section-intro__title">Group GitHub Repository</h3>
 </div>
 </div>
 </section>
+<section class="section-padding bg-magnolia">
+<div class="container">
+<div class="section-intro pb-85px text-center">
+<h3 class="section-intro__title">TrainCheck [<a href="/paper/traincheck-osdi25-preprint.pdf">OSDI '25</a>]</h3>
+<p class="section-intro__subtitle">TrainCheck is a tool for detecting silent errors in deep learning training. We are excited to open-source TrainCheck. Explore the project and get involved on <a href="https://github.com/OrderLab/TrainCheck">GitHub</a>!</p>
+</div>
+</div>
+</section>
 <section class="section-padding bg-magnolia">
 <div class="container">
 <div class="section-intro pb-85px text-center">
