Track potential intermittent/flaky CI test failures (NOT confirmed bugs — code/CI/environment triage)

> [!IMPORTANT]
> **These are NOT confirmed bugs.** They are recurring *intermittent* CI failures
> surfaced by a 30-day automated review, offered as a starting point for
> triage. Each candidate may turn out to be a **code** issue (e.g. a real race),
> a **test** issue (fragile assertion, timing assumption), or a **CI/environment**
> issue (runner resource pressure, download flakiness, dashboard scripting).
> They are listed so the failure modes can be tracked and investigated, not as
> assertions of defects.

> The environment in which a failure occurs is often the root cause, for example timeouts under Valgrind or similar instrumentation.

## Summary

A 30-day review (2026-05-25 → 2026-06-25) of ITK CI failures across **GitHub Actions**
(`ITK.Pixi`, `ITK.Arm64`) and **CDash** (Azure DevOps + Kitware nightlies;
~18,920 failed-test rows, 676 distinct test names) found that most red CI runs
trace to **CI-environment/infrastructure** problems, while a small set of tests
fails intermittently across *unrelated commits and multiple platforms*. This issue
tracks both so that fixes and mitigations can be prioritized.

<details>
<summary>Methodology &amp; how to read the tables</summary>

Failures were aggregated by test name × platform × commit/day. The signal that
separates a likely-flaky test from a one-off breakage:
- **Cross-commit** (fails on several unrelated source revisions) and/or
- **Cross-platform + multi-day** (fails on Linux *and* Windows *and* macOS over
  many distinct days).

A test failing many times on a **single day / single build** is almost always a
transient breakage that was then fixed — those are excluded below.
</details>
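
The selection rule described in the methodology can be sketched roughly as follows. This is an illustration only, not the script used for the review: the `Failure` record, the `flaky_candidates` helper, and the thresholds (`min_commits`, `min_platforms`, `min_days`) are all assumptions chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Failure(NamedTuple):
    """One failed-test row (hypothetical schema for this sketch)."""
    test: str
    platform: str  # e.g. "linux", "windows", "macos"
    commit: str
    day: str       # ISO date


def flaky_candidates(failures, min_commits=3, min_platforms=2, min_days=3):
    """Return test names matching the cross-commit or cross-platform signal.

    The thresholds are assumed values, not the ones used in the review.
    """
    commits = defaultdict(set)
    platforms = defaultdict(set)
    days = defaultdict(set)
    for f in failures:
        commits[f.test].add(f.commit)
        platforms[f.test].add(f.platform)
        days[f.test].add(f.day)

    candidates = []
    for test in commits:
        # Failures confined to a single day are treated as transient breakage.
        if len(days[test]) == 1:
            continue
        cross_commit = len(commits[test]) >= min_commits
        cross_platform = (len(platforms[test]) >= min_platforms
                          and len(days[test]) >= min_days)
        if cross_commit or cross_platform:
            candidates.append(test)
    return sorted(candidates)


# Made-up rows: one test failing on three commits/days/platforms, and one
# test failing on three platforms but only on a single day and commit.
rows = [
    Failure("FlakyTest", "linux", "a1", "2026-06-01"),
    Failure("FlakyTest", "windows", "b2", "2026-06-03"),
    Failure("FlakyTest", "macos", "c3", "2026-06-07"),
    Failure("BrokenOnce", "linux", "d4", "2026-06-12"),
    Failure("BrokenOnce", "windows", "d4", "2026-06-12"),
    Failure("BrokenOnce", "macos", "d4", "2026-06-12"),
]
candidates = flaky_candidates(rows)
```

With these made-up rows only `FlakyTest` is kept; `BrokenOnce` is dropped by the single-day rule even though it failed on three platforms.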

## A. CI-environment / infrastructure failure modes (dominate red *runs*)

These are not test bugs; they prevent runs from producing a meaningful result.

| Mode | Observed | Likely layer |
|---|---|---|
| ExternalData download failures (CID/tarball fetch) | ~39% of red `ITK.Pixi` runs | CI/network/cache |
| Dashboard `exit 255` "warnings-as-fatal" (0 real test failures) | ~74% of red `ITK.Arm64` runs | CI scripting (`ci_completed_successfully`) |
| Runner setup / apt-disk / tool-setup (exit 127) | several `ITK.Pixi` runs | CI/runner |
| One Windows nightly config red for many days (`Windows11-VS22x64-RelWithDebInfo-Favorite-Remotes`) | ~32 tests × 15 days | single machine/config |
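
For the download-failure mode in the first row, one common mitigation is to retry with exponential backoff. The sketch below is generic Python and does not describe how ITK's ExternalData fetch works; `fetch_with_retry`, its parameters, and the `fetch` callable are hypothetical names used only for illustration.

```python
import time


def fetch_with_retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are exhausted.

    Waits base_delay * 2**n seconds after the n-th failure (n starting at 0).
    Only OSError is retried; ConnectionError, TimeoutError and
    urllib.error.URLError are all subclasses of it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            if attempt < attempts - 1:  # no wait after the final failure
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Simulated download that fails twice before succeeding (made-up behaviour).
calls = {"count": 0}


def unreliable_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return b"payload"


delays = []  # record the waits instead of actually sleeping
data = fetch_with_retry(unreliable_download, sleep=delays.append)
```

Injecting `sleep` keeps the example fast to run; a real job would use the default `time.sleep` and would likely pair retries with a cache so that repeated runs do not fetch the same data again.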

## B. Tests that fail intermittently across commits/platforms (candidates only)

| Candidate | Evidence | Working hypothesis (unconfirmed) |
|---|---|---|
| `ParallelSparseFieldLevelSetRobustness.SweepRepeat` / `.ConcurrentMultiPipeline` | 132× / 102× over 18 days; Linux+Win+macOS incl. TSan | **Most-substantiated.** Test is a self-described deadlock reproducer; points at gang-scheduling assumptions in `ParallelSparseFieldLevelSetImageFilter` under pool/TBB backends. Investigation in progress. |
| `ShapeLabelMapFixture` (5 `3D_*` cases) | 187× / 30 days. Two modes: 3 non-`Direction` cases **Windows-MSVC-only** (105×); 2 `_Direction` cases **cross-platform** (Win+macOS+Linux) | "Resulting value" baselines stored at only 4–5 decimals skewed the fixed `1e-4` window. **PR #6521** explores both fixes: first the *tolerance-plan* change (relative tolerance, loosen-only), now *baseline regeneration* at full `double` precision + tightened tolerances. CI to confirm which holds on MSVC. |
| `itkMeshFileReadWriteTest01` | 4 unrelated commits; ubuntu+windows | Mesh round-trip FP-formatting/precision — likely test tolerance |
| `PythonLazyImportTime` | 115× / 18 days; Win+Linux Python | Asserts a wall-clock threshold — fragile on shared/loaded hosts (test design) |
| `MaskedAssignFixture.SetGetPrint` | 276× / 18 days; concentrated in valgrind/coverage | Instrumentation slowdown/timeout — environment interaction |
| `itkComposeBigVectorImageFilterTest` | 92× / 18 days; mostly coverage builds | Resource/time under coverage instrumentation |
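
To make the `ShapeLabelMapFixture` hypothesis concrete: rounding a baseline to 4 decimals can by itself use up to `5e-5` of a fixed `1e-4` window, so a result that is within tolerance of the true value can still fail against the stored one. The numbers below are invented for illustration and are not taken from the ITK baselines; the relative tolerance of `1e-3` is likewise an assumed value, not the one proposed in the PR.

```python
import math

ABS_TOL = 1e-4  # the fixed window mentioned in the table

reference = 0.123449          # hypothetical full-precision expected value
stored = round(reference, 4)  # baseline as stored at 4 decimals
computed = reference + 6e-5   # hypothetical result on another platform

# Against the full-precision reference, the result is inside the window.
passes_vs_reference = abs(computed - reference) <= ABS_TOL

# Against the rounded baseline, the rounding error adds to the platform
# difference and pushes the same result outside the window.
passes_vs_stored = abs(computed - stored) <= ABS_TOL

# A relative check (assumed rel_tol) scales the window with the magnitude.
passes_relative = math.isclose(computed, stored, rel_tol=1e-3)
```

Here `passes_vs_reference` is true while `passes_vs_stored` is false, which is the skew the hypothesis describes. Whether loosening the tolerance or regenerating the baselines at full precision is the better fix is what the PR is meant to settle.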

<details>
<summary>Explicitly excluded (single-day transients — NOT flakes)</summary>

`ComputeImageSpectralDensityTest` (06-12), `GeneralizedEigenDecomposition.IdentityPencilMatchesStandard`
(06-19), `itkImageFileWriterStreamingTest1_2/_3` (06-14), and a 120-test mass
red on one Windows nightly (06-25) — each a single bad build/commit that was
fixed, not recurring flakiness.
</details>

## Suggested next steps

- Treat **Section A** as infrastructure work (ExternalData fetch retry/caching;
  reconsider warnings-as-fatal on ARM) — likely the biggest reliability payoff.
- Treat **Section B** as per-test investigations; confirm/deny each before any
  code change. Re-running a failed job and checking whether a *different* test
  fails is a quick flake/real discriminator.
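
The re-run check in the last bullet amounts to comparing the sets of failed tests from two runs of the same commit. A minimal sketch, assuming the failed test names have already been extracted from each run (the `classify_rerun` helper and the test names are hypothetical):

```python
def classify_rerun(first_failed, second_failed):
    """Compare the failed-test names from two runs of the same commit."""
    first, second = set(first_failed), set(second_failed)
    return {
        # Failed both times: more likely a real, deterministic failure.
        "repeat": sorted(first & second),
        # Failed in only one of the runs: flake suspects.
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


# Made-up test names for illustration.
result = classify_rerun(["TestA", "TestB"], ["TestB", "TestC"])
```

In this example `TestB` repeats and deserves a closer look at the code, while `TestA` and `TestC` each failed only once and are better explained by flakiness or the environment. A single re-run is weak evidence either way, so this is a first filter rather than a verdict.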

Happy to break any individual candidate into its own focused issue once it's
confirmed.


