These are NOT confirmed bugs. They are recurring intermittent CI failures
surfaced by a 30-day automated review, offered as a starting point for
triage. Each candidate may turn out to be a code issue (e.g. a real race),
a test issue (fragile assertion, timing assumption), or a CI/environment
issue (runner resource pressure, download flakiness, dashboard scripting).
They are listed so the failure modes can be tracked and investigated, not as
assertions of defects.
The environment in which a failure occurs is often its root cause, for example a timeout under Valgrind.
Summary
A 30-day review (2026-05-25 → 2026-06-25) of ITK CI failures across GitHub Actions
(ITK.Pixi, ITK.Arm64) and CDash (Azure DevOps + Kitware nightlies;
~18,920 failed-test rows, 676 distinct test names) found that most red CI runs are
caused by CI-environment/infrastructure problems, while a small set of tests fails
intermittently across unrelated commits and multiple platforms. This issue
tracks both so fixes/mitigations can be prioritized.
Methodology & how to read the tables
Failures were aggregated by test name × platform × commit/day. The signal that
separates a likely-flaky test from a one-off breakage:
- Cross-commit (fails on several unrelated source revisions), and/or
- Cross-platform + multi-day (fails on Linux and Windows and macOS over many distinct days).
A test failing many times on a single day / single build is almost always a
transient breakage that was then fixed — those are excluded below.
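The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the review: the `FailureRow` fields, the threshold values, and the label strings are all assumptions chosen for the example, and counting distinct commits does not by itself prove that the commits are unrelated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRow:
    """One failed-test row (hypothetical schema for this sketch)."""
    test: str
    platform: str
    commit: str
    day: str  # ISO date string


def classify(rows, min_commits=3, min_platforms=2, min_days=3):
    """Label each test name using the cross-commit / cross-platform signal.

    The thresholds are illustrative assumptions, not values from the review.
    """
    commits = defaultdict(set)
    platforms = defaultdict(set)
    days = defaultdict(set)
    for r in rows:
        commits[r.test].add(r.commit)
        platforms[r.test].add(r.platform)
        days[r.test].add(r.day)

    labels = {}
    for test in commits:
        cross_commit = len(commits[test]) >= min_commits
        cross_platform_multi_day = (
            len(platforms[test]) >= min_platforms
            and len(days[test]) >= min_days
        )
        if cross_commit or cross_platform_multi_day:
            labels[test] = "flaky-candidate"
        elif len(days[test]) == 1:
            # Many failures on one day / one build: treated as a transient.
            labels[test] = "single-day-transient"
        else:
            labels[test] = "undetermined"
    return labels


# Made-up rows: one test failing across commits and platforms,
# one test failing five times on a single build.
example_rows = [
    FailureRow("itkExampleFlakyTest", "linux", "c1", "2026-06-01"),
    FailureRow("itkExampleFlakyTest", "windows", "c2", "2026-06-05"),
    FailureRow("itkExampleFlakyTest", "macos", "c3", "2026-06-09"),
] + [
    FailureRow("itkExampleBrokenTest", "windows", "c9", "2026-06-12")
    for _ in range(5)
]
labels = classify(example_rows)
```

Raw failure counts are deliberately ignored here: a test that fails 100 times on one build carries less signal than one that fails three times on three different revisions.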
A. CI-environment / infrastructure failure modes (dominate red runs)
These are not test bugs; they prevent runs from producing a meaningful result.
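(Table of failure modes; only fragments survive, in this order: ITK.Pixi runs; exit 255 "warnings-as-fatal" (0 real test failures); ITK.Arm64 runs; ci_completed_successfully; ITK.Pixi runs; Windows11-VS22x64-RelWithDebInfo-Favorite-Remotes.)
B. Tests that fail intermittently across commits/platforms (candidates only)
| Test | Observed | Notes |
|---|---|---|
| ParallelSparseFieldLevelSetRobustness.SweepRepeat / .ConcurrentMultiPipeline | 132× / 102× over 18 days; Linux+Win+macOS incl. TSan | Most-substantiated. Test is a self-described deadlock reproducer; points at gang-scheduling assumptions in ParallelSparseFieldLevelSetImageFilter under pool/TBB backends. Investigation in progress. |
| ShapeLabelMapFixture (5 3D_* cases) | Direction cases Windows-MSVC-only (105×); 2 _Direction cases cross-platform (Win+macOS+Linux) | "Resulting value" baselines stored at only 4–5 decimals skewed the fixed 1e-4 window. PR #6521 explores both fixes: first the tolerance-plan change (relative tolerance, loosen-only), now baseline regeneration at full double precision + tightened tolerances. CI to confirm which holds on MSVC. |
| itkMeshFileReadWriteTest01 | 4 unrelated commits; ubuntu+windows | Mesh round-trip FP-formatting/precision; likely test tolerance |
| PythonLazyImportTime | 115× / 18 days; Win+Linux Python | Asserts a wall-clock threshold; fragile on shared/loaded hosts (test design) |
| MaskedAssignFixture.SetGetPrint | | |
| itkComposeBigVectorImageFilterTest | | |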
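To illustrate how a baseline stored at only a few decimals can skew a fixed 1e-4 window, here is a small worked sketch. The numeric values are invented for the example and are not taken from the ShapeLabelMapFixture baselines; the two helper functions are hypothetical and do not reproduce the comparison code in ITK or in PR #6521.

```python
def within_abs(computed, baseline, tol=1e-4):
    """Fixed absolute window around the baseline."""
    return abs(computed - baseline) <= tol


def within_rel(computed, baseline, rel_tol=1e-4):
    """Window scaled by the magnitude of the values being compared."""
    return abs(computed - baseline) <= rel_tol * max(abs(computed), abs(baseline))


# Invented numbers for illustration only.
true_value = 2.00004                  # value a reference platform computes
baseline_4dp = round(true_value, 4)   # baseline stored at 4 decimals -> 2.0
computed = true_value + 8e-5          # another platform, 8e-5 away from the truth

# Against the full-precision value, the result is inside the 1e-4 window.
ok_vs_true = within_abs(computed, true_value)

# Against the rounded baseline the same result is about 1.2e-4 away: the
# rounding error (4e-5) has shifted the window, so the check fails.
ok_vs_rounded = within_abs(computed, baseline_4dp)

# A relative window at the same nominal 1e-4 is about 2e-4 wide at this
# magnitude, so it absorbs the shift (which is why it only loosens the check).
ok_rel_vs_rounded = within_rel(computed, baseline_4dp)
```

Rounding to 4 decimals can move the baseline by up to 5e-5, which consumes half of a 1e-4 window on one side. Regenerating the baseline at full precision removes that shift, whereas a relative tolerance only widens the window.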
Explicitly excluded (single-day transients — NOT flakes)
ComputeImageSpectralDensityTest (06-12), GeneralizedEigenDecomposition.IdentityPencilMatchesStandard
(06-19), itkImageFileWriterStreamingTest1_2/_3 (06-14), and a 120-test mass
red on one Windows nightly (06-25) — each a single bad build/commit that was
fixed, not recurring flakiness.
Suggested next steps
- Treat Section A as infrastructure work (ExternalData fetch retry/caching; reconsider warnings-as-fatal on ARM); this is likely the biggest reliability payoff.
- Treat Section B as per-test investigations; confirm/deny each before any code change. Re-running a failed job and checking whether a different test fails is a quick flake/real discriminator.
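The re-run check can be expressed as a comparison of the failing-test sets from two runs of the same commit. The sketch below assumes the failed test names have already been extracted from each run (for example from CTest output); the function names and result keys are hypothetical.

```python
def compare_runs(first_failed, rerun_failed):
    """Split failures from two runs of the same commit into three groups."""
    first, rerun = set(first_failed), set(rerun_failed)
    return {
        "failed_both": sorted(first & rerun),   # reproducible: more likely real
        "only_first": sorted(first - rerun),    # disappeared on re-run
        "only_rerun": sorted(rerun - first),    # a different test failed
    }


def failing_set_changed(first_failed, rerun_failed):
    """True when the re-run failed differently, which hints at flakiness."""
    return set(first_failed) != set(rerun_failed)


# Made-up test names for illustration.
result = compare_runs(
    ["itkExampleTestA", "itkExampleTestB"],
    ["itkExampleTestB", "itkExampleTestC"],
)
```

A single re-run is only a quick discriminator: a test that fails in both runs may still be flaky with a high failure rate, so the cross-commit and multi-day signal remains the stronger evidence.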
Happy to break any individual candidate into its own focused issue once it's
confirmed.