Track potential intermittent/flaky CI test failures (NOT confirmed bugs — code/CI/environment triage) #6518

Description

@hjmjohnson

Important

These are NOT confirmed bugs. They are recurring intermittent CI failures
surfaced by a 30-day automated review, offered as a starting point for
triage. Each candidate may turn out to be a code issue (e.g. a real race),
a test issue (fragile assertion, timing assumption), or a CI/environment
issue (runner resource pressure, download flakiness, dashboard scripting).
They are listed so the failure modes can be tracked and investigated, not as
assertions of defects.

The environment is often the root cause of these failures, for example timeouts when a test runs under Valgrind or similar instrumentation.

Summary

A 30-day review (2026-05-25 → 2026-06-25) of ITK CI failures across GitHub Actions
(ITK.Pixi, ITK.Arm64) and CDash (Azure DevOps + Kitware nightlies;
~18,920 failed-test rows, 676 distinct test names) found that most red CI runs
are caused by CI-environment/infrastructure problems, while a small set of tests
fails intermittently across unrelated commits and multiple platforms. This issue
tracks both so fixes/mitigations can be prioritized.

Methodology & how to read the tables

Failures were aggregated by test name × platform × commit/day. The signal that
separates a likely-flaky test from a one-off breakage:

  • Cross-commit (fails on several unrelated source revisions) and/or
  • Cross-platform + multi-day (fails on Linux and Windows and macOS over
    many distinct days).

A test failing many times on a single day / single build is almost always a
transient breakage that was then fixed; such tests are left out of Section B
and listed under "Explicitly excluded".
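
The aggregation rule above can be sketched as follows. This is a minimal illustration, not the script used for the review: the thresholds (`MIN_COMMITS`, `MIN_PLATFORMS`, `MIN_DAYS`), the row layout, and the test names in the sample data are all assumptions made for the example.

```python
from collections import defaultdict

# Illustrative thresholds (assumptions): the review only says "several"
# revisions and "many" distinct days.
MIN_COMMITS = 3
MIN_PLATFORMS = 2
MIN_DAYS = 3


def classify_failures(rows):
    """Classify tests from (test, platform, commit, day) failure rows."""
    commits = defaultdict(set)
    platforms = defaultdict(set)
    days = defaultdict(set)
    for test, platform, commit, day in rows:
        commits[test].add(commit)
        platforms[test].add(platform)
        days[test].add(day)

    verdicts = {}
    for test in commits:
        # Single day or single build: treated as a transient breakage.
        if len(days[test]) == 1 or len(commits[test]) == 1:
            verdicts[test] = "transient"
            continue
        cross_commit = len(commits[test]) >= MIN_COMMITS
        cross_platform_multi_day = (
            len(platforms[test]) >= MIN_PLATFORMS
            and len(days[test]) >= MIN_DAYS
        )
        if cross_commit or cross_platform_multi_day:
            verdicts[test] = "flaky-candidate"
        else:
            verdicts[test] = "unclassified"
    return verdicts


# Hypothetical sample rows (not taken from the CDash data).
sample_rows = [
    ("TestA", "linux", "c1", "2026-06-01"),
    ("TestA", "windows", "c2", "2026-06-03"),
    ("TestA", "macos", "c3", "2026-06-07"),
    ("TestB", "windows", "c9", "2026-06-12"),
    ("TestB", "windows", "c9", "2026-06-12"),
]
sample_verdicts = classify_failures(sample_rows)
```

Here `TestA` fails on three commits, three platforms and three days, so it is flagged as a candidate, while `TestB` fails repeatedly on one build on one day and is treated as a transient.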

A. CI-environment / infrastructure failure modes (dominate red runs)

These are not test bugs; they prevent runs from producing a meaningful result.

| Mode | Observed | Likely layer |
|---|---|---|
| ExternalData download failures (CID/tarball fetch) | ~39% of red ITK.Pixi runs | CI/network/cache |
| Dashboard exit 255 "warnings-as-fatal" (0 real test failures) | ~74% of red ITK.Arm64 runs | CI scripting (ci_completed_successfully) |
| Runner setup / apt-disk / tool-setup (exit 127) | several ITK.Pixi runs | CI/runner |
| One Windows nightly config red for many days (Windows11-VS22x64-RelWithDebInfo-Favorite-Remotes) | ~32 tests × 15 days | single machine/config |

B. Tests that fail intermittently across commits/platforms (candidates only)

| Candidate | Evidence | Working hypothesis (unconfirmed) |
|---|---|---|
| ParallelSparseFieldLevelSetRobustness.SweepRepeat / .ConcurrentMultiPipeline | 132× / 102× over 18 days; Linux+Win+macOS incl. TSan | Most-substantiated. Test is a self-described deadlock reproducer; points at gang-scheduling assumptions in ParallelSparseFieldLevelSetImageFilter under pool/TBB backends. Investigation in progress. |
| ShapeLabelMapFixture (5 3D_* cases) | 187× / 30 days. Two modes: 3 non-Direction cases Windows-MSVC-only (105×); 2 _Direction cases cross-platform (Win+macOS+Linux) | "Resulting value" baselines stored at only 4–5 decimals skewed the fixed 1e-4 window. PR #6521 explores both fixes: first the tolerance-plan change (relative tolerance, loosen-only), now baseline regeneration at full double precision + tightened tolerances. CI to confirm which holds on MSVC. |
| itkMeshFileReadWriteTest01 | 4 unrelated commits; ubuntu+windows | Mesh round-trip FP-formatting/precision; likely test tolerance |
| PythonLazyImportTime | 115× / 18 days; Win+Linux Python | Asserts a wall-clock threshold; fragile on shared/loaded hosts (test design) |
| MaskedAssignFixture.SetGetPrint | 276× / 18 days; concentrated in valgrind/coverage | Instrumentation slowdown/timeout; environment interaction |
| itkComposeBigVectorImageFilterTest | 92× / 18 days; mostly coverage builds | Resource/time under coverage instrumentation |
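
The ShapeLabelMapFixture hypothesis can be illustrated with a little arithmetic. A baseline rounded to 4 decimals can be off by up to 5e-5, which uses up half of a fixed 1e-4 absolute window before any platform difference is counted. The sketch below is not the ITK comparison code; the numeric values and the relative tolerance `REL_TOL` are made up for illustration.

```python
import math

ABS_TOL = 1e-4   # fixed absolute window mentioned in the table
REL_TOL = 1e-3   # hypothetical relative tolerance, for comparison only

true_value = 0.123449           # hypothetical full-precision reference
stored = round(true_value, 4)   # baseline as stored at 4 decimals (0.1234)
computed = true_value + 6e-5    # hypothetical platform result, 6e-5 off

# Error introduced purely by storing the baseline at 4 decimals.
rounding_error = abs(stored - true_value)

# Against the full-precision reference the result is inside the window ...
ok_vs_full = math.isclose(computed, true_value, rel_tol=0.0, abs_tol=ABS_TOL)
# ... but against the rounded baseline the same result falls outside it.
ok_vs_stored = math.isclose(computed, stored, rel_tol=0.0, abs_tol=ABS_TOL)
# A relative tolerance of this (assumed) size would accept it again.
ok_relative = math.isclose(computed, stored, rel_tol=REL_TOL, abs_tol=0.0)

print(f"rounding error of stored baseline: {rounding_error:.2e}")
print(f"passes vs full-precision baseline: {ok_vs_full}")
print(f"passes vs 4-decimal baseline:      {ok_vs_stored}")
print(f"passes with relative tolerance:    {ok_relative}")
```

In this made-up case the computed value is within 1e-4 of the reference but about 1.09e-4 from the rounded baseline, so the fixed window rejects it. Both directions explored in PR #6521 (regenerating baselines at full precision, or moving to a relative tolerance) address that gap in different ways.
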

Explicitly excluded (single-day transients, NOT flakes)

ComputeImageSpectralDensityTest (06-12), GeneralizedEigenDecomposition.IdentityPencilMatchesStandard
(06-19), itkImageFileWriterStreamingTest1_2/_3 (06-14), and a 120-test mass
red on one Windows nightly (06-25) — each a single bad build/commit that was
fixed, not recurring flakiness.

Suggested next steps

  • Treat Section A as infrastructure work (ExternalData fetch retry/caching;
    reconsider warnings-as-fatal on ARM) — likely the biggest reliability payoff.
  • Treat Section B as per-test investigations; confirm/deny each before any
    code change. Re-running a failed job and checking whether a different test
    fails is a quick flake/real discriminator.
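
The re-run check in the second bullet can be sketched as a comparison of the failed-test sets from two runs of the same commit. This is only a heuristic; the function name `rerun_verdict` and the verdict labels are hypothetical, and the inputs are assumed to be lists of failed test names collected from the original run and its re-run.

```python
def rerun_verdict(first_failures, rerun_failures):
    """Compare failed-test names from a run and its re-run on the same commit."""
    first, rerun = set(first_failures), set(rerun_failures)
    report = {
        "persistent": sorted(first & rerun),  # failed both times
        "vanished": sorted(first - rerun),    # failed only in the first run
        "new": sorted(rerun - first),         # failed only in the re-run
    }
    if not first:
        report["verdict"] = "no-failures"
    elif report["vanished"] or report["new"]:
        # The failure set changed between identical runs: points at flakiness.
        report["verdict"] = "likely-flaky"
    else:
        # The same tests failed again: points at a real breakage.
        report["verdict"] = "likely-real"
    return report


# Hypothetical test names for illustration.
same = rerun_verdict(["TestA"], ["TestA"])
changed = rerun_verdict(["TestA"], ["TestB"])
```

A "likely-real" verdict from a single re-run is weak evidence on its own, since a flaky test can fail twice in a row; the cross-commit and multi-day evidence in Section B remains the stronger signal.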

Happy to break any individual candidate into its own focused issue once it's
confirmed.
