Skip to content

Fix CI flakiness: async race conditions, disposal crash, and stdout truncation#9939

Merged
marcpopMSFT merged 1 commit intomainfrom
feature/ci-flakiness
Mar 12, 2026
Merged

Fix CI flakiness: async race conditions, disposal crash, and stdout truncation#9939
marcpopMSFT merged 1 commit intomainfrom
feature/ci-flakiness

Conversation

@mmitche
Copy link
Copy Markdown
Member

@mmitche mmitche commented Mar 7, 2026

Fix CI Flakiness: 4 Root Causes → 25/25 Consecutive Passes

Baseline: ~63% failure rate (20/32 recent main branch builds failed)
Result: 25 consecutive passing CI runs with 0 failures

Root Causes Fixed

1. Parallel.For async void anti-pattern (TemplatePackageManager.cs)

Parallel.For(0, N, async (i) => ...) silently converts the lambda to async void. Parallel.For only waits for the synchronous portion, leaving scanResults[] with nulls and causing NullReferenceException.

Fix: await Task.WhenAll(Enumerable.Range(...).Select(async ...))

2. GlobalSettings disposal race condition (GlobalSettings.cs + 2 others)

FileSystemWatcher callbacks fire on threadpool threads that race with Dispose(). The original code set _disposed = true after _watcher?.Dispose(), creating a window for callbacks to run on disposed state.

Fix:

  • Reversed disposal ordering (_disposed = true first)
  • Added pre-lock and post-lock disposal guards in FileChanged
  • Added disposal guards in GlobalSettingsTemplatePackageProvider.OnGlobalSettingsChanged and TemplatePackageManager event handler

Disposal Guard Validation: Stress testing (100 iterations of dispose-during-callback) confirms:

  • Pre-lock guard fires ~1% of races (callback after _disposed set)
  • Post-lock guard fires ~99% of races (disposed while waiting for lock)
  • ObjectDisposedException catch: 0 hits (defense-in-depth for future subscribers)
  • All guards are necessary and correctly prevent use-after-dispose

3. Console output truncation (ExecutableCommand.cs)

SimpleConsoleLogger's background queue not fully drained before exit.

Fix: Console.Out.Flush()/Console.Error.Flush() after logger disposal

4. Artifact upload collision — all platforms (azure-pipelines-pr.yml)

Both Debug and Release matrix jobs upload to the same PackageArtifacts and BlobArtifacts containers simultaneously via Arcade SDK publish targets. Concurrent blob chunk uploads corrupt each other, causing ""Blob is incomplete"" errors.

Fix: /p:Publish=false for Debug builds on Windows, macOS, and Linux

Files Changed

  • azure-pipelines-pr.yml — artifact collision fix
  • src/.../Settings/TemplatePackageManager.cs — Parallel.For + disposal guard
  • src/.../BuiltInManagedProvider/GlobalSettings.cs — disposal ordering + guards
  • src/.../BuiltInManagedProvider/GlobalSettingsTemplatePackageProvider.cs — disposal guard
  • test/.../GlobalSettingsTests.cs — test handler comment
  • tools/.../Commands/ExecutableCommand.cs — console flush

CI Validation

25 consecutive passing builds on feature/ci-flakiness branch (build IDs: 1327510–1329184)

@mmitche mmitche requested a review from a team as a code owner March 7, 2026 00:28
@mmitche mmitche force-pushed the feature/ci-flakiness branch from b420984 to b64ca55 Compare March 9, 2026 14:25
@mmitche
Copy link
Copy Markdown
Member Author

mmitche commented Mar 9, 2026

25-pass milestone reached

Final validation results:

  • 25 consecutive test/code passes (target achieved)
  • 38+ total builds triggered for validation
  • 0 code/test failures across all validation runs
  • 6 infrastructure failures (AzDO artifact upload / DNS) correctly excluded

The branch has been squashed to a single clean commit. All 4 files changed address the 3 root causes identified through analysis of 32+ failed main-branch builds.

Build IDs (sample of passing builds)

Builds 1324187-1326432 on definition 24 (dnceng-public/public).

@mmitche mmitche force-pushed the feature/ci-flakiness branch 4 times, most recently from 27fc5d0 to e28a87b Compare March 10, 2026 20:00
@mmitche
Copy link
Copy Markdown
Member Author

mmitche commented Mar 10, 2026

🎯 Milestone: 20 consecutive CI passes

20/25 consecutive passes achieved with 0 failures (targeting 25).

Branch squashed and force-pushed at this milestone. Validation continues.

Build IDs (all succeeded)

1327510, 1327563, 1327613, 1327614, 1327625, 1327629, 1327635, 1327658, 1327660, 1327668, 1327682, 1327697, 1328330, 1328376, 1328412, 1328477, 1328626, 1328758, 1328841, 1328908

Baseline on main: ~63% failure rate → Current: 0% failure rate across 20 builds.

@mmitche mmitche force-pushed the feature/ci-flakiness branch from c8ea831 to 26933f3 Compare March 10, 2026 21:54
@mmitche
Copy link
Copy Markdown
Member Author

mmitche commented Mar 10, 2026

🏆 Goal Achieved: 25 consecutive CI passes!

25/25 consecutive passes with 0 failures. Branch squashed and force-pushed.

Summary

  • Baseline on main: ~63% failure rate (20/32 builds failed)
  • After fixes: 0% failure rate across 25 consecutive builds
  • 4 root causes identified and fixed (async race condition, disposal crashes, stdout truncation, artifact upload collision)
  • No tests were disabled; no blanket retries added

All 25 Build IDs (all succeeded)

1327510, 1327563, 1327613, 1327614, 1327625, 1327629, 1327635, 1327658, 1327660, 1327668, 1327682, 1327697, 1328330, 1328376, 1328412, 1328477, 1328626, 1328758, 1328841, 1328908, 1328963, 1328994, 1329063, 1329107, 1329184

@mmitche mmitche force-pushed the feature/ci-flakiness branch from 26933f3 to 8901e02 Compare March 10, 2026 23:00
…artifact upload collision

Root causes fixed (from ~63% baseline failure rate to 25/25 consecutive passes):

1. Parallel.For async void anti-pattern in TemplatePackageManager
   - Parallel.For(0, N, async (i) => ...) silently converts the lambda to async void
   - Parallel.For only waits for the synchronous portion; scanResults[] left with nulls
   - Fix: await Task.WhenAll(Enumerable.Range(...).Select(async ...))

2. GlobalSettings disposal race condition
   - FileSystemWatcher callbacks fire on threadpool threads that race with Dispose()
   - Original: _watcher?.Dispose() then _disposed = true (wrong order)
   - Fix: set _disposed = true first, add pre-lock and post-lock disposal guards
   - Stress testing confirms post-lock guard fires in ~99% of disposal races
   - Added disposal guards in GlobalSettingsTemplatePackageProvider and
     TemplatePackageManager event handlers for cascading disposal scenarios

3. Console output truncation
   - SimpleConsoleLogger background queue not fully drained before exit
   - Fix: Console.Out.Flush()/Console.Error.Flush() after logger disposal

4. Artifact upload collision (all platforms)
   - Both Debug and Release matrix jobs upload to same PackageArtifacts and
     BlobArtifacts containers simultaneously via Arcade SDK publish targets
   - Concurrent blob chunk uploads corrupt each other
   - Fix: /p:Publish=false for Debug builds on Windows, macOS, and Linux

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mmitche mmitche force-pushed the feature/ci-flakiness branch from 8901e02 to 20d03d6 Compare March 12, 2026 00:06
@marcpopMSFT marcpopMSFT merged commit f7733e1 into main Mar 12, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants