ci(windows): retry MSYS2 S3 fetch on transient 404 / connection blips #5995
Merged
…lips
The "Install MSYS2 for MRBind" step downloads https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip via PowerShell's Net.WebClient.DownloadFile. We've seen the bare call fail with `(404) Not Found` on a perfectly valid asset (transient S3 / edge-cache blip), failing the whole windows-build-test job ~5 minutes into the run.
Wrap the call in a 5-attempt loop with exponential backoff (10s -> 20s -> 40s -> 80s, capped by the 5-attempt count, ~150s worst case before giving up). On the 5th failure the original exception is re-thrown so genuinely-missing assets still fail loudly.
Same edit at both download sites (build-test-windows.yml and the identical step in pip-build.yml's macos-pip-build leg... actually pip-build's manylinux/macos-pip-build legs share the snippet).
Net.WebClient is kept rather than swapping to Invoke-WebRequest -MaximumRetryCount because PowerShell 6+'s built-in retry only fires on 408 / 429 / 5xx -- a 404 from S3 wouldn't be retried natively even with that flag, so manual try/catch is required regardless.
Grantim
approved these changes
Apr 27, 2026
adalisk-emikhaylov
approved these changes
Apr 27, 2026
The PowerShell retry-and-extract block was duplicated verbatim in build-test-windows.yml and pip-build.yml. Move it to .github/actions/install-msys2-mrbind/action.yml so future tweaks (retry tuning, checksum verification, mirror fallback, etc.) only need to land in one place. The if: condition stays at each call site (composite actions can't gate themselves at top level). The inline 18-line block in each workflow becomes a single uses: line.
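For illustration, a call site after this move might look roughly like the fragment below. This is a sketch, not the diff from this PR: the `if:` expression is a placeholder (the real guard is whatever each workflow already used), and `matrix.mrbind` is a hypothetical name. The `./` prefix is the standard way to reference an action stored in the same repository, so the repository has to be checked out before this step runs.

```yaml
# Fragment of a job's `steps:` list (hypothetical sketch, not the PR's actual diff).
- name: Install MSYS2 for MRBind
  # Placeholder guard: composite actions cannot gate themselves,
  # so the condition lives here in the calling workflow.
  if: ${{ matrix.mrbind }}
  # Local action reference: path is relative to the repository root.
  uses: ./.github/actions/install-msys2-mrbind
```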
Problem
The "Install MSYS2 for MRBind" step on Windows downloads `https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip` via PowerShell's `Net.WebClient.DownloadFile`. The bare call has been observed failing with `(404) Not Found` on a perfectly valid asset (a transient S3 / edge-cache blip), failing the whole `windows-build-test` job ~5 minutes into the run.
Most recent example: PR #5994's CI run, job 73175998223.
PR #5994 only touches macOS-specific files; the Windows job dying on a missing-S3-object is unrelated to that PR's diff. The same failure has presumably been hitting other PRs that touch unrelated areas — it's an infrastructure flake, not a code change.
Fix
Two parts:
1. Wrap the `DownloadFile` call in a 5-attempt retry loop with exponential backoff (10 s → 20 s → 40 s → 80 s, ~150 s worst case before giving up). On the final attempt's failure the original exception is re-thrown so genuinely-missing assets still fail loudly (with the same `WebException` message, just after `$maxAttempts - 1` retries instead of 0). Transient blips that recover within ~2 minutes get absorbed.
2. Extract the snippet into a composite action at `.github/actions/install-msys2-mrbind/action.yml`. The same retry-and-extract block was needed in two workflows (`build-test-windows.yml` and `pip-build.yml`); putting it behind one `uses:` indirection means future tweaks (retry tuning, checksum verification, mirror fallback) land in one place.
Composite action
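A hedged sketch of how such an action could be laid out, reconstructed from the description above rather than copied from the PR. The URL, the 5-attempt count, the 10 s starting delay, the re-throw on the last attempt, and the `Download attempt N failed` log prefix come from this PR's text; the `description` string, the temp-file location, the inner step name, the use of `Expand-Archive`, and the `C:\` extraction target are assumptions made only to keep the example complete.

```yaml
# Hypothetical sketch of .github/actions/install-msys2-mrbind/action.yml.
# Not the file from this PR; see the assumptions listed in the lead-in.
name: Install MSYS2 for MRBind
description: Download the prebuilt MSYS2 archive from S3 with retries, then extract it
runs:
  using: composite
  steps:
    - name: Download and extract MSYS2   # assumed step name
      shell: pwsh                        # `shell` is mandatory for run steps in composite actions
      run: |
        $url = 'https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip'
        $zip = Join-Path $env:RUNNER_TEMP 'msys64_meshlib_mrbind.zip'  # assumed temp location
        $maxAttempts = 5
        $delaySec = 10   # doubles after each failure: 10, 20, 40, 80 (150 s total)
        for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
          try {
            (New-Object Net.WebClient).DownloadFile($url, $zip)
            break        # success: leave the retry loop
          } catch {
            # Final attempt: re-throw the original error so a genuinely
            # missing asset still fails the job.
            if ($attempt -eq $maxAttempts) { throw }
            Write-Host "Download attempt $attempt failed: $($_.Exception.Message); retrying in $delaySec s"
            Start-Sleep -Seconds $delaySec
            $delaySec *= 2
          }
        }
        # Assumed extraction step and target directory.
        Expand-Archive -Path $zip -DestinationPath 'C:\' -Force
```

With these values the loop sleeps at most four times (10 + 20 + 40 + 80 = 150 s) before the fifth failure propagates, which matches the worst case quoted in this PR.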
The action is defined in `.github/actions/install-msys2-mrbind/action.yml`. Naming and structure match the four existing composite actions under `.github/actions/` (`get-aws-instance-type`, `collect-runner-stats`, `collect-artifact-stats`, `python-regression-tests`).
Call sites
Both `build-test-windows.yml` and `pip-build.yml` shrink from a 21-line named step (with the inline `run: |` block) to a 3-line `uses:` step.
The `if:` condition stays at each call site: composite actions can't gate themselves at the top level, so the workflow-side guard is what keeps the action from running on matrix branches that don't need MRBind/MSYS2.
Why not `Invoke-WebRequest -MaximumRetryCount`?
PowerShell 6+'s `Invoke-WebRequest` has built-in retry via `-MaximumRetryCount` and `-RetryIntervalSec`, but it only retries on 408 / 429 / 5xx; a 404 from S3 is a 4xx and wouldn't be retried natively. So manual try/catch is required regardless. Keeping `Net.WebClient` matches the existing call shape; the only change is the retry envelope.
Diff
3 files, +29 / −40. New `.github/actions/install-msys2-mrbind/action.yml` (27 lines) replaces the duplicated 18-line inline block at two call sites.
CI verification
Run 24992139024 on the inline-snippet variant of this PR: all 4 Windows legs green. The retry loop's pwsh code was printed in the step log (so the script ran), but there was no `Download attempt N failed` message; the S3 fetch succeeded on the first attempt. The retry envelope is invisible when the network behaves; only a failed attempt produces log output. The composite-action refactor is functionally identical (same shell, same code, just relocated), so the same green outcome is expected.
Labels
Disabled non-Windows platforms (this PR is Windows-only).
🤖 Generated with Claude Code