ci(windows): retry MSYS2 S3 fetch on transient 404 / connection blips#5995

Merged
Fedr merged 3 commits into master from ci/windows-msys2-download-retry
Apr 27, 2026

Fedr commented Apr 27, 2026

Problem

The "Install MSYS2 for MRBind" step on Windows downloads https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip via PowerShell's Net.WebClient.DownloadFile. The bare call has been observed failing with (404) Not Found on a perfectly valid asset — transient S3 / edge-cache blip — failing the whole windows-build-test job ~5 minutes into the run.

Most recent example: PR #5994's CI run, job 73175998223:

Exception calling "DownloadFile" with "2" argument(s):
  "The remote server returned an error: (404) Not Found."
##[error]Process completed with exit code 1.

PR #5994 only touches macOS-specific files; the Windows job dying on a missing-S3-object is unrelated to that PR's diff. The same failure has presumably been hitting other PRs that touch unrelated areas — it's an infrastructure flake, not a code change.

Fix

Two parts:

  1. Wrap the DownloadFile call in a 5-attempt retry loop with exponential backoff (10 s → 20 s → 40 s → 80 s between attempts, ~150 s of waiting in the worst case before giving up). If the final attempt fails, the original exception is re-thrown so genuinely missing assets still fail loudly (with the same WebException message, just after $maxAttempts - 1 retries instead of 0). Transient blips that recover within ~2.5 minutes get absorbed.

  2. Extract the snippet into a composite action at .github/actions/install-msys2-mrbind/action.yml. The same retry-and-extract block was needed in two workflows (build-test-windows.yml and pip-build.yml); putting it behind one uses: indirection means future tweaks (retry tuning, checksum verification, mirror fallback) land in one place.
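The backoff figures in part 1 can be checked with a few lines of arithmetic. The following is a minimal Python sketch (not the PowerShell used in the action); `backoff_delays` and its parameter names are hypothetical and exist only for illustration. It assumes a sleep happens after every failed attempt except the last, which matches the loop described above.

```python
def backoff_delays(max_attempts: int = 5, initial_delay: int = 10, factor: int = 2) -> list[int]:
    """Return the sleep durations (seconds) between attempts.

    No sleep follows the final attempt, so there are max_attempts - 1 entries.
    """
    delays = []
    delay = initial_delay
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay *= factor  # double the wait after each failed attempt
    return delays


delays = backoff_delays()
print(delays)       # [10, 20, 40, 80]
print(sum(delays))  # 150
```

With the defaults from the PR (5 attempts, 10 s initial delay), the total wait before the final failure is 150 s, excluding the time spent in the download attempts themselves.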

Composite action

.github/actions/install-msys2-mrbind/action.yml:

name: 'Install MSYS2 for MRBind'
description: 'Download the MeshLib MSYS2 archive from S3 with retry on transient failures and extract to C:\'
runs:
  using: 'composite'
  steps:
    - name: Install MSYS2 for MRBind
      shell: pwsh
      run: |
        $url = "https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip"
        $dest = "./msys64_meshlib_mrbind.zip"
        $maxAttempts = 5
        $delay = 10
        for ($i = 1; $i -le $maxAttempts; $i++) {
          try {
            (New-Object Net.WebClient).DownloadFile($url, $dest)
            break
          } catch {
            if ($i -eq $maxAttempts) { throw }
            Write-Host "Download attempt $i failed: $($_.Exception.Message). Retrying in $delay s..."
            Start-Sleep -Seconds $delay
            $delay *= 2
          }
        }
        [IO.Compression.ZipFile]::ExtractToDirectory($dest, "C:\")
        rm $dest

Naming and structure match the four existing composite actions under .github/actions/ (get-aws-instance-type, collect-runner-stats, collect-artifact-stats, python-regression-tests).
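For readers less familiar with PowerShell, the control flow of the retry loop in the action can be sketched in Python. This is an illustrative translation under stated assumptions, not the code that runs in CI: `retry_with_backoff`, `flaky_download`, and the injectable `sleep` parameter are hypothetical names added so the behaviour can be exercised without real waiting or network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on failure, wait and retry, doubling the delay each time."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # final attempt: re-raise the original exception unchanged
            print(f"Download attempt {attempt} failed: {exc}. Retrying in {delay} s...")
            sleep(delay)
            delay *= 2
    raise ValueError("max_attempts must be at least 1")


# Demo with a stand-in for the download that fails twice, then succeeds.
# The sleep function is replaced by list.append so nothing actually waits.
demo_state = {"calls": 0}
demo_slept: list = []


def flaky_download() -> str:
    demo_state["calls"] += 1
    if demo_state["calls"] <= 2:
        raise ConnectionError("(404) Not Found")
    return "ok"


result = retry_with_backoff(flaky_download, sleep=demo_slept.append)
print(result, demo_slept)  # ok [10, 20]
```

As in the PowerShell version, a success on the first attempt produces no log output at all, and a failure on the last attempt surfaces the original exception rather than a wrapper.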

Call sites

Both build-test-windows.yml and pip-build.yml shrink from a 21-line named step (with the inline run: | block) to a 3-line step with uses:

- name: Install MSYS2 for MRBind
  if: ${{ inputs.mrbind || (inputs.mrbind_c && matrix.build_system == 'CMake') || env.BUILD_C_SHARP == 'true' }}
  uses: ./.github/actions/install-msys2-mrbind

The if: condition stays at each call site — composite actions can't gate themselves at the top level, so the workflow-side guard is what keeps the action from running on matrix branches that don't need MRBind/MSYS2.

Why not Invoke-WebRequest -MaximumRetryCount?

PowerShell 6+'s Invoke-WebRequest has built-in retry via -MaximumRetryCount and -RetryIntervalSec, but it only retries on 408 / 429 / 5xx — a 404 from S3 is a 4xx and wouldn't be retried natively. So manual try/catch is required regardless. Keeping Net.WebClient matches the existing call shape; the only change is the retry envelope.

Diff

3 files, +29 / −40. New .github/actions/install-msys2-mrbind/action.yml (27 lines) replaces the duplicated 18-line inline block at two call sites.

CI verification

Run 24992139024 on the inline-snippet variant of this PR: all 4 Windows legs green. The retry loop's pwsh code was printed in the step log (so the script ran), but no "Download attempt N failed" message appeared, meaning the S3 fetch succeeded on the first attempt. The retry envelope is invisible when the network behaves; only a failed attempt produces log output. The composite-action refactor is functionally identical (same shell, same code, just relocated), so the same green outcome is expected.

Labels

Disabled non-Windows platforms (this PR is Windows-only).

🤖 Generated with Claude Code

The "Install MSYS2 for MRBind" step downloads
https://vcpkg-export.s3.us-east-1.amazonaws.com/msys64_meshlib_mrbind.zip
via PowerShell's Net.WebClient.DownloadFile. We've seen the bare call
fail with `(404) Not Found` on a perfectly valid asset (transient S3 /
edge-cache blip), failing the whole windows-build-test job ~5 minutes
into the run.

Wrap the call in a 5-attempt loop with exponential backoff (10s -> 20s
-> 40s -> 80s, capped by the 5-attempt count, ~150s worst case before
giving up). On the 5th failure the original exception is re-thrown so
genuinely-missing assets still fail loudly.

Same edit at both download sites: build-test-windows.yml and the
identical step in pip-build.yml.

Net.WebClient is kept rather than swapping to Invoke-WebRequest
-MaximumRetryCount because PowerShell 6+'s built-in retry only fires
on 408 / 429 / 5xx -- a 404 from S3 wouldn't be retried natively even
with that flag, so manual try/catch is required regardless.

The PowerShell retry-and-extract block was duplicated verbatim in
build-test-windows.yml and pip-build.yml. Move it to
.github/actions/install-msys2-mrbind/action.yml so future tweaks
(retry tuning, checksum verification, mirror fallback, etc.) only
need to land in one place.

The if: condition stays at each call site (composite actions can't
gate themselves at top level). The inline 18-line block in each
workflow becomes a single uses: line.
Fedr merged commit c9b8226 into master Apr 27, 2026
25 checks passed
Fedr deleted the ci/windows-msys2-download-retry branch April 27, 2026 14:00