Fix rare SIGABRTs due to missing libgcc_s.so.1#765

Merged
jaybosamiya-ms merged 2 commits into main from jayb/fix-unwinding-flakiness
Apr 13, 2026
@jaybosamiya-ms
Member

We have a somewhat rare CI flake in the iperf3 test. The cause is that pthread_cancel does a dlopen of libgcc_s.so.1, which does not exist inside the sandbox (because ldd does not report it for iperf3). In rare cases, this causes a SIGABRT. As I understand it, the abort is triggered only if the iperf3 threads have not spun down fast enough, which is why the failure is flaky rather than consistent.

Relevant reference: https://sourceware.org/pipermail/libc-help/2009-October/001071.html
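
To make the gap concrete, here is a minimal sketch (not the code in this PR) of checking an `ldd`-derived dependency set against libraries that glibc may `dlopen` lazily. The names `RUNTIME_LOADED`, `parse_ldd_paths`, and `missing_runtime_libs` are hypothetical, and the one-entry list is an assumption based only on the library named above.

```rust
use std::collections::BTreeSet;

/// Libraries that may be dlopen'ed at runtime and so never show up in `ldd`.
/// Assumption: illustrative list, containing only the library from this PR.
const RUNTIME_LOADED: &[&str] = &["libgcc_s.so.1"];

/// Extract the resolved absolute paths from `ldd`-style output.
/// Handles both "name => /path (addr)" and "/path (addr)" lines.
fn parse_ldd_paths(ldd_output: &str) -> BTreeSet<String> {
    let mut paths = BTreeSet::new();
    for line in ldd_output.lines() {
        let line = line.trim();
        let candidate = match line.split_once(" => ") {
            Some((_, rhs)) => rhs,
            None => line,
        };
        if let Some(path) = candidate.split_whitespace().next() {
            // Skips the vdso entry and "not found" results.
            if path.starts_with('/') {
                paths.insert(path.to_string());
            }
        }
    }
    paths
}

/// Names from `RUNTIME_LOADED` whose file name is absent from `paths`.
fn missing_runtime_libs(paths: &BTreeSet<String>) -> Vec<&'static str> {
    RUNTIME_LOADED
        .iter()
        .copied()
        .filter(|name| !paths.iter().any(|p| p.rsplit('/').next() == Some(*name)))
        .collect()
}
```

Anything returned by `missing_runtime_libs` would be a candidate for copying into the sandbox in addition to what `ldd` reports.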

@jaybosamiya-ms jaybosamiya-ms marked this pull request as ready for review April 10, 2026 23:04
@jaybosamiya-ms jaybosamiya-ms marked this pull request as draft April 11, 2026 00:05
auto-merge was automatically disabled April 11, 2026 00:05

Pull request was converted to draft

@jaybosamiya-ms jaybosamiya-ms force-pushed the jayb/fix-unwinding-flakiness branch from 2e22d4c to 2cba912 Compare April 11, 2026 00:19
@github-actions

🤖 SemverChecks 🤖 No breaking API changes detected

Note: this does not mean API is unchanged, or even that there are no breaking changes; simply, none of the detections triggered.

@jaybosamiya-ms jaybosamiya-ms marked this pull request as ready for review April 11, 2026 00:29
Contributor

@CvvT CvvT left a comment

Thanks @jaybosamiya-ms for the fix! I wonder how you figured out the root cause. Also, we could extend the fix to search for all libs that might be loaded at runtime, e.g.,

strings /lib/x86_64-linux-gnu/libc.so.6 | grep gcc
libgcc_s.so.1
__gcc_personality_v0
libgcc_s.so.1 must be installed for unwinding to work
libgcc_s.so.1 must be installed for pthread_cancel to work
libgcc_s.so.1 must be installed for pthread_exit to work
.gcc_except_table

We don't need to do it now; I'm just leaving a comment for future reference.
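
A rough sketch of what such a search might look like, mirroring the `strings ... | grep` approach above: scan a binary's bytes for printable runs that look like shared-library names. The function names and the name heuristic are assumptions for illustration, and a match only means the name is embedded in the file, not that it is actually loaded.

```rust
/// Collect printable-ASCII runs of at least 4 bytes (similar to `strings`)
/// and keep the ones that look like shared-library names, without duplicates.
fn embedded_sonames(bytes: &[u8]) -> Vec<String> {
    let mut found: Vec<String> = Vec::new();
    // Split on every byte that is not printable ASCII.
    for run in bytes.split(|b| !(0x20..=0x7e).contains(b)) {
        if run.len() < 4 {
            continue;
        }
        // Printable ASCII is always valid UTF-8, so this cannot fail here.
        let s = std::str::from_utf8(run).unwrap_or("");
        if looks_like_soname(s) && !found.iter().any(|f| f.as_str() == s) {
            found.push(s.to_string());
        }
    }
    found
}

/// Heuristic (an assumption): "lib" prefix, no spaces, and a ".so" suffix
/// or a versioned ".so.N" component.
fn looks_like_soname(s: &str) -> bool {
    s.starts_with("lib") && !s.contains(' ') && (s.ends_with(".so") || s.contains(".so."))
}
```

Calling `embedded_sonames(&std::fs::read(path)?)` on a library such as libc would give a list of candidates to cross-check against what is already in the sandbox.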

@jaybosamiya-ms jaybosamiya-ms added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit e174357 Apr 13, 2026
14 checks passed
@jaybosamiya-ms jaybosamiya-ms deleted the jayb/fix-unwinding-flakiness branch April 13, 2026 18:06
@jaybosamiya-ms
Member Author

Root cause: the CI jobs where it failed had this somewhat consistently in the error logs (e.g., https://github.com/microsoft/litebox/actions/runs/24112381641/job/70349522828):

  stderr ───
    iperf3: error - unable to connect to server - server may have stopped running or use a different port, firewall issue, etc.: Connection refused
    iperf3 client attempt 1 failed, retrying
    WARNING: unsupported: clock_gettime(clockid = 2)
    WARNING: unsupported: unsupported syscall getrusage
    iperf3: getsockopt - Operation not supported
    WARNING: unsupported: clock_gettime(clockid = 2)
    WARNING: unsupported: unsupported syscall getrusage
    iperf3: getsockopt - Operation not supported
    libgcc_s.so.1 must be installed for pthread_cancel to work
    -- Fatal signal Signal(6): terminating task 12249:12249

    thread 'test_tun_and_runner_with_iperf3' (12055) panicked at litebox_runner_linux_userland/tests/run.rs:184:9:
    failed to run litebox_runner_linux_userland: exit status: 6
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

So libgcc_s.so.1 + pthread_cancel was already suspicious there, and I was already considering adding it in. It is also suspicious for the signal to be 6 (i.e., SIGABRT), which should be rarer in unwinding-active compilations of Rust programs.
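
As a small illustration of how this kind of failure could be flagged automatically in captured stderr, the sketch below looks for the "must be installed for ... to work" message seen in the log above. The function name is hypothetical, and the message format is assumed from the strings quoted in this thread only.

```rust
/// If `log` contains a line of the form
/// "<lib> must be installed for <op> to work", return `(lib, op)`.
fn missing_lib_diagnostic(log: &str) -> Option<(&str, &str)> {
    const MARKER: &str = " must be installed for ";
    const SUFFIX: &str = " to work";
    for line in log.lines() {
        let line = line.trim();
        if let Some((lib, rest)) = line.split_once(MARKER) {
            if let Some(op) = rest.strip_suffix(SUFFIX) {
                return Some((lib, op));
            }
        }
    }
    None
}
```

A test harness could call this on a failed run's stderr to distinguish this specific abort from other iperf3 failures.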

Unfortunately, I was unable to reproduce it locally, even with around 100 executions in a loop, so I gave an Opus agent the above error message plus instructions to try to capture the failure in an rr recording. It ran overnight and was able to find a repro case. On its own, it decided to debug the failure and came up with a fix, tested the fix with hundreds to thousands of runs, and was unable to reproduce the failure anymore. I did not like the fix it suggested since it was just a hardcoded one-off, so I manually generalized it and also applied the same fix to dev_bench. Once the agent had confirmed that the issue was indeed due to libgcc_s, I did a few searches and found the reference link above, which provides a bit of context.


Good point about doing a better search for runtime-loaded libs. I'll open an issue to track it, thanks @CvvT!
