Skip to content

Fix Ctrl-C segfaults in sage.libs.gap #40613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 36 commits into
base: develop
Choose a base branch
from

Conversation

orlitzky
Copy link
Contributor

@orlitzky orlitzky commented Aug 17, 2025

Replace cysignals' sig_on() and sig_off() with custom gap_sig_on() and gap_sig_off() wrappers that use GAP's own SIGINT handling. This fixes an uncommon, but ultimately reproducible segfault when Ctrl-C is used to interrupt a GAP computation.

In concert with #40594 this has allowed sage/libs/gap/element.pyx to pass many thousands of test iterations.

Fixes

Dependencies

user202729 and others added 10 commits August 16, 2025 22:19
We plan to install InterruptExecStat (gap/stats.h) as the SIGINT and
SIGALRM handler while GAP code is running, since that seems to work
better than mixing cysignals with GAP_Enter/GAP_Leave has.
Add two new pre/post-GAP functions to enable/disable GAP's own SIGINT
(Ctrl-C) handler. These avoid the setjmp/longjmp issues we've
encountered with the sig_on() and sig_off() from cysignals.
Replace the sig_on() and sig_off() wrappers from cysignals with the
new custom gap_sig_on() and gap_sig_off(). We've lost the sig_on() and
sig_off() around GAP_initialize() because I don't think we can trust
GAP's own signal handler until after we've initialized GAP.
Replace the sig_on() and sig_off() wrappers from cysignals with the
new custom gap_sig_on() and gap_sig_off(). This fixes the segfault
that occurs after repeated testing of _pow_.
We are no longer using cysignals for this; instead we have our own
gap_sig_on() and gap_sig_off() functions.
The way we handle signals in libgap is now unusual, so a few
paragraphs have been added to the libgap module docstring explaining
how and why it is unusual.
We've found that GAP's own SIGINT handler is less crashy than if we
mix GAP_Enter/GAP_Leave with cysignals' sig_on and sig_off. We've
installed that same handler for SIGALRM, but the code that catches it
can't tell the difference and converts them both in to
KeyboardInterrupt. To retain the ability to doctest this stuff, we have
to catch those KeyboardInterrupts and poke at them to see if they arose
from GAP.
This function used cysignals for sig_error(), and we have opted not to
mix cysignals code with libgap code. It has in any case been replaced
by plain GAP_Enter calls.
# Ctrl-C
raise KeyboardInterrupt from e
else:
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is some really bad duplication of code. I suggest

  • fold GAP_Enter and GAP_Leave into gap_sig_on and gap_sig_off respectively
  • instead of setting the interrupt handler directly to InterruptExecStat, make custom wrapper function (or hook into cysignals), the custom wrapper function sets a global variable that determine which kind of signal is being raised, then call InterruptExecStat
  • the global variable above need to be reset at appropriate places
  • modify error_handler to, instead of indiscriminately raise GAPError, raise the correct exception by reading the global variable above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's pretty bad but I was trying to keep it as simple as possible for the first iteration. First get it to not crash, then make it pretty.

I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a SIGINT can't be triggered while handling a SIGALRM, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.

We may be able to combine GAP_Enter and gap_sig_on(), though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a SIGINT can't be triggered while handling a SIGALRM, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.

you're right but… I think currently the signal handler is calling PyErr_Restore() or something anyway, and that one isn't reentrant either.

Searching up a bit https://stackoverflow.com/a/3127697/, that's indeed a possibility. But doesn't the same thing happen for cysignals as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may be able to combine GAP_Enter and gap_sig_on(), though.

Even this immediately leads to a segault.

e.__traceback__ = None
alarm_raised = True
else:
raise
Copy link
Contributor

@user202729 user202729 Aug 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the logic below (raise AlarmInterrupt instead of KeyboardInterrupt) are correctly implemented, this change wouldn't be necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've further simplified this down to a one-line change, using the fact that AlarmInterrupt is a subclass of KeyboardInterrupt. Instead of catching AlarmInterrupt, we can catch KeyboardInterrupt, and then check,

  • is it an AlarmInterrupt?
  • does it have "user interrupt" as the message?

I prefer this to hacking AlarmInterrupt into the GAP code at this point because sage.libs.gap is otherwise free of cysignals. To raise the AlarmInterrupt from sage.libs.gap we'd have to,

  1. Add cysignals back as a dep in meson.build
  2. Import AlarmInterrupt
  3. Add the global last_signal variable
  4. Set last_signal whenever a handler is called
  5. Check last_signal when raising a GAPError and convert it to the right thing

Since this is all for the benefit of one doctest method, it just seems easier to add the one line in that doctest method?

@user202729
Copy link
Contributor

user202729 commented Aug 17, 2025

I find the current implementation (before this pull request) questionable also—one shouldn't be calling any Python code between GAP_Enter…GAP_Leave, and if there's no Python code, try…finally isn't necessary.

If I understood correctly, the way the current exception handler works is that it just sets PyErr_Restore and hope that Python checks the exception quickly enough.

One implementation idea:

whenever you need to call GAP API, you only need to

gap_sig_on()
gap_check_exceptions()
<GAP code here, must not contain any Python code, therefore cannot raise any exception>
gap_sig_off()

where you do something like

int gap_enter_ok; // global variable internal to that file
#define gap_sig_on() gap_set_signal_handlers(); gap_enter_ok = GAP_Enter()
#define gap_check_exceptions() if (!gap_enter_ok) { GAP_Leave(); raise_appropriate_exception_type() }
#define gap_sig_off() GAP_Leave(); gap_reset_signal_handler()

Copy link

github-actions bot commented Aug 17, 2025

Documentation preview for this PR (built with commit 8b8c496; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@orlitzky
Copy link
Contributor Author

I find the current implementation (before this pull request) questionable also—one shouldn't be calling any Python code between GAP_Enter…GAP_Leave, and if there's no Python code, try…finally isn't necessary.

If I understood correctly, the way the current exception handler works is that it just sets PyErr_Restore and hope that Python checks the exception quickly enough.

I think the error handler returns control to GAP, though, which eventually jumps back to GAP_Enter. If I move the GAP_Enter outside of try/except, we won't catch the GAPError that is raised. I left comments on the GAP_Enter lines because this surprised me at first.

@user202729
Copy link
Contributor

I think the error handler returns control to GAP, though, which eventually jumps back to GAP_Enter. If I move the GAP_Enter outside of try/except, we won't catch the GAPError that is raised. I left comments on the GAP_Enter lines because this surprised me at first.

In my understanding, the way it works is the following. GAP_Enter is just a C macro, defined by

cdef extern from "gap/libgap-api.h" nogil:
    cdef int GAP_Enter() except 0

that by itself cannot raise an exception. But PyErr_Restore() can, and the exception is raised the next time Python checks for pending exceptions.

But the problem I see is that this looks dangerous, since to me there doesn't seem to be any guarantee it will be checked right after GAP_Enter call.

Anyway, need to double check.

@orlitzky
Copy link
Contributor Author

int gap_enter_ok; // global variable internal to that file
#define gap_sig_on() gap_set_signal_handlers(); gap_enter_ok = GAP_Enter()
#define gap_check_exceptions() if (!gap_enter_ok) { GAP_Leave(); raise_appropriate_exception_type() }
#define gap_sig_off() GAP_Leave(); gap_reset_signal_handler()

I'm trying a simpler version of this to start, with just gap_set_signal_handlers(), gap_sig_on(), and gap_sig_off() in an .h file, and it's already crashing as soon as I sent Ctrl-C or an alarm. I'm giving up for the night. I'm getting nowhere.

@user202729
Copy link
Contributor

user202729 commented Aug 18, 2025

I guess if you don't understand the segmentation fault very well it would be unfair. Not sure if the rare segmentation fault is worth the code duplication though.

Anyway what is your current implementation?

(Note that the GAP_Enter cannot be nested/hidden in anything, which is why I'm forced to separate the gap_check_exceptions from gap_sig_on and make them macros instead of inline functions.)

By the way, for quick hacking, you can use inline raw C embedded in pxd/pyx file with extern from *, instead of separate .h file.

If we're going to raise a GAPError that resulted from a SIGINT or
SIGALRM, we might as well turn it in to a KeyboardInterrupt right
away.
We are no longer retaining the __cause__ of KeyboardInterrupts
that arise from GAP, but the error message is still there, so
we can use that to filter these out.
We are now converting "GAPError: Error, user interrupt" into a
KeyboardInterrupt in our libgap error handler, so we don't have
to do it at every call site.
Somehow we wound up with a three-space indentation. Make it four.
Include an example of "correct" interruptible libgap usage, and delete
an example of what not to do. Bad examples can be dangerous for people
who are skimming the documentation, and are not that interesting
anyway (wrong code is easy to write).
The check for _pow_ being interruptible by Ctrl-C is subject to some
variation due to CPU usage, scheduling, etc. Here we increase the
tolerance to 5s, which should be more than enough time to interrupt
it, and also increase the size of the problem so that it runs for
at least 5s even on fast machines.
We can check for a GAP boolean using only libgap-api.h by testing for
one of the three objects: GAP_True, GAP_False, GAP_Fail.
We should mention that TNUM_OBJ is safe to use outside of GAP_Enter /
GAP_Leave, and that we are only using it because the alternative is
too slow.
We can use the official C API for the length of a list in
make_any_gap_element() without wasting any time.
@orlitzky orlitzky force-pushed the fix-gap-ctrl-c-segfaults branch from 2b2ffbb to 36da2ca Compare August 21, 2025 16:14
In the GAP_Enter / GAP_Leave example, it's probably better to store
the GapElement in a python variable (rather than a C variable), just
in case python might decide to garbage collect it otherwise.
@orlitzky
Copy link
Contributor Author

orlitzky commented Aug 21, 2025

ensure_interruptible_after() is flaky to begin with because it uses wall time (sensitive to CPU usage, scheduling, etc.).

Is it really? The logic measures the time taken from when the "user" presses Ctrl-C until KeyboardInterrupt is raised.

Yeah, wall time is just fundamentally not what we want to measure. Start ensure_interruptible_after(), hit Ctrl-Z, eat lunch, come back, type fg... it will raise an error because it thinks that the function took half an hour to interrupt. In reality, the signal was sent right away, and the function was interrupted soon thereafter; it's the measurement that's wrong.

Unless the system is really, really busy that the user's computer visibly lags (or there's a bug somewhere else such as #40556), this should be imperceptible.

If my computer isn't lagging, I start more test processes. Seriously though, maybe I just have bad luck, but I see these consistently. Infrequently, but still. Switching it to CPU time is not necessarily a trivial fix either or I would have done it by now.

@orlitzky
Copy link
Contributor Author

(Wall time is an OK way to measure how long a real person has to wait after pressing Ctrl-C, it's just not great for an automated test suite.)

@user202729
Copy link
Contributor

Start ensure_interruptible_after(), hit Ctrl-Z, eat lunch, come back, type fg... it will raise an error because it thinks that the function took half an hour to interrupt.

I don't think this is true? Unless the Ctrl-Z is exactly between the time the signal is sent and the time the interrupt is raised.

Plus, I don't think anyone is Ctrl-Z anything on the CI.

Seriously though, maybe I just have bad luck, but I see these consistently.

okay, I agree this happens on the CI too.

I just look it up that the default time slice is on the order of 1 ms, so if there are 1 CPU and 50 processes busily using the same CPU, something taking 1μs of CPU time has a chance of 1/1000 of being interrupted by a context switch, and if it is interrupted by a context switch the expected time taken is 50ms = 0.05s. The default tolerance now is 0.2s, so in case of being interrupted there's roughly 1/4 chance (Markov's inequality) it could exceed the time limit.

Not sure if my math is correct, but that sounds awfully high as well. Will revisit later. But anyway, not too relevant to this PR.

@user202729
Copy link
Contributor

Hm, possibly. I updated it to x = libgap({'a': 1, 'b': 2}) so that it gets a local python reference. But now I'm not so sure that the xlen = GAP_LenList((<GapElement>x).value) later on is guaranteed to prevent x from being garbage-collected.

it probably should, but I don't have enough understanding of the garbage collector to say. I'd say, as long as the GAP_MarkBag works as intended, then it ought to work.

FWIW the original pattern appears in make_gap_list() and make_gap_matrix(). If it's buggy, we should eventually look at those too.

As above. Maybe it's actually safe because the API functions used happen to not trigger the garbage collector.

@orlitzky
Copy link
Contributor Author

Start ensure_interruptible_after(), hit Ctrl-Z, eat lunch, come back, type fg... it will raise an error because it thinks that the function took half an hour to interrupt.

I don't think this is true? Unless the Ctrl-Z is exactly between the time the signal is sent and the time the interrupt is raised.

start_time = walltime() is set immediately, before alarm(seconds) is called. If you Ctrl-Z it at any point before the alarm is raised, this will be true:

if elapsed > seconds + max_wait_after_interrupt:

because elapsed will include however long it was asleep, and the other two do not.

Plus, I don't think anyone is Ctrl-Z anything on the CI.

I hope not, but my point is just that it's counting seconds that obviously shouldn't be counted.

@user202729
Copy link
Contributor

user202729 commented Aug 21, 2025

start_time = walltime() is set immediately, before alarm(seconds) is called.

Hm… that's an oversight. Probably it was just because the other way is somewhat difficult to implement.

anyway, the rest of the PR looks reasonable to me, but I should double check later.

We have one doctest in this file that is perpetually failing on the CI
because a CBF integral takes too long to interrupt. The computation is
extremely long-running, though, and "too long" is on the order of
deciseconds: everything is fine. This commit increases the wait
tolerance from 0.2s to 5s. That should be long enough to avoid all
random CPU scheduling issues, and is still much shorter than the time
it would take for the integral to complete.
@orlitzky
Copy link
Contributor Author

Funny enough, the one remaining CI failure is for ensure_interruptible_after() in CBF. I've fixed that one too in the same way.

@user202729
Copy link
Contributor

user202729 commented Aug 21, 2025

looking at it again, I'm suspecting that failure is a genuine bug. Currently cysignals do

static inline void sig_unblock(void)
{
⋮
    if (unlikely(cysigs.interrupt_received))
        /* Re-raise the signal if we can handle it now */
        if (cysigs.sig_on_count > 0 && cysigs.block_sigint == 0)
            proc_raise(cysigs.interrupt_received);
}

where

#if HAVE_KILL
#define proc_raise(sig)  kill(getpid(), sig)
#else
/* On Windows, raise() actually signals the process */
#define proc_raise(sig)  raise(sig)
#endif

but since we're sig_unblock() then later immediately sig_block() again, there's no guarantee that the signal will be immediately processed instead of getting blocked again. It seems better to me to let sig_unblock call do_raise_exceptionsiglongjmp as in implementation.c / cysigs_interrupt_handler instead.

there's no way the cos(x)*sin(x) could take 5 seconds to run… actually, the GC could make it slow also, but still should be less than 0.2 seconds, unless there's a huge number of live objects.

@orlitzky
Copy link
Contributor Author

but since we're sig_unblock() then later immediately sig_block() again, there's no guarantee that the signal will be immediately processed instead of getting blocked again. It seems better to me to let sig_unblock call do_raise_exceptionsiglongjmp as in implementation.c / cysigs_interrupt_handler instead.

Indeed, this could also be an issue with this sort of cysignals usage. I'm not sure why this function is using sig_block() and sig_unblock() in the first place -- you're probably not supposed to run them in a loop. Regardless, for me this example "hangs" while signals are unblocked. It usually gets interrupted on time, but if I run ensure_interruptible_after() in a loop 10,000 times, I can get some weird fluctuations. The same is true of _pow_ in GAP now though, and it's not using cysignals, so I think if anything there might be a problem with sig-block-unblock in a loop and large random fluctuations we have to account for.

there's no way the cos(x)*sin(x) could take 5 seconds to run… actually, the GC could make it slow also, but still should be less than 0.2 seconds, unless there's a huge number of live objects.

It's doing cos(x)*sin(x) in a field with 1,000,000 bits of precision. I've never seen it take 5s, but I've seen it take 3s, and we need 5s to be a lot bigger than 3s for it to pass with confidence. It takes a lot longer than 5s to actually finish, so if it stops after 3s, I think that's OK.

I can drop that commit and leave it for a rainy day if you prefer, it's up to you.

@user202729
Copy link
Contributor

user202729 commented Aug 22, 2025

I'm not sure why this function is using sig_block() and sig_unblock() in the first place -- you're probably not supposed to run them in a loop.

to avoid interrupting in the middle of the user callback function which is likely to corrupt the Python state.

I've seen it take 3s

okay, that's surprising. But then maybe that can happen if x is really close to a multiple of π, or something like the table maker's dilemma, or because trigonometric functions are just slow.

@user202729
Copy link
Contributor

user202729 commented Aug 23, 2025

minor documentation formatting issue in rendered HTML https://doc-pr-40613--sagemath.netlify.app/html/en/reference/libs/sage/libs/gap/libgap#using-the-gap-c-library-from-cython (the indentation is interpreted as quotation, and some code blocks aren't interpreted as such because of the lack of ::), but no big deal. You can re-set positive review afterwards.

I may put in some other improvements to remove the string check some day (just store the signum received by the signal handler to some global variable, then read it from the error handler, instead of detecting the user interrupt string).

(and maybe expose some way to customize the class used for alarm interrupt, in order to raise AlarmInterrupt, for consistency with the rest of the code base.)

@user202729
Copy link
Contributor

actually I think you want to rebase it on top of #40594 (instead of cherry-pick the first 2 commits as in the current situation) to avoid polluting the commit history.

These never should have been indented in the first place; the whole
block is being rendered as a quote.
@orlitzky
Copy link
Contributor Author

actually I think you want to rebase it on top of #40594 (instead of cherry-pick the first 2 commits as in the current situation) to avoid polluting the commit history.

git merge and git rebase will skip the two redundant commits so long as your branch is merged first. I added your PR as a dependency of this one in the description to ensure that the order is correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants