-
-
Notifications
You must be signed in to change notification settings - Fork 652
Fix Ctrl-C segfaults in sage.libs.gap #40613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
We plan to install InterruptExecStat (gap/stats.h) as the SIGINT and SIGALRM handler while GAP code is running, since that seems to work better than mixing cysignals with GAP_Enter/GAP_Leave has.
Add two new pre/post-GAP functions to enable/disable GAP's own SIGINT (Ctrl-C) handler. These avoid the setjmp/longjmp issues we've encountered with the sig_on() and sig_off() from cysignals.
Replace the sig_on() and sig_off() wrappers from cysignals with the new custom gap_sig_on() and gap_sig_off(). We've lost the sig_on() and sig_off() around GAP_initialize() because I don't think we can trust GAP's own signal handler until after we've initialized GAP.
Replace the sig_on() and sig_off() wrappers from cysignals with the new custom gap_sig_on() and gap_sig_off(). This fixes the segfault that occurs after repeated testing of _pow_.
We are no longer using cysignals for this; instead we have our own gap_sig_on() and gap_sig_off() functions.
The way we handle signals in libgap is now unusual, so a few paragraphs have been added to the libgap module docstring explaining how and why it is unusual.
We've found that GAP's own SIGINT handler is less crashy than if we mix GAP_Enter/GAP_Leave with cysignals' sig_on and sig_off. We've installed that same handler for SIGALRM, but the code that catches it can't tell the difference and converts them both in to KeyboardInterrupt. To retain the ability to doctest this stuff, we have to catch those KeyboardInterrupts and poke at them to see if they arose from GAP.
This function used cysignals for sig_error(), and we have opted not to mix cysignals code with libgap code. It has in any case been replaced by plain GAP_Enter calls.
src/sage/libs/gap/element.pyx
Outdated
# Ctrl-C | ||
raise KeyboardInterrupt from e | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is some really bad duplication of code. I suggest
- fold
GAP_Enter
andGAP_Leave
intogap_sig_on
andgap_sig_off
respectively - instead of setting the interrupt handler directly to
InterruptExecStat
, make custom wrapper function (or hook into cysignals), the custom wrapper function sets a global variable that determine which kind of signal is being raised, then callInterruptExecStat
- the global variable above need to be reset at appropriate places
- modify
error_handler
to, instead of indiscriminately raiseGAPError
, raise the correct exception by reading the global variable above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's pretty bad but I was trying to keep it as simple as possible for the first iteration. First get it to not crash, then make it pretty.
I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a SIGINT
can't be triggered while handling a SIGALRM
, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.
We may be able to combine GAP_Enter
and gap_sig_on()
, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can use a global for the signal type because signals are asynchronous. There's no reason why a
SIGINT
can't be triggered while handling aSIGALRM
, or vice-versa. It's not very likely, but if there's a lesson here it's that we shouldn't count on unlikely things not happening.
you're right but… I think currently the signal handler is calling PyErr_Restore()
or something anyway, and that one isn't reentrant either.
Searching up a bit https://stackoverflow.com/a/3127697/, that's indeed a possibility. But doesn't the same thing happen for cysignals as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may be able to combine
GAP_Enter
andgap_sig_on()
, though.
Even this immediately leads to a segault.
e.__traceback__ = None | ||
alarm_raised = True | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the logic below (raise AlarmInterrupt instead of KeyboardInterrupt) are correctly implemented, this change wouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've further simplified this down to a one-line change, using the fact that AlarmInterrupt
is a subclass of KeyboardInterrupt
. Instead of catching AlarmInterrupt
, we can catch KeyboardInterrupt
, and then check,
- is it an
AlarmInterrupt
? - does it have "user interrupt" as the message?
I prefer this to hacking AlarmInterrupt
into the GAP code at this point because sage.libs.gap
is otherwise free of cysignals. To raise the AlarmInterrupt
from sage.libs.gap
we'd have to,
- Add cysignals back as a dep in
meson.build
- Import
AlarmInterrupt
- Add the global
last_signal
variable - Set
last_signal
whenever a handler is called - Check
last_signal
when raising aGAPError
and convert it to the right thing
Since this is all for the benefit of one doctest method, it just seems easier to add the one line in that doctest method?
I find the current implementation (before this pull request) questionable also—one shouldn't be calling any Python code between GAP_Enter…GAP_Leave, and if there's no Python code, try…finally isn't necessary. If I understood correctly, the way the current exception handler works is that it just sets One implementation idea: whenever you need to call GAP API, you only need to
where you do something like
|
Documentation preview for this PR (built with commit 8b8c496; changes) is ready! 🎉 |
I think the error handler returns control to GAP, though, which eventually jumps back to |
In my understanding, the way it works is the following.
that by itself cannot raise an exception. But But the problem I see is that this looks dangerous, since to me there doesn't seem to be any guarantee it will be checked right after Anyway, need to double check. |
I'm trying a simpler version of this to start, with just |
I guess if you don't understand the segmentation fault very well it would be unfair. Not sure if the rare segmentation fault is worth the code duplication though. Anyway what is your current implementation? (Note that the By the way, for quick hacking, you can use inline raw C embedded in pxd/pyx file with |
If we're going to raise a GAPError that resulted from a SIGINT or SIGALRM, we might as well turn it in to a KeyboardInterrupt right away.
We are no longer retaining the __cause__ of KeyboardInterrupts that arise from GAP, but the error message is still there, so we can use that to filter these out.
We are now converting "GAPError: Error, user interrupt" into a KeyboardInterrupt in our libgap error handler, so we don't have to do it at every call site.
Somehow we wound up with a three-space indentation. Make it four.
Include an example of "correct" interruptible libgap usage, and delete an example of what not to do. Bad examples can be dangerous for people who are skimming the documentation, and are not that interesting anyway (wrong code is easy to write).
The check for _pow_ being interruptible by Ctrl-C is subject to some variation due to CPU usage, scheduling, etc. Here we increase the tolerance to 5s, which should be more than enough time to interrupt it, and also increase the size of the problem so that it runs for at least 5s even on fast machines.
We can check for a GAP boolean using only libgap-api.h by testing for one of the three objects: GAP_True, GAP_False, GAP_Fail.
We should mention that TNUM_OBJ is safe to use outside of GAP_Enter / GAP_Leave, and that we are only using it because the alternative is too slow.
We can use the official C API for the length of a list in make_any_gap_element() without wasting any time.
2b2ffbb
to
36da2ca
Compare
In the GAP_Enter / GAP_Leave example, it's probably better to store the GapElement in a python variable (rather than a C variable), just in case python might decide to garbage collect it otherwise.
Yeah, wall time is just fundamentally not what we want to measure. Start
If my computer isn't lagging, I start more test processes. Seriously though, maybe I just have bad luck, but I see these consistently. Infrequently, but still. Switching it to CPU time is not necessarily a trivial fix either or I would have done it by now. |
(Wall time is an OK way to measure how long a real person has to wait after pressing Ctrl-C, it's just not great for an automated test suite.) |
I don't think this is true? Unless the Ctrl-Z is exactly between the time the signal is sent and the time the interrupt is raised. Plus, I don't think anyone is Ctrl-Z anything on the CI.
okay, I agree this happens on the CI too. I just look it up that the default time slice is on the order of 1 ms, so if there are 1 CPU and 50 processes busily using the same CPU, something taking 1μs of CPU time has a chance of 1/1000 of being interrupted by a context switch, and if it is interrupted by a context switch the expected time taken is 50ms = 0.05s. The default tolerance now is 0.2s, so in case of being interrupted there's roughly 1/4 chance (Markov's inequality) it could exceed the time limit. Not sure if my math is correct, but that sounds awfully high as well. Will revisit later. But anyway, not too relevant to this PR. |
it probably should, but I don't have enough understanding of the garbage collector to say. I'd say, as long as the
As above. Maybe it's actually safe because the API functions used happen to not trigger the garbage collector. |
if elapsed > seconds + max_wait_after_interrupt: because
I hope not, but my point is just that it's counting seconds that obviously shouldn't be counted. |
Hm… that's an oversight. Probably it was just because the other way is somewhat difficult to implement. anyway, the rest of the PR looks reasonable to me, but I should double check later. |
We have one doctest in this file that is perpetually failing on the CI because a CBF integral takes too long to interrupt. The computation is extremely long-running, though, and "too long" is on the order of deciseconds: everything is fine. This commit increases the wait tolerance from 0.2s to 5s. That should be long enough to avoid all random CPU scheduling issues, and is still much shorter than the time it would take for the integral to complete.
Funny enough, the one remaining CI failure is for |
looking at it again, I'm suspecting that failure is a genuine bug. Currently cysignals do
where
but since we're there's no way the |
Indeed, this could also be an issue with this sort of cysignals usage. I'm not sure why this function is using
It's doing I can drop that commit and leave it for a rainy day if you prefer, it's up to you. |
to avoid interrupting in the middle of the user callback function which is likely to corrupt the Python state.
okay, that's surprising. But then maybe that can happen if x is really close to a multiple of π, or something like the table maker's dilemma, or because trigonometric functions are just slow. |
minor documentation formatting issue in rendered HTML https://doc-pr-40613--sagemath.netlify.app/html/en/reference/libs/sage/libs/gap/libgap#using-the-gap-c-library-from-cython (the indentation is interpreted as quotation, and some code blocks aren't interpreted as such because of the lack of I may put in some other improvements to remove the string check some day (just store the (and maybe expose some way to customize the class used for alarm interrupt, in order to raise |
actually I think you want to rebase it on top of #40594 (instead of cherry-pick the first 2 commits as in the current situation) to avoid polluting the commit history. |
These never should have been indented in the first place; the whole block is being rendered as a quote.
This is being rendered as a quote.
|
Replace cysignals'
sig_on()
andsig_off()
with customgap_sig_on()
andgap_sig_off()
wrappers that use GAP's ownSIGINT
handling. This fixes an uncommon, but ultimately reproducible segfault when Ctrl-C is used to interrupt a GAP computation.In concert with #40594 this has allowed
sage/libs/gap/element.pyx
to pass many thousands of test iterations.Fixes
InterruptExecStat
for GAP interrupt #40598Dependencies