Skip to content

Terminate process group after cram execution #11841

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Leonidas-from-XIV
Copy link
Collaborator

@Leonidas-from-XIV Leonidas-from-XIV commented May 23, 2025

After discussing an alternate solution for #11827 with @rgrinberg we agreed that the best course of action is to just terminate the entire process group of the cram test after execution.

This PR reuses @Alizter's cram test repro case.

It introduces a slight race condition that we might be terminating an unrelated process group that accidentally got the same PID as the cram test. I am unaware of a way to avoid it, the chance is not substantial as OSes usually increment PIDs until they wrap around. In any case the engine is already terminating process groups if the process to be waited on has a timeout, so this code just extends it to the case where there is no timeout.

Fixes #11820

@Leonidas-from-XIV
Copy link
Collaborator Author

Surprising behavior on Windows when we kill_process_group which attempts to Unix.kill the PID which somehow causes corruption in the files that ocamldep writes. That's odd because at that time we have awaited the process so it should have exited already.

@Alizter
Copy link
Collaborator

Alizter commented May 24, 2025

@Leonidas-from-XIV that sounds to me like we have a race condition. Is it similar to the issue in #11010?

@rgrinberg
Copy link
Member

Seems like there's a failure in CI. Is it something relevant?

@Leonidas-from-XIV
Copy link
Collaborator Author

I'm trying to figure out. It works reliably on my system (only failures are from Melange), CI sometimes hangs forever which is unrelated to this PR but its hard to say whether the failure is related since I can't reproduce it.

@Alizter
Copy link
Collaborator

Alizter commented Jun 11, 2025

@Leonidas-from-XIV Could you push a temporary commit disabling all of exec-watch and perhaps watching. Those would be my goto culprits for tests that doesn't terminate.

Alizter and others added 6 commits June 19, 2025 11:41
Signed-off-by: Ali Caglayan <[email protected]>
Signed-off-by: Marek Kubica <[email protected]>
Signed-off-by: Marek Kubica <[email protected]>
Not quite sure why this would make a difference, because by the time
`kill_process_group` is called on the PID the PID should be terminated
and the operation a no-op, but CI is failing on Windows, so this commit
skips the call and checks if that makes a difference.

Signed-off-by: Marek Kubica <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Background tasks in Cram tests are not terminated
3 participants