Ac slurmfix#209
Merged
sbatch outputs 'Submitted batch job <id>' but the code was passing the entire string to sacct, causing 'Bad job/step specified: Submitted'.

Also fix monitor_job_completion:
- Use squeue to poll while job is pending/running (sacct is not populated until after the job leaves the queue)
- Fall back to sacct with retry loop for final exit status
- Parse sacct output correctly: --parsable2 returns one line per job step, with STATE|EXITCODE format; previously the whole multi-line string was compared against single status strings
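The two parsing fixes described here can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `parse_sbatch_job_id` and `parse_sacct_states` are hypothetical, and the sacct input is assumed to be the header-less `STATE|EXITCODE` lines the commit message describes.

```python
import re
from typing import List, Tuple


def parse_sbatch_job_id(stdout: str) -> str:
    """Extract the numeric job ID from sbatch output.

    sbatch prints 'Submitted batch job <id>'; only <id> should be
    passed on to squeue/sacct, not the whole string.
    """
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match is None:
        raise ValueError(f"No job ID found in sbatch output: {stdout!r}")
    return match.group(1)


def parse_sacct_states(stdout: str) -> List[Tuple[str, str]]:
    """Split sacct output into (state, exit_code) pairs, one per job step.

    Assumes each non-empty line has the form 'STATE|EXITCODE'.
    """
    pairs = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than treating them as a step
        state, _, exit_code = line.partition("|")
        pairs.append((state, exit_code))
    return pairs
```

Comparing each `(state, exit_code)` pair individually avoids the earlier bug, where the whole multi-line string was compared against a single status string.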
The run() function was changed in 0.6.18 to always use get_executor(), which routes all cluster jobs through the new CLI-based SlurmExecutor/SGEExecutor classes, bypassing GridExecutor (DRMAA) entirely. This broke all users with a working DRMAA installation.

Fix:
- run() now uses GridExecutor when DRMAA is available and a session is active; CLI-based executors are used only as a fallback when DRMAA is absent.
- will_run_on_cluster() no longer raises when DRMAA is missing; it returns False so the CLI executor fallback can proceed cleanly.
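The DRMAA-first selection rule can be illustrated with a small sketch. The function names `drmaa_is_available` and `choose_executor` are hypothetical and only model the decision described above; the real run() and get_executor() signatures are not shown in this PR conversation.

```python
def drmaa_is_available() -> bool:
    """Return True if the drmaa Python module can be imported.

    Any exception is treated as 'not available', since the import may
    fail for reasons other than a missing package (assumption: for
    example a missing native DRMAA library).
    """
    try:
        import drmaa  # noqa: F401
    except Exception:
        return False
    return True


def choose_executor(drmaa_available: bool, session_active: bool) -> str:
    """Model the selection rule: GridExecutor first, CLI executor as fallback."""
    if drmaa_available and session_active:
        return "GridExecutor"
    # No usable DRMAA session: fall back to the CLI-based executors
    # (SlurmExecutor/SGEExecutor) instead of raising.
    return "CLI executor"
```

For example, `choose_executor(drmaa_is_available(), session_active=False)` falls back to the CLI path rather than failing.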
Tested on my cluster and works.
setup_signal_handlers() was being called in every ruffus worker subprocess (each one inherits the parent's handler via fork). When the pipeline exits and SIGTERM is broadcast to the process group, every worker logs 'Received signal 15. Starting clean-up.' and runs cleanup in parallel, producing dozens of duplicate log lines. Fix: skip handler installation in non-main processes using multiprocessing.current_process().name.
The previous fix (checking for MainProcess) was wrong because gevent monkey-patches signal.signal() to accumulate watchers rather than replace them. Every Executor.__init__ call added another watcher, so N parallel tasks = N handlers firing on every signal. The correct fix: remove setup_signal_handlers() from __init__ entirely. It is already called exactly once from control.py:run_workflow(), which is the right place - once per pipeline run, not once per task.
The handler is installed once in the main process, but ruffus forks N worker processes (multiprocess=N, default 40 on cluster) *after* installation, so every worker inherits it. When the process group receives SIGTERM on normal pipeline exit, all 40 workers fire the handler, producing dozens of duplicate log lines. The previous attempt checked current_process().name at *setup* time, which is always 'MainProcess' and therefore did nothing. Correct fix: check inside the handler itself at *fire* time. Worker processes have names like 'Process-1', so they return immediately without logging or running cleanup.
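A minimal sketch of the fire-time check, assuming a handler shaped roughly like the one described. The names `should_handle` and `handle_signal` are hypothetical, and the clean-up itself is left as a placeholder comment.

```python
import logging
import multiprocessing
import signal

logger = logging.getLogger(__name__)


def should_handle(process_name: str) -> bool:
    """Only the main process should log and run clean-up."""
    return process_name == "MainProcess"


def handle_signal(signum, frame):
    # Evaluated when the signal fires, not when the handler is installed:
    # forked workers inherit this handler but have names like 'Process-1'.
    if not should_handle(multiprocessing.current_process().name):
        return
    logger.info("Received signal %s. Starting clean-up.", signum)
    # ... pipeline clean-up would run here (placeholder) ...


def setup_signal_handlers():
    """Install the handler once, from the main process."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handle_signal)
```

Checking the name inside the handler means it does not matter that workers are forked after installation; they still inherit the handler, but it returns immediately.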
- control.md: replace non-existent run_pipeline()/Pipeline API with the real P.main() / pipeline actions / run_workflow() description
- execution.md: document DRMAA-first executor selection, GridExecutor, get_executor() fallback, will_run_on_cluster(), and run() options
- cluster.md: fix YAML examples to use nested 'cluster:' key (not flat 'queue_manager:'); add DRMAA vs CLI section; add Torque example
- installation.md: replace deprecated 'setup.py develop' with 'pip install -e .'; add DRMAA setup instructions; update Python req to 3.8+
- run_parameters.md: replace duplicate cluster config content with actual run-time options (actions, CLI flags, per-task P.run() options)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No description provided.