Skip to content

Ac slurmfix#209

Merged
Acribbs merged 11 commits into
masterfrom
AC-slurmfix
Mar 8, 2026
Merged

Ac slurmfix#209
Acribbs merged 11 commits into
masterfrom
AC-slurmfix

Conversation

@Acribbs
Copy link
Copy Markdown
Contributor

@Acribbs Acribbs commented Mar 5, 2026

No description provided.

Acribbs added 4 commits March 5, 2026 19:04
sbatch outputs 'Submitted batch job <id>' but the code was passing the
entire string to sacct, causing 'Bad job/step specified: Submitted'.

Also fix monitor_job_completion:
- Use squeue to poll while job is pending/running (sacct is not
  populated until after the job leaves the queue)
- Fall back to sacct with retry loop for final exit status
- Parse sacct output correctly: --parsable2 returns one line per
  job step, with STATE|EXITCODE format; previously the whole
  multi-line string was compared against single status strings
The run() function was changed in 0.6.18 to always use get_executor(),
which routes all cluster jobs through the new CLI-based SlurmExecutor/
SGEExecutor classes, bypassing GridExecutor (DRMAA) entirely.

This broke all users with a working DRMAA installation.

Fix:
- run() now uses GridExecutor when DRMAA is available and a session is
  active; CLI-based executors are used only as a fallback when DRMAA is
  absent.
- will_run_on_cluster() no longer raises when DRMAA is missing; it
  returns False so the CLI executor fallback can proceed cleanly.
@Acribbs
Copy link
Copy Markdown
Contributor Author

Acribbs commented Mar 5, 2026

tested on my cluster and works

Acribbs and others added 7 commits March 5, 2026 21:15
setup_signal_handlers() was being called in every ruffus worker
subprocess (each one inherits the parent's handler via fork). When
the pipeline exits and SIGTERM is broadcast to the process group,
every worker logs 'Received signal 15. Starting clean-up.' and runs
cleanup in parallel, producing dozens of duplicate log lines.

Fix: skip handler installation in non-main processes using
multiprocessing.current_process().name.
The previous fix (checking for MainProcess) was wrong because gevent
monkey-patches signal.signal() to accumulate watchers rather than
replace them. Every Executor.__init__ call added another watcher, so
N parallel tasks = N handlers firing on every signal.

The correct fix: remove setup_signal_handlers() from __init__ entirely.
It is already called exactly once from control.py:run_workflow(), which
is the right place - once per pipeline run, not once per task.
The handler is installed once in the main process, but ruffus forks N
worker processes (multiprocess=N, default 40 on cluster) *after*
installation, so every worker inherits it.  When the process group
receives SIGTERM on normal pipeline exit, all 40 workers fire the
handler, producing dozens of duplicate log lines.

The previous attempt checked current_process().name at *setup* time,
which is always 'MainProcess' and therefore did nothing.

Correct fix: check inside the handler itself at *fire* time.  Worker
processes have names like 'Process-1', so they return immediately
without logging or running cleanup.
- control.md: replace non-existent run_pipeline()/Pipeline API with
  the real P.main() / pipeline actions / run_workflow() description
- execution.md: document DRMAA-first executor selection, GridExecutor,
  get_executor() fallback, will_run_on_cluster(), and run() options
- cluster.md: fix YAML examples to use nested 'cluster:' key (not flat
  'queue_manager:'); add DRMAA vs CLI section; add Torque example
- installation.md: replace deprecated 'setup.py develop' with
  'pip install -e .'; add DRMAA setup instructions; update Python req to 3.8+
- run_parameters.md: replace duplicate cluster config content with
  actual run-time options (actions, CLI flags, per-task P.run() options)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Acribbs Acribbs merged commit 2e4d6e7 into master Mar 8, 2026
14 checks passed
@Acribbs Acribbs deleted the AC-slurmfix branch March 8, 2026 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant