Dev/slurm reliablity updates #3
Conversation
…NIAL into dev/slurm_reliablity_updates
Pull request overview
This PR refactors SLURM dispatch configuration to better control partitions/nodes, adds stdout/stderr capture for SLURM-launched commands, consolidates Flowy output.txt artifacts into a single debug file, and introduces a helper script to analyze Flowy parquet records.
Changes:
- Replace per-task SLURM config maps with a unified `__task_resources__` structure and simplify nodelist handling.
- Append command-level stdout/stderr redirection to per-script log files under `$DATA_DIR/logs`.
- Collect Flowy `output.txt` outputs into `all_outputs.txt` and add a standalone parquet scanning script.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/genial/loop/full_run_v2.py | Refactors SLURM resources, adds logging redirection, and changes pending-job handling logic. |
| src/genial/ext_plugins/flowy/flowy_launcher_helper.py | Aggregates per-run output.txt files into a single debug artifact. |
| scripts/slurm_scripts/sbatch_dispatch_slurm_temp.sh | Updates SLURM template working directory path. |
| scripts/custom_scripts/check_flowy_data_records.py | Adds a CLI tool to summarize flowy_data_record.parquet content across runs. |
Comments suppressed due to low confidence (1)
src/genial/loop/full_run_v2.py:784
- Typo in the error message: "launher" should be "launcher" to make the exception clearer/searchable.
```python
raise NotImplementedError(f"bulk_flow_dirname {kwargs['bulk_flow_dirname']} is not implemented for the launher")
```
```
python scan_flowy_data_record.py
python scan_flowy_data_record.py --base /path/to/run
```
The module docstring usage examples refer to scan_flowy_data_record.py, but this file is named check_flowy_data_records.py. Update the usage text so users can run the script as documented.
Suggested change:
```diff
-python scan_flowy_data_record.py
-python scan_flowy_data_record.py --base /path/to/run
+python check_flowy_data_records.py
+python check_flowy_data_records.py --base /path/to/run
```
```python
# get env var
DATA_DIR = os.getenv("DATA_DIR")

DEFAULT_BASE = (
    f"{DATA_DIR}/output/"
    "multiplier_4bi_8bo_permuti_flowy/flowy_trans_run_12chains_3000steps_gen_iter0"
)
```
DEFAULT_BASE is built from the DATA_DIR environment variable without validating it. If DATA_DIR is unset, the default becomes None/output/..., which will later produce confusing "No directories matched" errors. Consider requiring --base when DATA_DIR is missing, or defaulting to a sensible path (e.g., current directory) with a clear error message.
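One way to follow this suggestion is to fail loudly when `DATA_DIR` is unset and no explicit base was given. This is a minimal sketch; the helper name `resolve_base` and its error message are hypothetical, not part of the PR:

```python
import os
from pathlib import Path

def resolve_base(cli_base=None):
    """Pick the scan base directory, refusing to build a path from a missing env var."""
    if cli_base is not None:
        return Path(cli_base)
    data_dir = os.getenv("DATA_DIR")
    if data_dir is None:
        # Fail with a clear message instead of silently producing "None/output/..."
        raise SystemExit("DATA_DIR is not set; pass --base /path/to/run explicitly.")
    return Path(data_dir) / "output"
```

The CLI would then call `resolve_base(args.base)` once, before any globbing, so the "No directories matched" path is never reached with a bogus default.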
```python
for k, c in items:
    bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
    bar = "#" * bar_len
    bar = ''
```
text_hist computes a bar (bar = "#" * bar_len) but then immediately overwrites it with an empty string, so the histogram output will never include bars. Remove the overwrite (or remove the bar computation entirely) so the function behavior matches its intent.
Suggested change:
```diff
-    bar = ''
```
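With the overwrite removed, the intended behavior can be sketched as a self-contained function. The name `text_hist` comes from the review comment; the column layout and sorting are assumptions, since only the loop body is visible in the diff:

```python
def text_hist(counts, bar_width=40):
    """Render one '#' bar per key, scaled to the largest count."""
    max_count = max(counts.values(), default=0)
    lines = []
    for k, c in sorted(counts.items()):
        bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
        bar = "#" * bar_len  # keep the computed bar; do not overwrite it with ''
        lines.append(f"{k:>12} {c:>6} {bar}")
    return lines
```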
```python
# make sure logs directory exists
Path(os.environ.get("DATA_DIR")).joinpath("logs").mkdir(parents=True, exist_ok=True)
_cmd += f" &> {Path(os.environ.get("DATA_DIR")) / "logs" / (Path(script_object.name).name + ".out")}"
```
The log redirection line has mismatched quotes inside the f-string (nested "DATA_DIR" / "logs" quotes), which raises a SyntaxError on Python versions before 3.12 and prevents importing/running this module there. Build the log path separately (or use single quotes for the env var lookup) and then interpolate it, and consider shell-quoting the path if it may contain special characters.
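A sketch of the suggested fix, building the path outside the f-string and shell-quoting it. The helper name `redirect_to_log` is hypothetical; `shlex.quote` guards against spaces or other shell metacharacters in the path:

```python
import os
import shlex
from pathlib import Path

def redirect_to_log(cmd, script_name):
    """Append '&> <logfile>' to cmd, with the path built outside the f-string."""
    log_dir = Path(os.environ["DATA_DIR"]) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / (Path(script_name).name + ".out")
    return f"{cmd} &> {shlex.quote(str(log_path))}"
```

Because the path expression never nests inside the f-string, this parses on all supported Python versions.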
```python
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
#SlurmDispatcher.cancel_job(job_id)
#jobs[job_index]["status"] = "CANCELLED"
# Let the normal failure/cancel flow handle retries/termination
sleep(5.0)
continue
if status and status not in ["PENDING", "RUNNING"]:
```
dispatch_all_jobs no longer cancels jobs that are pending due to ReqNodeNotAvail/UnavailableNodes even when cancel_on_unavailable=True. Since these jobs can remain PENDING indefinitely, the while pending: loop can hang forever and retries will never trigger. Either restore the cancel-and-mark-cancelled behavior, or add a bounded wait/timeout (then cancel/mark failed) so the dispatcher can make progress.
Suggested change:
```diff
-logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
-#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
-#SlurmDispatcher.cancel_job(job_id)
-#jobs[job_index]["status"] = "CANCELLED"
+logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
+SlurmDispatcher.cancel_job(job_id)
+jobs[job_index]["status"] = "CANCELLED"
 # Let the normal failure/cancel flow handle retries/termination
-sleep(5.0)
 continue
 if status and status not in ["PENDING", "RUNNING"]:
```
```diff
 if status == "PENDING" and cancel_on_unavailable:
     sq_state, sq_reason = SlurmDispatcher.get_squeue_state_and_reason(job_id)
     if sq_state == "PENDING" and sq_reason is not None:
         reason_lc = sq_reason.lower()
         if "reqnodenotavail" in reason_lc or "unavailablenodes" in reason_lc:
-            logger.warning(
-                f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang."
-            )
-            SlurmDispatcher.cancel_job(job_id)
-            jobs[script]["status"] = "CANCELLED"
+            logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Not cancelling")
+            #logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
+            #SlurmDispatcher.cancel_job(job_id)
+            #jobs[script]["status"] = "CANCELLED"
+            sleep(5.0)
+            continue
```
Same issue as dispatch_all_jobs: dispatch_all_scripts logs that an unschedulable job is pending but does not cancel/mark it, so the dispatcher can loop forever with pending never shrinking. If the intent is to "wait", add a maximum pending duration/backoff and then cancel+retry (or fail) to avoid infinite hangs.
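The bounded-wait idea can be sketched like this. The `get_reason` and `cancel` callables are injected stand-ins for the real `SlurmDispatcher.get_squeue_state_and_reason` / `SlurmDispatcher.cancel_job` calls, and the function name and default timeout are assumptions, not part of the PR:

```python
import time

def wait_or_cancel(job_id, get_reason, cancel, max_pending_s=600.0, poll_s=5.0):
    """Poll a pending job's squeue reason; cancel it once max_pending_s elapses."""
    deadline = time.monotonic() + max_pending_s
    while time.monotonic() < deadline:
        reason = get_reason(job_id)
        if reason is None:
            return "SCHEDULABLE"
        reason_lc = reason.lower()
        if "reqnodenotavail" not in reason_lc and "unavailablenodes" not in reason_lc:
            return "SCHEDULABLE"
        time.sleep(poll_s)
    cancel(job_id)  # bounded: give up instead of looping forever
    return "CANCELLED"
```

A `"CANCELLED"` return would then feed the existing failure/retry flow, so `pending` shrinks even when nodes never come back.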
```diff
 #SBATCH --error="/home/%u/slurm_logs/genial/sbatch_error/genial_flowy_%j_%N_$timestamp.log"
 # Move to working directory
-cd $$HOME/proj/genial
+cd $$HOME/proj/GENIAL
```
This changes the working directory to $HOME/proj/GENIAL, but other SLURM launch scripts in this repo use $HOME/proj/genial (lowercase). On case-sensitive filesystems this will fail at runtime. Make the path consistent (or derive it from $SRC_DIR / repository location) rather than hardcoding a different-cased directory name.
Suggested change:
```diff
-cd $$HOME/proj/GENIAL
+cd "$HOME/proj/genial"
```
refactored src/genial/loop/full_run_v2.py

save stdout/stderr outputs of all flowy synth runs of an encoding in all_outputs.txt (for debugging)

```
flowy_trans_run_12chains_3000steps_gen_iter0/synth_out_1000_1/res_00000000005689$ ls
all_outputs.txt
```

added script to analyze data from flowy synth run