
Dev/slurm reliablity updates #3

Open
plex1 wants to merge 6 commits into main from dev/slurm_reliablity_updates

Conversation


@plex1 plex1 commented Feb 25, 2026

  • Refactored src/genial/loop/full_run_v2.py

    • mainly to better control node and partition selection
    • added more logging for the stdout/stderr of the python commands; logs are written to {Path(os.environ.get("DATA_DIR")) / "logs" / (Path(script_object.name).name + ".out")}
  • Save the stdout/stderr of all flowy synth runs of an encoding in all_outputs.txt (for debugging)

    • saved here:
      flowy_trans_run_12chains_3000steps_gen_iter0/synth_out_1000_1/res_00000000005689$ ls
      all_outputs.txt
  • Added a script to analyze data from a flowy synth run


Copilot AI left a comment


Pull request overview

This PR refactors SLURM dispatch configuration to better control partitions/nodes, adds stdout/stderr capture for SLURM-launched commands, consolidates Flowy output.txt artifacts into a single debug file, and introduces a helper script to analyze Flowy parquet records.

Changes:

  • Replace per-task SLURM config maps with a unified __task_resources__ structure and simplify nodelist handling.
  • Append command-level stdout/stderr redirection to per-script log files under $DATA_DIR/logs.
  • Collect Flowy output.txt outputs into all_outputs.txt and add a standalone parquet scanning script.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

src/genial/loop/full_run_v2.py: Refactors SLURM resources, adds logging redirection, and changes pending-job handling logic.
src/genial/ext_plugins/flowy/flowy_launcher_helper.py: Aggregates per-run output.txt files into a single debug artifact.
scripts/slurm_scripts/sbatch_dispatch_slurm_temp.sh: Updates the SLURM template working directory path.
scripts/custom_scripts/check_flowy_data_records.py: Adds a CLI tool to summarize flowy_data_record.parquet content across runs.
Comments suppressed due to low confidence (1)

src/genial/loop/full_run_v2.py:784

  • Typo in the error message: "launher" should be "launcher" to make the exception clearer/searchable.
            raise NotImplementedError(f"bulk_flow_dirname {kwargs['bulk_flow_dirname']} is not implemented for the launher")


Comment on lines +9 to +10
python scan_flowy_data_record.py
python scan_flowy_data_record.py --base /path/to/run

Copilot AI Feb 25, 2026


The module docstring usage examples refer to scan_flowy_data_record.py, but this file is named check_flowy_data_records.py. Update the usage text so users can run the script as documented.

Suggested change
python scan_flowy_data_record.py
python scan_flowy_data_record.py --base /path/to/run
python check_flowy_data_records.py
python check_flowy_data_records.py --base /path/to/run

Comment on lines +23 to +29
# get env var
DATA_DIR = os.getenv("DATA_DIR")

DEFAULT_BASE = (
f"{DATA_DIR}/output/"
"multiplier_4bi_8bo_permuti_flowy/flowy_trans_run_12chains_3000steps_gen_iter0"
)

Copilot AI Feb 25, 2026


DEFAULT_BASE is built from the DATA_DIR environment variable without validating it. If DATA_DIR is unset, the default becomes None/output/..., which will later produce confusing "No directories matched" errors. Consider requiring --base when DATA_DIR is missing, or defaulting to a sensible path (e.g., current directory) with a clear error message.

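One way to act on this review comment is to fail fast when DATA_DIR is unset instead of silently building a None/output/... path. The helper below is a hypothetical sketch (resolve_base is not a function in this PR; the path fragment is copied from the snippet above):

```python
import os
import sys


def resolve_base(cli_base=None):
    """Resolve the scan base directory, failing loudly if DATA_DIR is unset."""
    if cli_base is not None:
        return cli_base  # explicit --base always wins
    data_dir = os.getenv("DATA_DIR")
    if data_dir is None:
        # Exit with a clear message rather than globbing under "None/output/..."
        sys.exit("DATA_DIR is not set; pass --base explicitly.")
    return (
        f"{data_dir}/output/"
        "multiplier_4bi_8bo_permuti_flowy/flowy_trans_run_12chains_3000steps_gen_iter0"
    )
```

Wiring this into the argparse default (e.g. `args.base = resolve_base(args.base)`) keeps the CLI behavior unchanged when DATA_DIR is present.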
for k, c in items:
bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
bar = "#" * bar_len
bar = ''

Copilot AI Feb 25, 2026


text_hist computes a bar (bar = "#" * bar_len) but then immediately overwrites it with an empty string, so the histogram output will never include bars. Remove the overwrite (or remove the bar computation entirely) so the function behavior matches its intent.

Suggested change
bar = ''

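With the stray overwrite removed, text_hist would render the bars it computes. A hypothetical reconstruction of the intended behavior (the exact row format of the original function is assumed):

```python
def text_hist(items, bar_width=40):
    """Render (key, count) pairs as a simple '#' bar histogram."""
    max_count = max((c for _, c in items), default=0)
    lines = []
    for k, c in items:
        # Scale each count to the bar width; zero-max input yields empty bars.
        bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
        lines.append(f"{k:>12} {c:>6} {'#' * bar_len}")
    return "\n".join(lines)
```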
Comment on lines +232 to +234
# make sure logs directory exists
Path(os.environ.get("DATA_DIR")).joinpath("logs").mkdir(parents=True, exist_ok=True)
_cmd += f" &> {Path(os.environ.get("DATA_DIR")) / "logs" / (Path(script_object.name).name + ".out")}"

Copilot AI Feb 25, 2026


The log redirection line has mismatched quotes inside the f-string (nested "DATA_DIR" / "logs" quotes), which will raise a SyntaxError and prevent importing/running this module. Build the log path separately (or use single quotes for the env var lookup) and then interpolate it, and consider shell-quoting the path if it may contain special characters.

Comment on lines +423 to 430
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
#SlurmDispatcher.cancel_job(job_id)
#jobs[job_index]["status"] = "CANCELLED"
# Let the normal failure/cancel flow handle retries/termination
sleep(5.0)
continue
if status and status not in ["PENDING", "RUNNING"]:

Copilot AI Feb 25, 2026


dispatch_all_jobs no longer cancels jobs that are pending due to ReqNodeNotAvail/UnavailableNodes even when cancel_on_unavailable=True. Since these jobs can remain PENDING indefinitely, the while pending: loop can hang forever and retries will never trigger. Either restore the cancel-and-mark-cancelled behavior, or add a bounded wait/timeout (then cancel/mark failed) so the dispatcher can make progress.

Suggested change
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
#SlurmDispatcher.cancel_job(job_id)
#jobs[job_index]["status"] = "CANCELLED"
# Let the normal failure/cancel flow handle retries/termination
sleep(5.0)
continue
if status and status not in ["PENDING", "RUNNING"]:
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
SlurmDispatcher.cancel_job(job_id)
jobs[job_index]["status"] = "CANCELLED"
# Let the normal failure/cancel flow handle retries/termination
continue
if status and status not in ["PENDING", "RUNNING"]:

Comment on lines 480 to 490
if status == "PENDING" and cancel_on_unavailable:
sq_state, sq_reason = SlurmDispatcher.get_squeue_state_and_reason(job_id)
if sq_state == "PENDING" and sq_reason is not None:
reason_lc = sq_reason.lower()
if "reqnodenotavail" in reason_lc or "unavailablenodes" in reason_lc:
logger.warning(
f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang."
)
SlurmDispatcher.cancel_job(job_id)
jobs[script]["status"] = "CANCELLED"
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Not cancelling")
#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
#SlurmDispatcher.cancel_job(job_id)
#jobs[script]["status"] = "CANCELLED"
sleep(5.0)
continue

Copilot AI Feb 25, 2026


Same issue as dispatch_all_jobs: dispatch_all_scripts logs that an unschedulable job is pending but does not cancel/mark it, so the dispatcher can loop forever with pending never shrinking. If the intent is to "wait", add a maximum pending duration/backoff and then cancel+retry (or fail) to avoid infinite hangs.

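The bounded-wait approach both comments suggest can be sketched like this: track when each job was first seen as unschedulable, keep waiting until a budget elapses, then cancel and mark it so the retry/failure flow can proceed. Everything here is hypothetical scaffolding (the budget, the `first_seen` dict, and the callback wiring are not in the PR):

```python
import time

# Assumed budget before giving up on an unschedulable job; tune per cluster.
MAX_UNSCHEDULABLE_SECS = 600.0


def handle_unschedulable(job_id, jobs, key, first_seen, cancel_job, now=None):
    """Return True if the job was cancelled, False if we keep waiting.

    first_seen maps job_id -> monotonic timestamp of the first unschedulable
    sighting; cancel_job is e.g. SlurmDispatcher.cancel_job.
    """
    now = time.monotonic() if now is None else now
    started = first_seen.setdefault(job_id, now)
    if now - started >= MAX_UNSCHEDULABLE_SECS:
        cancel_job(job_id)
        jobs[key]["status"] = "CANCELLED"
        first_seen.pop(job_id, None)  # reset in case the job is retried
        return True
    return False
```

In the dispatch loops, a False return would be followed by the existing `sleep(5.0); continue`, while True falls through to the normal cancelled-job handling.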
#SBATCH --error="/home/%u/slurm_logs/genial/sbatch_error/genial_flowy_%j_%N_$timestamp.log"
# Move to working directory
cd $$HOME/proj/genial
cd $$HOME/proj/GENIAL

Copilot AI Feb 25, 2026


This changes the working directory to $HOME/proj/GENIAL, but other SLURM launch scripts in this repo use $HOME/proj/genial (lowercase). On case-sensitive filesystems this will fail at runtime. Make the path consistent (or derive it from $SRC_DIR / repository location) rather than hardcoding a different-cased directory name.

Suggested change
cd $$HOME/proj/GENIAL
cd "$HOME/proj/genial"
