Dev/slurm reliablity updates #3
Conversation
…NIAL into dev/slurm_reliablity_updates
Pull request overview
This PR refactors SLURM dispatch configuration to better control partitions/nodes, adds stdout/stderr capture for SLURM-launched commands, consolidates Flowy output.txt artifacts into a single debug file, and introduces a helper script to analyze Flowy parquet records.
Changes:
- Replace per-task SLURM config maps with a unified `__task_resources__` structure and simplify nodelist handling.
- Append command-level stdout/stderr redirection to per-script log files under `$DATA_DIR/logs`.
- Collect Flowy `output.txt` outputs into `all_outputs.txt` and add a standalone parquet scanning script.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/genial/loop/full_run_v2.py | Refactors SLURM resources, adds logging redirection, and changes pending-job handling logic. |
| src/genial/ext_plugins/flowy/flowy_launcher_helper.py | Aggregates per-run output.txt files into a single debug artifact. |
| scripts/slurm_scripts/sbatch_dispatch_slurm_temp.sh | Updates SLURM template working directory path. |
| scripts/custom_scripts/check_flowy_data_records.py | Adds a CLI tool to summarize flowy_data_record.parquet content across runs. |
Comments suppressed due to low confidence (1)
src/genial/loop/full_run_v2.py:784
- Typo in the error message: "launher" should be "launcher" to make the exception clearer/searchable.
```python
raise NotImplementedError(f"bulk_flow_dirname {kwargs['bulk_flow_dirname']} is not implemented for the launher")
```
```
python scan_flowy_data_record.py
python scan_flowy_data_record.py --base /path/to/run
```
The module docstring usage examples refer to scan_flowy_data_record.py, but this file is named check_flowy_data_records.py. Update the usage text so users can run the script as documented.
Suggested change:
```diff
-python scan_flowy_data_record.py
-python scan_flowy_data_record.py --base /path/to/run
+python check_flowy_data_records.py
+python check_flowy_data_records.py --base /path/to/run
```
```python
# get env var
DATA_DIR = os.getenv("DATA_DIR")

DEFAULT_BASE = (
    f"{DATA_DIR}/output/"
    "multiplier_4bi_8bo_permuti_flowy/flowy_trans_run_12chains_3000steps_gen_iter0"
)
```
DEFAULT_BASE is built from the DATA_DIR environment variable without validating it. If DATA_DIR is unset, the default becomes None/output/..., which will later produce confusing "No directories matched" errors. Consider requiring --base when DATA_DIR is missing, or defaulting to a sensible path (e.g., current directory) with a clear error message.
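One way to follow this suggestion is to fail loudly when `DATA_DIR` is unset and no explicit base was given. This is a minimal sketch; the helper name `resolve_base` and its error message are hypothetical, not part of the PR:

```python
import os
from pathlib import Path

def resolve_base(cli_base=None):
    """Pick the scan base directory, refusing to build a path from a missing env var."""
    if cli_base is not None:
        return Path(cli_base)
    data_dir = os.getenv("DATA_DIR")
    if data_dir is None:
        # Fail with a clear message instead of silently producing "None/output/..."
        raise SystemExit("DATA_DIR is not set; pass --base /path/to/run explicitly.")
    return Path(data_dir) / "output"
```

The CLI would then call `resolve_base(args.base)` once, before any globbing, so the "No directories matched" path is never reached with a bogus default.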
```python
for k, c in items:
    bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
    bar = "#" * bar_len
    bar = ''
```
text_hist computes a bar (bar = "#" * bar_len) but then immediately overwrites it with an empty string, so the histogram output will never include bars. Remove the overwrite (or remove the bar computation entirely) so the function behavior matches its intent.
Suggested change:
```diff
-    bar = ''
```
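With the overwrite removed, the intended behavior can be sketched as a self-contained function. The name `text_hist` comes from the review comment; the column layout and sorting are assumptions, since only the loop body is visible in the diff:

```python
def text_hist(counts, bar_width=40):
    """Render one '#' bar per key, scaled to the largest count."""
    max_count = max(counts.values(), default=0)
    lines = []
    for k, c in sorted(counts.items()):
        bar_len = int(round((c / max_count) * bar_width)) if max_count > 0 else 0
        bar = "#" * bar_len  # keep the computed bar; do not overwrite it with ''
        lines.append(f"{k:>12} {c:>6} {bar}")
    return lines
```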
```python
# make sure logs directory exists
Path(os.environ.get("DATA_DIR")).joinpath("logs").mkdir(parents=True, exist_ok=True)
_cmd += f" &> {Path(os.environ.get("DATA_DIR")) / "logs" / (Path(script_object.name).name + ".out")}"
```
The log redirection line has mismatched quotes inside the f-string (nested "DATA_DIR" / "logs" quotes), which raises a SyntaxError on Python versions before 3.12 and prevents importing/running this module there. Build the log path separately (or use single quotes for the env var lookup) and then interpolate it, and consider shell-quoting the path if it may contain special characters.
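A sketch of the suggested fix, building the path outside the f-string and shell-quoting it. The helper name `redirect_to_log` is hypothetical; `shlex.quote` guards against spaces or other shell metacharacters in the path:

```python
import os
import shlex
from pathlib import Path

def redirect_to_log(cmd, script_name):
    """Append '&> <logfile>' to cmd, with the path built outside the f-string."""
    log_dir = Path(os.environ["DATA_DIR"]) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / (Path(script_name).name + ".out")
    return f"{cmd} &> {shlex.quote(str(log_path))}"
```

Because the path expression never nests inside the f-string, this parses on all supported Python versions.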
```python
logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
#SlurmDispatcher.cancel_job(job_id)
#jobs[job_index]["status"] = "CANCELLED"
# Let the normal failure/cancel flow handle retries/termination
sleep(5.0)
continue
if status and status not in ["PENDING", "RUNNING"]:
```
dispatch_all_jobs no longer cancels jobs that are pending due to ReqNodeNotAvail/UnavailableNodes even when cancel_on_unavailable=True. Since these jobs can remain PENDING indefinitely, the while pending: loop can hang forever and retries will never trigger. Either restore the cancel-and-mark-cancelled behavior, or add a bounded wait/timeout (then cancel/mark failed) so the dispatcher can make progress.
Suggested change:
```diff
-logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Let's wait.")
-#logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
-#SlurmDispatcher.cancel_job(job_id)
-#jobs[job_index]["status"] = "CANCELLED"
+logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
+SlurmDispatcher.cancel_job(job_id)
+jobs[job_index]["status"] = "CANCELLED"
 # Let the normal failure/cancel flow handle retries/termination
-sleep(5.0)
 continue
 if status and status not in ["PENDING", "RUNNING"]:
```
```diff
 if status == "PENDING" and cancel_on_unavailable:
     sq_state, sq_reason = SlurmDispatcher.get_squeue_state_and_reason(job_id)
     if sq_state == "PENDING" and sq_reason is not None:
         reason_lc = sq_reason.lower()
         if "reqnodenotavail" in reason_lc or "unavailablenodes" in reason_lc:
-            logger.warning(
-                f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang."
-            )
-            SlurmDispatcher.cancel_job(job_id)
-            jobs[script]["status"] = "CANCELLED"
+            logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Not cancelling")
+            #logger.warning(f"Job {job_id} appears unschedulable (reason: {sq_reason}). Cancelling to avoid hang.")
+            #SlurmDispatcher.cancel_job(job_id)
+            #jobs[script]["status"] = "CANCELLED"
+            sleep(5.0)
+            continue
```
Same issue as dispatch_all_jobs: dispatch_all_scripts logs that an unschedulable job is pending but does not cancel/mark it, so the dispatcher can loop forever with pending never shrinking. If the intent is to "wait", add a maximum pending duration/backoff and then cancel+retry (or fail) to avoid infinite hangs.
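The bounded-wait idea can be sketched like this. The `get_reason` and `cancel` callables are injected stand-ins for the real `SlurmDispatcher.get_squeue_state_and_reason` / `SlurmDispatcher.cancel_job` calls, and the function name and default timeout are assumptions, not part of the PR:

```python
import time

def wait_or_cancel(job_id, get_reason, cancel, max_pending_s=600.0, poll_s=5.0):
    """Poll a pending job's squeue reason; cancel it once max_pending_s elapses."""
    deadline = time.monotonic() + max_pending_s
    while time.monotonic() < deadline:
        reason = get_reason(job_id)
        if reason is None:
            return "SCHEDULABLE"
        reason_lc = reason.lower()
        if "reqnodenotavail" not in reason_lc and "unavailablenodes" not in reason_lc:
            return "SCHEDULABLE"
        time.sleep(poll_s)
    cancel(job_id)  # bounded: give up instead of looping forever
    return "CANCELLED"
```

A `"CANCELLED"` return would then feed the existing failure/retry flow, so `pending` shrinks even when nodes never come back.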
```diff
 #SBATCH --error="/home/%u/slurm_logs/genial/sbatch_error/genial_flowy_%j_%N_$timestamp.log"
 # Move to working directory
-cd $$HOME/proj/genial
+cd $$HOME/proj/GENIAL
```
This changes the working directory to $HOME/proj/GENIAL, but other SLURM launch scripts in this repo use $HOME/proj/genial (lowercase). On case-sensitive filesystems this will fail at runtime. Make the path consistent (or derive it from $SRC_DIR / repository location) rather than hardcoding a different-cased directory name.
Suggested change:
```diff
-cd $$HOME/proj/GENIAL
+cd "$HOME/proj/genial"
```
refactored src/genial/loop/full_run_v2.py

save stdout/stderr outputs of all flowy synth runs of an encoding in all_outputs.txt (for debugging)

```
flowy_trans_run_12chains_3000steps_gen_iter0/synth_out_1000_1/res_00000000005689$ ls
all_outputs.txt
```

added script to analyze data from flowy synth run