Templated batch commands in gcm run #68

Open
Wesley-J-Davis wants to merge 20 commits into main from templated-batch-commands-in-gcm_run

Conversation

@Wesley-J-Davis
Contributor

The automated job-name and listing-name commands were failing at NAS.

New job-name and listing-name change templates were added to the command template section of gcm_setup, where the environment variables are set that are later stitched into the code via a sed sequence.

These new commands are written to their own files and then edited in through the same process used for other environment variables with complex compositions.
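
For readers unfamiliar with the stitching step, here is a minimal Python sketch of the idea. It is not the gcm_setup code, which does this with a sed sequence in csh. The tokens `@BATCH_JOBNAME`, `@BATCH_OUTPUTNAME` and `@RUN_N` and the `PBS -N ` / `PBS -o ` values appear elsewhere in this PR; the helper `stitch`, the exact header lines and the experiment name are assumptions made for illustration.

```python
def stitch(template: str, values: dict[str, str]) -> str:
    """Replace each @TOKEN in the template with its value.

    Longer tokens are replaced first so that a short token can never
    clobber the prefix of a longer one.
    """
    for token in sorted(values, key=len, reverse=True):
        template = template.replace("@" + token, values[token])
    return template


# Hypothetical NAS (PBS) values; note the trailing space is part of the value,
# so the template itself needs no extra spacing.
nas_values = {
    "BATCH_JOBNAME": "PBS -N ",
    "BATCH_OUTPUTNAME": "PBS -o ",
    "RUN_N": "test-S2S-C1_RUN",
}

# Hypothetical two-line header in the style of the gcm_run.j template.
header = "#@BATCH_JOBNAME@RUN_N\n#@BATCH_OUTPUTNAME@RUN_N.FAILED"

stitched = stitch(header, nas_values)
print(stitched)
```

Because the directive values carry their own trailing space, the template can place tokens back to back, which is the spacing convention the commits in this PR rely on.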

The listing file name is changed at only three points: at the very beginning (where it starts as a .FAILED), when a failure code identifies where the run failed, and at the very end upon successful completion.

The job name is changed more frequently, so that the qcheck monitoring software shows where in the process the run currently is.

Also fixed the bad location for tick and removed all qalters.
Needed to use single quotes and remove the backslashes so the commands expand properly.
We start with .FAILED.

We change to DONE when done.

We mark WHAT failed when GCM or ODAS failures occur.
Fixed the header in gcm_run.j: @BATCH_NAME was an unused variable, so it was changed to @BATCH_JOBNAME, and -o was removed since it is already included in the variable.
The variables that will be inserted already have the proper spacing set.
@Wesley-J-Davis Wesley-J-Davis requested a review from a team as a code owner March 26, 2026 16:51
@Wesley-J-Davis Wesley-J-Davis linked an issue Mar 26, 2026 that may be closed by this pull request
@mathomp4
Member

@Wesley-J-Davis I think there are still a few issues. The first one I see is this:

#@BATCH_OUTPUTNAME@RUN_N.%j.FAILED

which leads to in my testing:

test-S2S-C1_RUN.%j.FAILED

The issue is that %j is a SLURM-ism. PBS does not do variable expansion on #PBS lines.

I'm thinking you might want to add %j only at NCCS? Or I guess .%j at NCCS and an empty string at NAS?

I then get a couple more odd lines:

Error; invalid month (0) in initial date: @RSTDATE; at /nobackupp18/mathomp4/Models/GEOS-S2S-3-Athena/Linux/bin/Manipulate_time.pm line 731.
qalter: No resources requested 97757.pbs06a.hsn.ath.nas.nasa.gov

I think the first is because I'm not running EMIP? So I'll ignore it. But the second one, I think, is from:

qalter -N ${EXPID}.${RUN_STATUS} ${PBS_JOBID}

which is what I get at NAS.

I asked Claude a bit and it says:

The man page confirms it at lines 74-77:

If a job is running, the only resources that can be modified are mppnodes, mppt, cput, walltime, min_walltime, and max_walltime.

And critically at lines 243-244:

If any of the modifications to a job fails, none of the job's attributes is modified.

So -N (rename) is simply not supported on a running job in PBS Pro — the man page doesn't list it among the allowed running-job modifications. The true no-op we already put in is the correct fix. qalter -N will never work on a running job at NAS regardless of what arguments you pass it.

By "true no-op" it means that at NAS the qalter line should simply be true, since you can't rename a running job.

… correctly

NAS (PBS) doesn't support variable expansion in the job name with %j. Adjusting the code so that it gets written properly over there.
@Wesley-J-Davis
Contributor Author

I've made adjustments in gcm_run and gcm_setup that populate the initial output name depending on the SITE.

If NCCS, it will include the %j to dynamically write the job number into the output listing file.

If NAS, the %j is left off completely since PBS doesn't support that type of variable expansion.

As for the qalter -N at NAS, are you saying to just forget about renaming jobs at NAS because that capability is not supported and have those lines just disappear entirely?

@mathomp4
Member

As for the qalter -N at NAS, are you saying to just forget about renaming jobs at NAS because that capability is not supported and have those lines just disappear entirely?

@Wesley-J-Davis Yeah. Probably. Since PBS can't change the names of running jobs, you might as well just not support that behavior there.

I think it's probably safe since @qvis runs S2Sv3 operations out at NAS in a different way.

Removed the . between run_n and batch_outputname_amendment and included it in the string that gets pasted in there.

Done in support of removing the job rename function for experiments at NAS.
Changed the batch outputname amendment to properly name the output file, keeping relevant info separated by periods.

Removed the batch_change_outputname command and replaced it with an empty string for NAS.
@Wesley-J-Davis
Contributor Author

I removed the batch_change_jobname commands in gcm_setup and replaced them with an empty string that gets written into the gcm_run.j files created at NAS. So no job renames will be attempted at NAS.

I also mildly altered the batch parameters in the gcm_run.j template, removing a . and inserting that . into the proper lines in gcm_setup.

At NAS the amendment will end up being '.FAILED' to start, and at NCCS it will end up being '.%j.FAILED' to start.
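
As a rough illustration of the site-dependent choice described in this comment, the following Python sketch returns the initial suffix per site. The two suffix strings and the site names come from the comment above; the function name `output_name_amendment` and the experiment name are hypothetical, and the real logic lives in csh inside gcm_setup.

```python
def output_name_amendment(site: str) -> str:
    """Initial suffix appended to the listing file name for a given site."""
    if site == "NCCS":
        # SLURM expands %j to the job ID in the output file name.
        return ".%j.FAILED"
    if site == "NAS":
        # PBS does not expand variables on #PBS lines, so %j is left off.
        return ".FAILED"
    raise ValueError(f"unknown site: {site}")


for site in ("NCCS", "NAS"):
    print(site, "test-S2S-C1_RUN" + output_name_amendment(site))
```

Keeping the leading period inside the amendment, rather than in the template, is what lets one site contribute a longer suffix than the other without leaving a stray separator behind.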

Collaborator

@amolod amolod left a comment

Wesley did you try this at NAS? Perhaps worth doing?

@mathomp4
Member

Wesley did you try this at NAS? Perhaps worth doing?

Let me give this a try. I already have something set up for this...

@mathomp4 mathomp4 added the Skip Changelog Skips the Changelog Enforcer label Apr 13, 2026
@mathomp4
Member

Okay. Some observations.

  1. It submits and runs just fine, so yay.
  2. When you make a new experiment, the output job file name is created like:
     #PBS -o test-S2S-C1_RUN.FAILED
  3. All the qalter commands probably do nothing? I don't see anything happening.
  4. When the job successfully ends, the output file is still called test-S2S-C1_RUN.FAILED

Maybe at NAS we shouldn't append the .FAILED suffix? Not sure what its purpose is...

@Wesley-J-Davis
Contributor Author

Wesley-J-Davis commented Apr 14, 2026

All the qalter commands probably do nothing? I don't see anything happening.

Seems like I fouled that up. I altered the job rename command for NAS to be as follows:

setenv BATCH_CHANGE_JOBNAME 'qalter -N ${EXPID}.${RUN_STATUS} ${PBS_JOBID}'
setenv BATCH_CHANGE_OUTPUTNAME 'qalter -o ${EXPDIR}/${EXPID}.${qdate}.${PBS_JOBID}.${RUN_STATUS} ${PBS_JOBID}'

I had failed to include the correct PBS commands for the job rename at NAS; that has been resolved. I also removed the .FAILED from the initial listing. The job name should now show the status of the job as it runs.
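
To show what the stored command string turns into once its variables are filled in, here is a small Python sketch using `string.Template`, whose `${NAME}` syntax happens to match the placeholders above. It only models the expansion; it says nothing about whether PBS will accept the resulting command on a running job. The command string is the one quoted above and the job ID is the one from the earlier qalter message, while the `EXPID` and `RUN_STATUS` values are assumptions.

```python
from string import Template

# Command template as set in gcm_setup; single quotes in csh keep the
# ${...} references literal until the command is actually run.
change_jobname = Template("qalter -N ${EXPID}.${RUN_STATUS} ${PBS_JOBID}")

# Hypothetical run-time values (RUN_STATUS in particular is made up here).
runtime_env = {
    "EXPID": "test-S2S-C1",
    "RUN_STATUS": "GCM_RUNNING",
    "PBS_JOBID": "97757.pbs06a.hsn.ath.nas.nasa.gov",
}

expanded = change_jobname.substitute(runtime_env)
print(expanded)
```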

When the job successfully ends, the output file is still called test-S2S-C1_RUN.FAILED
Maybe at NAS we shouldn't append the .FAILED suffix? Not sure what its purpose is...

The purpose of the .FAILED suffix is to readily identify a failed logfile in a pile of logfiles. It's standard for the ops team's listing files, and the idea is to remove the suffix once the run successfully completes.
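
A quick Python sketch of why the suffix is handy: anything that never got renamed can be picked out with a single glob. The file names and the helper `failed_listings` are hypothetical and only mimic the naming pattern discussed in this PR.

```python
import tempfile
from pathlib import Path


def failed_listings(directory: Path) -> list[str]:
    """Return the names of listing files that still carry the .FAILED suffix."""
    return sorted(p.name for p in directory.glob("*.FAILED"))


# Build a throwaway directory with a mix of hypothetical listing files.
with tempfile.TemporaryDirectory() as tmp:
    logdir = Path(tmp)
    for name in (
        "exp1.20260101.FAILED",
        "exp1.20260102.DONE",
        "exp2.20260101.FAILED",
    ):
        (logdir / name).touch()
    failed = failed_listings(logdir)

print(failed)
```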

I can take it or leave it at NAS. I don't actually think it's needed over there, so I removed it.

setenv BATCH_OUTPUTNAME_AMENDMENT ""

Comment thread src/Applications/GEOSgcm_App/gcm_setup Outdated
setenv BATCH_JOBNAME "PBS -N " # PBS Syntax for job name
setenv BATCH_OUTPUTNAME "PBS -o " # PBS Syntax for job output name
setenv BATCH_CHANGE_JOBNAME ''
setenv BATCH_CHANGE_JOBNAME 'qalter -N ${EXPID}.${RUN_STATUS} ${PBS_JOBID}'
Member

@Wesley-J-Davis I'm not sure this works at NAS. As I said above, you can't run qalter -N on PBS Pro.

The man page confirms it at lines 74-77:

If a job is running, the only resources that can be modified are mppnodes, mppt, cput, walltime, min_walltime, and max_walltime.

And critically at lines 243-244:

If any of the modifications to a job fails, none of the job's attributes is modified.

So -N (rename) is simply not supported on a running job in PBS Pro — the man page doesn't list it among the allowed running-job modifications. The true no-op we already put in is the correct fix. qalter -N will never work on a running job at NAS regardless of what arguments you pass it.

Contributor Author

@mathomp4 Thank you for the clarity on what is and isn't possible over at NAS with PBS. I've adjusted the code to remove the job rename commands entirely at NAS.

At NAS, it is impossible to run a batch command that changes the job name while the job is running. PBS simply does not support it as per the PBS man page. 

From @mathomp4 
```
The man page confirms it at lines 74-77:

If a job is running, the only resources that can be modified are mppnodes, mppt, cput, walltime, min_walltime, and max_walltime.

And critically at lines 243-244:

If any of the modifications to a job fails, none of the job's attributes is modified.
```

Replaced job name change with empty string for NAS operations.

Labels

Skip Changelog Skips the Changelog Enforcer

Development

Successfully merging this pull request may close these issues.

Current gcm_run.j is broken at NAS

3 participants