Templated batch commands in gcm run #68
also fixed bad location for tick and removed all qalters
needed to use single quotes and get rid of the backslashes to properly expand commands
we start with .FAILED, change to DONE when done, and mark WHAT failed when GCM or ODAS failures occur
fixed header in gcm_run.j. @BATCH_NAME was an unused variable. Changed to @BATCH_JOBNAME and removed -o since it was included in var.
the vars that will be inserted have the proper spacing set already
@Wesley-J-Davis I think there are still a few issues. The first one I see is this: which leads to in my testing: The issue is that I'm thinking you might want to only add I then get a couple more odd lines: I think the first is because I'm not running EMIP? So I'll ignore. But the second one I think is from: which is what I get at NAS. I asked Claude a bit and it says:
The "true no-op" is it saying that at NAS that |
… correctly NAS (PBS) doesn't support variable expansion in the job name with the %j. Adjusting the code so that it gets written in properly over there.
I've made adjustments in gcm_run and gcm_setup that populate the initial output name depending on the SITE. At NCCS, it will include the %j to dynamically write the job number into the output listing file name. At NAS, the %j is left off completely since PBS doesn't support that type of variable expansion. As for the qalter -N at NAS, are you saying to just forget about renaming jobs at NAS because that capability is not supported, and have those lines disappear entirely?
@Wesley-J-Davis Yeah. Probably. Since PBS can't change the names of running jobs, you might as well just not support that behavior there. I think it's probably safe since @qvis runs S2Sv3 operations out at NAS in a different way.
removed the . in between run_n and batch_outputname_amendment, included it in the string that will get pasted in there. done in support of removing job rename function for experiments at NAS
changed batch outputname amendment to properly name output file, keeping relevant info separated by periods. removed batch_change_outputname command and replaced with empty string for NAS.
I removed the batch_change_jobname commands in gcm_setup and replaced them with an empty string that will get written into the gcm_run.j files created at NAS, so no job renames will be attempted at NAS. I also mildly altered the batch parameters in the gcm_run.j template, removing a . and inserting said . into the proper lines in gcm_setup. At NAS the suffix will end up being '.FAILED' to start, and at NCCS it will end up being '.%j.FAILED' to start.
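For anyone following along, the site-dependent suffix described in this comment can be sketched roughly as below. This is written in POSIX sh purely for illustration (gcm_setup itself is csh), and the function name `initial_listing_suffix` is hypothetical, not something from the PR.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): pick the initial listing-file
# suffix for a site, using the two values quoted in the comment above.
# Slurm (NCCS) expands %j to the job ID in output file names; PBS (NAS)
# does not, so the %j piece is left out there.
initial_listing_suffix() {
    site="$1"
    case "$site" in
        NCCS) printf '%s\n' '.%j.FAILED' ;;
        NAS)  printf '%s\n' '.FAILED' ;;
        *)    printf 'unknown site: %s\n' "$site" >&2; return 1 ;;
    esac
}

initial_listing_suffix NCCS
initial_listing_suffix NAS
```

Keeping the leading `.` inside the returned string mirrors the change described above, where the separator moved out of the gcm_run.j template and into the value that gets pasted in.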
amolod left a comment
Wesley, did you try this at NAS? Perhaps worth doing?
Let me give this a try. I already have something set up for this...
Okay. Some observations.
Maybe at NAS we shouldn't append the |
Seems like I fouled that up. I altered the job rename command for NAS to be as follows: I had failed to include the correct PBS commands for the job rename at NAS; that's been resolved. I also removed the .FAILED from the initial listing. It should now show the status of the job in the job name as it runs.
The purpose of the .FAILED suffix is to readily identify a failed logfile in a pile of logfiles. It's standard for the ops team's listing files, and the idea is to remove the suffix once the run successfully completes. I can take it or leave it at NAS. I don't actually think it's needed over there, so I removed it.
setenv BATCH_JOBNAME "PBS -N " # PBS Syntax for job name
setenv BATCH_OUTPUTNAME "PBS -o " # PBS Syntax for job output name
setenv BATCH_CHANGE_JOBNAME ''
setenv BATCH_CHANGE_JOBNAME 'qalter -N ${EXPID}.${RUN_STATUS} ${PBS_JOBID}'
@Wesley-J-Davis I'm not sure this works at NAS. As I said above, you can't run qalter -N on PBS Pro.
The man page confirms it at lines 74-77:
If a job is running, the only resources that can be modified are mppnodes, mppt, cput, walltime, min_walltime, and max_walltime.
And critically at lines 243-244:
If any of the modifications to a job fails, none of the job's attributes is modified.
So -N (rename) is simply not supported on a running job in PBS Pro — the man page doesn't list it among the allowed running-job modifications. The true no-op we already put in is the correct fix. qalter -N will never work on a running job at NAS regardless of what arguments you pass it.
@mathomp4 Thank you for the clarity on what is and isn't possible over at NAS with PBS. I've adjusted the code to remove the job rename commands entirely at NAS.
At NAS, it is impossible to run a batch command that changes the job name while the job is running. PBS simply does not support it as per the PBS man page. From @mathomp4:
```
The man page confirms it at lines 74-77:
If a job is running, the only resources that can be modified are mppnodes, mppt, cput, walltime, min_walltime, and max_walltime.
And critically at lines 243-244:
If any of the modifications to a job fails, none of the job's attributes is modified.
```
Replaced job name change with empty string for NAS operations.
The issue at hand was that the automated job name and listing name commands were failing on NAS.
New job and listing name change templates were added to the command template section in gcm_setup, where the environment variables that get stitched into the code via a sed sequence are set.
These new commands are written to their own files and then edited in via the same process as other environment variables with complex compositions.
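The exact sed sequence in gcm_setup is not shown in this thread, so the following is only a sketch of the general idea: keep a command in its own file and read it into the template in place of a placeholder line. It is written in POSIX sh rather than csh; the file names, the stand-in `echo` command, and the assumption that `@BATCH_CHANGE_JOBNAME` sits on a line of its own are all illustrative.

```shell
#!/bin/sh
# Illustrative only: file names and the placeholder layout are assumptions,
# not copied from gcm_setup.
workdir=$(mktemp -d)   # left in place afterwards so the result can be inspected

# Template with the placeholder on a line of its own.
cat > "$workdir/gcm_run.tmpl" <<'EOF'
setenv RUN_STATUS RUNNING
@BATCH_CHANGE_JOBNAME
echo "model step"
EOF

# Stand-in command. Single quotes keep ${...} literal, so the variables
# expand when the generated script runs, not when it is written.
printf '%s\n' 'echo "status: ${EXPID}.${RUN_STATUS}"' > "$workdir/change_jobname.cmd"

# Queue the command file to be emitted after the placeholder line,
# then delete the placeholder line itself.
sed -e "/^@BATCH_CHANGE_JOBNAME\$/r $workdir/change_jobname.cmd" \
    -e '/^@BATCH_CHANGE_JOBNAME$/d' \
    "$workdir/gcm_run.tmpl" > "$workdir/gcm_run.j"

cat "$workdir/gcm_run.j"
```

Reading the command from a file with sed's `r` avoids escaping characters such as `/`, `&`, or `$` that would otherwise need care in an `s///` replacement, which is in the same spirit as the single-quote fix mentioned in the commits. For the NAS case, an empty command file would simply remove the placeholder line.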
The listing file name is changed only at the very beginning (where it starts as .FAILED), upon failure codes that identify where the code failed, and at the very end upon successful completion.
The job name is changed more frequently, so that the qcheck monitoring software shows where in the process the code currently is.
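As a rough illustration of the listing-name lifecycle described above, the sketch below swaps the trailing status field of a listing file name. It is POSIX sh rather than csh, the helper `relabel_listing` and the sample file name are hypothetical, and only the `FAILED` and `DONE` labels come from this PR; `GCM_FAILED` is a made-up example of a label that says what failed.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): replace the last dot-separated
# field of a listing file name with a new status label.
relabel_listing() {
    old="$1"
    new_status="$2"
    base="${old%.*}"                      # drop the trailing status field
    printf '%s\n' "${base}.${new_status}"
}

# Made-up starting name: experiment, run number, job ID, initial status.
start="myexp.run_1.12345.FAILED"

relabel_listing "$start" DONE         # successful completion
relabel_listing "$start" GCM_FAILED   # made-up label naming the failing component
```

Starting from `.FAILED` means a job that dies without reaching any rename step still leaves a listing that is marked as failed.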