Description
In release 1.4.2 of NHC, I ran into trouble with the SLURM implementation of check_ps_unauth_users(): it kills interactive jobs. (Jobs submitted via sbatch are left alone.)
Undesired/unexpected behavior
check_ps_unauth_users: foo's "sleep" process is unauthorized. (PID 12347)
check_ps_unauth_users: foo's "/bin/bash" process is unauthorized. (PID 12372)
Upon closer inspection, this appeared to be a result of how the list of users with currently running jobs was calculated:
STAT_OUT=$(${STAT_CMD:-/usr/bin/stat} ${STAT_FMT_ARGS:--c} %U $JOBFILE_PATH/job*/slurm_script)
Details
Job files like slurm_script are not created when interactive jobs are launched. Instead, there is a file with the node's hostname and job ID as a part of the filename:
|-- compute-0-2_1084.4294967294
|-- cred_state
|-- cred_state.old
`-- job01084
    `-- slurm_script
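To illustrate why the glob misses interactive jobs, here is a minimal sketch that counts the files the pattern job*/slurm_script matches under a spool directory. The function name count_job_scripts and the temporary directory are hypothetical, used only for illustration; they are not part of NHC.

```shell
# Hypothetical helper (not part of NHC): count the files that the
# job*/slurm_script glob matches under the given spool directory.
count_job_scripts() {
    _cjs_count=0
    for _cjs_file in "$1"/job*/slurm_script; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$_cjs_file" ]; then
            _cjs_count=$((_cjs_count + 1))
        fi
    done
    echo "$_cjs_count"
}

# Mimic a node that is running only an interactive job.
spool=$(mktemp -d)
touch "$spool/compute-0-2_1084.4294967294" "$spool/cred_state"
count_job_scripts "$spool"    # no job*/slurm_script exists yet

# Mimic a job submitted via sbatch.
mkdir "$spool/job01084"
touch "$spool/job01084/slurm_script"
count_job_scripts "$spool"    # the glob now matches one file
rm -rf "$spool"
```

With only the interactive job's file present, the glob matches nothing, so an owner lookup based on it would not find that user.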
Potential solution
I successfully addressed this locally using squeue, which can be configured to report just usernames:
STAT_OUT=$(squeue -w localhost --noheader -o %u)
This should report the usernames of all users with jobs running on the local node. (If a user is running jobs, but not on this node, any processes she has on localhost are unauthorized.)
This has been tested with SLURM 15.08.7.
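A slightly more defensive variant could look like the sketch below. It is untested against a real cluster and is not the NHC implementation. The function name nhc_sketch_job_users and the SQUEUE_CMD override variable are assumptions introduced here for illustration. It removes duplicate usernames and returns a failure status if squeue itself fails, so that an empty list caused by an scheduler error is not mistaken for "no users have jobs here".

```shell
# Hypothetical sketch (not part of NHC): list the unique users that have
# jobs on the local node. SQUEUE_CMD is an assumed override, mainly so
# the function can be exercised without a running SLURM installation.
nhc_sketch_job_users() {
    # Capture the output first so a squeue failure is not hidden
    # behind the exit status of a pipeline.
    _sketch_users=$(${SQUEUE_CMD:-squeue} -w localhost --noheader -o %u) || return 1
    if [ -n "$_sketch_users" ]; then
        # One user may have several jobs on the node; keep each name once.
        printf '%s\n' "$_sketch_users" | sort -u
    fi
}
```

A caller could then distinguish the two cases, for example by skipping the check entirely when the function returns non-zero instead of treating every process as unauthorized.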
Please let me know if I have overlooked something or if you have any questions.
Thanks!
(NHC is awesome; thank you!)