There are several issues related to reporting errors to users by looking at non-job-specific data:
flux-framework/flux-core#7135
flux-framework/flux-core#7136
We would like to parse `resource.eventlog` to associate a drain event with a specific job. However, the current lack of extra drain information makes this inconvenient at best, and likely very hard once a number of corner cases are thrown in.
For example, let's say that housekeeping fails and drains a node. A user or admin may want to associate the housekeeping failure with the job that ran just before it. To do that, we would have to 1) see what time the node was drained and 2) find the last job that finished on that node before that timestamp.
But this is the algorithm only if we know a priori that housekeeping drained the node. The algorithm for associating a drain event from epilog is different. Also, nothing indicates where the drain event came from. If an admin drains a node for maintenance after a job runs, it could appear that the drain came from housekeeping.
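To make the two steps concrete, here is a rough sketch of the lookup. `JobRecord` and `last_job_before_drain` are hypothetical names made up for illustration, not anything in flux-core, and the sketch assumes we already have per-job node lists and end times in hand.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class JobRecord:
    """Hypothetical summary of a finished job (not a Flux data structure)."""
    jobid: str
    nodes: frozenset  # hostnames the job ran on
    t_end: float      # time the job finished, seconds since the epoch


def last_job_before_drain(
    node: str, drain_time: float, jobs: Iterable[JobRecord]
) -> Optional[JobRecord]:
    """Return the job that most recently finished on `node` at or before `drain_time`."""
    # Step 2 from the text: keep only jobs that ran on this node and ended
    # no later than the drain timestamp, then take the latest one.
    candidates = [j for j in jobs if node in j.nodes and j.t_end <= drain_time]
    return max(candidates, key=lambda j: j.t_end, default=None)


jobs = [
    JobRecord("job-a", frozenset({"node1", "node2"}), t_end=100.0),
    JobRecord("job-b", frozenset({"node2"}), t_end=150.0),
    JobRecord("job-c", frozenset({"node1"}), t_end=300.0),
]
match = last_job_before_drain("node1", drain_time=200.0, jobs=jobs)
print(match.jobid if match else None)  # job-a
```

Note that this only gives the right answer when the drain really did come from housekeeping for that job. A manual drain at the same timestamp would match the same job.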
I think we should have additional information in the drain event when that information is available.
- `jobid` - optional field that could tie a drain event to a specific jobid. This could be set automatically in `flux resource drain` if the FLUX_JOB_ID environment variable exists.
- `reporter` - could be housekeeping, epilog, or "user" (command line). For example, if an admin drains a node for some very specific purpose, we'll know it wasn't related to a job per se.
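As a rough illustration of what the extra fields might look like, the sketch below builds a drain event context with the proposed `jobid` and `reporter` keys. Everything here is an assumption for discussion: `drain_context` is a hypothetical helper, the `idset` and `reason` key names are assumed rather than taken from the current event format, and the job ID in the example is a placeholder.

```python
import os
from typing import Mapping, Optional

# Reporter values proposed in this issue.
REPORTERS = {"housekeeping", "epilog", "user"}


def drain_context(
    idset: str,
    reason: str,
    reporter: str = "user",
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build a hypothetical drain event context with the proposed extra fields."""
    if reporter not in REPORTERS:
        raise ValueError(f"unknown reporter: {reporter!r}")
    env = os.environ if environ is None else environ
    # "idset" and "reason" are assumed key names for the existing fields.
    context = {"idset": idset, "reason": reason, "reporter": reporter}
    # jobid is optional: only attach it when FLUX_JOB_ID is set, e.g. when
    # the drain is requested from within a job's epilog or housekeeping.
    jobid = env.get("FLUX_JOB_ID")
    if jobid:
        context["jobid"] = jobid
    return context


# Drain requested by an epilog running on behalf of a job (placeholder ID).
from_epilog = drain_context(
    "3", "epilog failed", reporter="epilog", environ={"FLUX_JOB_ID": "1234"}
)
# Drain requested by an admin on the command line, with no job involved.
from_admin = drain_context("3", "maintenance", environ={})
print(from_epilog)
print(from_admin)
```

With both fields present, a consumer of the eventlog could skip the timestamp matching entirely when `jobid` is set, and could ignore drains with `reporter` set to "user" when looking for job-related failures.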
Maybe some other info would be useful too. These are the only bits I've thought of so far.