rfc44: support jobid and possibly other fields in drain events #509

@chu11

There are several issues related to reporting errors to users by looking at non-job-specific data:

flux-framework/flux-core#7135
flux-framework/flux-core#7136

We would like to parse resource.eventlog to associate a drain event with a specific job. However, the current lack of extra drain information makes this inconvenient at best, and likely very hard once a number of corner cases are thrown in.

For example, let's say that housekeeping fails and drains a node. A user or admin may want to associate the housekeeping failure with the job that ran just before it. To do that, we would have to 1) see what time the node was drained and 2) find the last job that finished on that node before that timestamp.

But this algorithm only works if we know a priori that housekeeping drained the node. The algorithm for associating a drain event from the epilog is different. Also, nothing indicates where the drain event came from. If an admin drains a node for maintenance right after a job runs, the drain could appear to have come from housekeeping.
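To make the housekeeping case concrete, here is a minimal sketch of the two-step lookup described above. The JobRecord type, its field names, and last_job_before_drain() are hypothetical and only for illustration; they are not flux-core APIs, and a real implementation would have to pull this data from the job and resource eventlogs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical summary of a finished job (not a flux-core type)."""
    jobid: str
    ranks: frozenset  # broker ranks the job ran on
    t_end: float      # time the job finished on those ranks


def last_job_before_drain(
    jobs: Iterable[JobRecord], rank: int, drain_time: float
) -> Optional[JobRecord]:
    """Return the most recently finished job on `rank` at or before `drain_time`."""
    candidates = [
        job for job in jobs
        if rank in job.ranks and job.t_end <= drain_time
    ]
    # max() with default=None handles the case where no job qualifies.
    return max(candidates, key=lambda job: job.t_end, default=None)


jobs = [
    JobRecord("job-a", frozenset({0, 1}), t_end=100.0),
    JobRecord("job-b", frozenset({1}), t_end=200.0),
    JobRecord("job-c", frozenset({1}), t_end=400.0),
]
# Node rank 1 was drained at t=250, so job-b is the best guess.
guess = last_job_before_drain(jobs, rank=1, drain_time=250.0)
```

Note that this is only a guess based on timing: the result is the same whether the drain came from housekeeping or from an admin who happened to drain the node shortly after the job finished.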

I think we should have additional information in the drain event when that information is available.

  • jobid - an optional field that ties the drain event to a specific job. flux resource drain could set it automatically when the FLUX_JOB_ID environment variable exists.

  • reporter - could be housekeeping, epilog, or "user" (command line). For example, if an admin drains a node for some very specific purpose, we'll know the drain wasn't related to a job per se.
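As a rough illustration of the proposal, the sketch below builds a drain event whose context carries the two optional fields. The helper name make_drain_event() is hypothetical, the "jobid" and "reporter" keys are the proposed additions and do not exist today, and the "idset" and "reason" keys are my assumption of what the current drain context contains.

```python
import json
import os
import time
from typing import Mapping, Optional


def make_drain_event(
    idset: str,
    reason: str,
    reporter: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build a drain event dict with the proposed optional fields.

    Hypothetical helper: "jobid" and "reporter" are proposed keys,
    and "idset"/"reason" are assumed existing context keys.
    """
    env = os.environ if environ is None else environ
    context = {"idset": idset, "reason": reason}
    # Proposed: record the job automatically when FLUX_JOB_ID is set.
    jobid = env.get("FLUX_JOB_ID")
    if jobid:
        context["jobid"] = jobid
    # Proposed: record who issued the drain, when known.
    if reporter is not None:
        context["reporter"] = reporter
    return {"timestamp": time.time(), "name": "drain", "context": context}


# A drain issued from housekeeping with FLUX_JOB_ID in the environment.
from_housekeeping = make_drain_event(
    "3", "housekeeping failed",
    reporter="housekeeping", environ={"FLUX_JOB_ID": "1234"},
)
# A drain issued by an admin on the command line, with no job context.
from_admin = make_drain_event(
    "3", "maintenance", reporter="user", environ={},
)
print(json.dumps(from_housekeeping["context"], sort_keys=True))
```

With fields like these, associating a drain with a job becomes a direct lookup of the jobid key, and the reporter key distinguishes an admin drain from one issued by housekeeping or the epilog.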

Maybe some other info could be useful too. These are only the bits I've thought of so far.
