Hi,
I am testing NHC with Slurm to automatically drain nodes that report uncorrectable ECC errors. The NHC log shows that the health check fails on the problematic node, but no helper script is executed to put the node into the drain state. If I call the helper script manually, e.g. sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into the drain state. How can I get Slurm and NHC to call the helper scripts automatically when a node fails the health check? Thanks!
/var/log/nhc.log:
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed. Check check_xid_errors returned 1
ERROR: nhc: Health check failed: Check check_xid_errors returned 1
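For context, here is a minimal sketch of the NHC-side settings that usually decide whether node-mark-offline is invoked after a failed check. It assumes the stock LBNL NHC layout, where /etc/sysconfig/nhc is sourced as shell at startup; the file path and the values below are assumptions about this installation, not something shown in the log above. On the Slurm side, slurmd only runs NHC periodically if slurm.conf sets HealthCheckProgram (and a non-zero HealthCheckInterval), so that is worth checking as well.

```shell
# /etc/sysconfig/nhc (assumed default location; sourced by nhc as shell)

# Resource manager whose commands the helper scripts should use.
# The manual test above passed this on the command line instead.
NHC_RM=slurm

# 1 = run the node-mark-offline helper when a check fails.
# 0 = only log the failure, which would match the symptom described.
MARK_OFFLINE=1
```

If MARK_OFFLINE turns out to be 0 somewhere (this file, nhc.conf, or the command line that launches nhc), the failure would be logged exactly as above without the node being drained.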