Helper scripts are not called when the node fails the health check with Slurm #147

@szhengac

Hi,

I am testing NHC with Slurm to automatically drain nodes that have uncorrectable ECC errors. The NHC log shows that the health check fails on the problematic node, but no helper script is executed to put the node into the drain state. If I call the helper script manually, e.g. sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into the drain state. How can I get Slurm and NHC to call the helper scripts automatically when a node fails the health check? Thanks!

/var/log/nhc.log:

Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
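
For context, here is a hedged sketch of the settings that usually govern whether the helper script runs. It is not a confirmed fix for this report. It assumes NHC is installed at /usr/sbin/nhc and reads /etc/sysconfig/nhc; neither path appears in the report above, so both are assumptions about this installation. Because the manual command works only with NHC_RM=slurm set explicitly, one possibility is that NHC does not detect the resource manager when slurmd launches it, so the sketch sets NHC_RM and MARK_OFFLINE by hand.

```
# /etc/sysconfig/nhc  (assumed location; read by nhc as shell variable assignments)
NHC_RM=slurm        # name the resource manager explicitly instead of relying on detection
MARK_OFFLINE=1      # allow nhc to call node-mark-offline when a check fails

# slurm.conf  (values are illustrative, not taken from the report)
HealthCheckProgram=/usr/sbin/nhc   # assumed install path of nhc
HealthCheckInterval=300            # seconds between health check runs
HealthCheckNodeState=ANY           # run the check regardless of node state
```

After changing slurm.conf, the new values would typically be picked up with scontrol reconfigure. Running the check once by hand on the failing node and then looking at the node state with scontrol show node ib-vm-25 is one way to see whether the drain now happens.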
