Open
Description
Prerequisites
- I searched existing issues
Feature Summary
As per the public XID documentation, WORKFLOW_NVLINK_ERR has a complex remediation workflow:
Extract the hex strings from the Xid error message. There should be seven fields in the Xid. Unused fields are expected to be 0x0 rather than a full DWORD of zeros. The first, third, fourth, and fifth registers are valid for Hopper-based products.
Evaluate the populated registers. If bits other than those specifically outlined below are seen, please report a bug.
First register:
Bit 0, 23, 30: Can be safely ignored.
Bits 1, 20: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Bits 4 or 5: Likely HW issue with ECC/Parity --> If seen more than 2 times on the same link, report a bug.
Bits 8, 9, 12, 16, 17, 24, 28: Could possibly be a HW issue. Check link mechanical connections and re-seat if a field resolution is required. Run diags if the issue persists. If the issue persists and diagnostics have passed, please report a bug.
Bits 21 or 22: Marginal channel SI issue. If other errors accompany this Xid, follow the resolution for those first. Otherwise, check link mechanical connections. Run Field Diags and report a bug.
Bits 27, 29: If seen repeatedly, please report a bug.
Third register:
Bits 0, 1, 2, 6: Likely HW issue with ECC/Parity --> If seen more than 2 times on the same link, report a bug.
Bit 13: Not expected to be seen in production. If seen, please report a bug.
Bits 16, 19: If seen repeatedly, please run Field Diags and report a bug.
Bits 17, 18: If seen repeatedly, please report a bug.
Fourth register:
Bits 16, 17: These are generally sympathetic or secondary errors. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Bit 18: These are generally sympathetic or secondary errors, though a reset of the fabric is required. If seen with other bits set or other Xid/SXid, please follow the resolution for those. If seen solo, please report a bug.
Fifth register:
Bits 18, 19, 21, 22, 24, 25, 27, 28: Likely HW issue with ECC/Parity --> If seen more than 2 times on the same link, report a bug.
Bits 20, 23, 26, 29: These errors represent a threshold of ECC errors being exceeded. There was no uncorrectable error at this time. Continue operation. If desired, Field Diags can be run to check for link integrity.
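The first two steps of the workflow above (extract the seven fields, then check which bits are set) could be sketched roughly as follows. This is only an illustration, not NVSentinel code: the function names, the `KNOWN_BITS` table layout, and the sample message are all hypothetical, and the parser assumes the register fields are the only `0x`-prefixed tokens in the message. Any bit set in the second, sixth, or seventh register is treated as "not outlined", which is one possible reading of the Hopper note.

```python
import re

# Bits named in the workflow above, keyed by 1-based register position.
# Transcribed from the issue text; registers 2, 6 and 7 list no bits.
KNOWN_BITS = {
    1: {0, 1, 4, 5, 8, 9, 12, 16, 17, 20, 21, 22, 23, 24, 27, 28, 29, 30},
    3: {0, 1, 2, 6, 13, 16, 17, 18, 19},
    4: {16, 17, 18},
    5: set(range(18, 30)),  # bits 18 through 29
}

HEX_FIELD = re.compile(r"0x[0-9a-fA-F]+")


def extract_registers(message):
    """Return the seven hex fields of an Xid message as integers."""
    fields = [int(token, 16) for token in HEX_FIELD.findall(message)]
    if len(fields) != 7:
        raise ValueError(f"expected 7 hex fields, found {len(fields)}")
    return fields


def set_bits(value):
    """List the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def evaluate(message):
    """Split set bits into those the workflow names and those it does not.

    Returns (known, unknown), each mapping register position -> bit list.
    Anything in `unknown` falls under "please report a bug".
    """
    known, unknown = {}, {}
    for position, value in enumerate(extract_registers(message), start=1):
        allowed = KNOWN_BITS.get(position, set())
        for bit in set_bits(value):
            target = known if bit in allowed else unknown
            target.setdefault(position, []).append(bit)
    return known, unknown


# Illustrative message only; the real log format may differ.
sample = "NVLink: fatal error detected on link 14(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)"
known, unknown = evaluate(sample)
print(known, unknown)
```

For the sample above, only bit 16 of the first register is set, so it lands in `known` and `unknown` stays empty.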
Currently NVSentinel doesn't support this workflow and defaults to CONTACT_SUPPORT, which requires manual intervention.
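As a starting point for discussion, the "more than 2 times on the same link" rule could be tracked with a small per-link counter, keeping CONTACT_SUPPORT as the fallback for anything not yet automated. The sketch below is hypothetical: `NvlinkErrTracker`, `NONE`, and `REPORT_BUG` are made-up names, only a few of the bit classes are covered, and counting repeats per link (rather than per bit) is an assumption about how the documentation should be read.

```python
from collections import Counter

# (register, bit) pairs taken from the workflow text in this issue.
IGNORABLE_BITS = {(1, 0), (1, 23), (1, 30)}
ECC_THRESHOLD_BITS = {(5, b) for b in (20, 23, 26, 29)}  # "Continue operation"
ECC_PARITY_BITS = (
    {(1, b) for b in (4, 5)}
    | {(3, b) for b in (0, 1, 2, 6)}
    | {(5, b) for b in (18, 19, 21, 22, 24, 25, 27, 28)}
)


class NvlinkErrTracker:
    """Counts ECC/parity hits per link and picks an action (hypothetical)."""

    def __init__(self, repeat_limit=2):
        self.repeat_limit = repeat_limit
        self.hits_per_link = Counter()

    def action_for(self, link, register, bit):
        key = (register, bit)
        if key in IGNORABLE_BITS or key in ECC_THRESHOLD_BITS:
            return "NONE"  # hypothetical name for "no action needed"
        if key in ECC_PARITY_BITS:
            self.hits_per_link[link] += 1
            if self.hits_per_link[link] > self.repeat_limit:
                return "REPORT_BUG"  # hypothetical action name
            return "NONE"
        # Everything not automated here keeps today's behaviour.
        return "CONTACT_SUPPORT"


tracker = NvlinkErrTracker()
for _ in range(3):
    print(tracker.action_for(link=14, register=1, bit=4))
```

With the default limit, the first two hits on link 14 produce `NONE` and the third produces `REPORT_BUG`; a bit class that is not modelled, such as first-register bit 8, still returns `CONTACT_SUPPORT`.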
Problem/Use Case
Automatic remediation for WORKFLOW_NVLINK_ERR
Proposed Solution
TBD
Component
Fault Management
Metadata
Labels
enhancement (New feature or request)