Open
Labels
enhancement (New feature or request)
Description
Prerequisites
- I searched existing issues
Feature Summary
In GB200 or other rack-scale deployments, we need to be able to drain an entire rack when a sufficient number of its nodes go bad. Currently, NVSentinel does a good job of identifying bad nodes in a rack, but it doesn't "escalate" to a rack-level issue when bad nodes in that rack persist.
Problem/Use Case
As an operator, I want the ability to remediate an entire rack when 50% or more of the nodes in that rack are bad due to switch or other errors.
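To make the threshold concrete, here is a minimal sketch of the escalation check, not a proposed implementation. Everything in it is an assumption for illustration: the function name `racks_to_drain`, the node-to-rack mapping, and the set of bad nodes are hypothetical and are not part of NVSentinel. It also ignores how long a node has been bad, so the "persistent" part of the requirement is not modeled.

```python
from collections import defaultdict


def racks_to_drain(node_rack, bad_nodes, threshold=0.5):
    """Return the racks whose share of bad nodes is at or above threshold.

    node_rack: dict mapping node name -> rack name (hypothetical input).
    bad_nodes: set of node names currently flagged as bad (hypothetical input).
    threshold: fraction of bad nodes that escalates the fault to rack level.
    """
    total = defaultdict(int)
    bad = defaultdict(int)
    for node, rack in node_rack.items():
        total[rack] += 1
        if node in bad_nodes:
            bad[rack] += 1
    # Every rack in `total` has at least one node, so the division is safe.
    return sorted(r for r in total if bad[r] / total[r] >= threshold)


# Example: rack-a has 1 of 2 nodes bad (50%), rack-b has 1 of 3 bad (~33%).
nodes = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b", "n4": "rack-b", "n5": "rack-b"}
print(racks_to_drain(nodes, {"n1", "n3"}))  # ['rack-a']
```

A real check would likely also need a persistence window and a way to exclude nodes that are already cordoned, both of which are open questions here.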
Proposed Solution
TBD
Component
Fault Management
mchmarny