[Feature]: Support for rack "management" #367

@lalitadithya

Prerequisites

  • I searched existing issues

Feature Summary

In GB200 or other rack-scale deployments, we need to be able to drain an entire rack when a sufficient number of its nodes go bad. NVSentinel currently does a good job of identifying bad nodes in a rack, but it doesn't escalate to a rack-level issue when bad nodes persist in that rack.

Problem/Use Case

As an operator, I want the ability to remediate an entire rack when 50% or more of the nodes in that rack are bad due to switch or other errors.
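
To make the threshold in this use case concrete, here is a minimal sketch of the escalation check. It is not NVSentinel code: the function name `racks_to_drain`, the `(node_name, rack_id, healthy)` tuple shape, and the default threshold of 0.5 are all assumptions made for illustration, and it does not model how long a node has been bad.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical record shape: (node_name, rack_id, healthy).
NodeStatus = Tuple[str, str, bool]


def racks_to_drain(nodes: Iterable[NodeStatus], threshold: float = 0.5) -> List[str]:
    """Return the racks whose fraction of bad nodes is at or above `threshold`."""
    total = defaultdict(int)  # rack_id -> number of nodes seen
    bad = defaultdict(int)    # rack_id -> number of unhealthy nodes seen
    for _node_name, rack_id, healthy in nodes:
        total[rack_id] += 1
        if not healthy:
            bad[rack_id] += 1
    # A rack is escalated when bad / total >= threshold.
    return sorted(rack for rack in total if bad[rack] / total[rack] >= threshold)


if __name__ == "__main__":
    sample = [
        ("n1", "rack-a", False),
        ("n2", "rack-a", False),
        ("n3", "rack-a", True),
        ("n4", "rack-a", True),
        ("n5", "rack-b", True),
        ("n6", "rack-b", False),
        ("n7", "rack-b", True),
    ]
    print(racks_to_drain(sample))  # rack-a has 2 of 4 bad, rack-b has 1 of 3
```

In a real implementation the threshold would likely need to be configurable per rack type, and the check would probably need a persistence window so that a brief blip does not drain a whole rack.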

Proposed Solution

TBD

Component

Fault Management

Labels

enhancement (New feature or request)
