[Bug]: NVSentinel sometimes doesn't uncordon nodes #363

@lalitadithya

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

The UAT tests failed (https://github.com/NVIDIA/NVSentinel/actions/runs/19476069824/job/55736046742) because the node was never uncordoned. The failure is intermittent: it does not happen on every run, only some of the time. It appears to be a race between the DCGM connectivity error and the inform failure.

Component

Fault Management

Steps to Reproduce

  1. Run the UAT test script on GKE three or more times; it may fail on one of the runs
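
To confirm the symptom after each run, one option is to list nodes that are still cordoned. The sketch below is not part of NVSentinel or the UAT script; it assumes the output of `kubectl get nodes -o json` is available as a string and relies on the standard Kubernetes `spec.unschedulable` field that cordoning sets. The function name `cordoned_nodes` and the sample node names are hypothetical.

```python
import json


def cordoned_nodes(node_list_json: str) -> list[str]:
    """Return the names of nodes marked unschedulable (i.e. cordoned).

    Expects the JSON produced by `kubectl get nodes -o json`.
    """
    doc = json.loads(node_list_json)
    return [
        item["metadata"]["name"]
        for item in doc.get("items", [])
        # `spec.unschedulable` is absent on schedulable nodes, so default to False.
        if item.get("spec", {}).get("unschedulable", False)
    ]


# Hypothetical sample data shaped like a trimmed `kubectl get nodes -o json` result.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-0"}, "spec": {}},
        {"metadata": {"name": "gpu-node-1"}, "spec": {"unschedulable": True}},
    ]
})

print(cordoned_nodes(SAMPLE))
```

If the list is non-empty after the UAT run has finished, the run likely hit this bug.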

Environment

  • NVSentinel version: latest main
  • Kubernetes version: 1.33
  • Deployment method: helm

Logs/Output

@lalitadithya has logs from the UAT tests; please reach out via Slack DM.

Labels

bug (Something isn't working)
