Open
Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Bug Description
The UAT tests failed (https://github.com/NVIDIA/NVSentinel/actions/runs/19476069824/job/55736046742) because the node was never uncordoned. The failure is intermittent: it does not happen on every run, only on some. It appears to be a race between the DCGM connectivity error and the inform failure.
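To confirm the symptom after a failed run, it can help to list the nodes that are still cordoned. The following is a minimal sketch, not part of NVSentinel: the helper names `cordoned_nodes` and `fetch_node_list` are hypothetical, and it assumes `kubectl` is on the PATH and configured for the cluster under test. It relies on the `spec.unschedulable` field that Kubernetes sets on a cordoned node.

```python
import json
import subprocess


def cordoned_nodes(node_list: dict) -> list:
    """Return the names of nodes whose spec.unschedulable is true (cordoned)."""
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item.get("spec", {}).get("unschedulable", False)
    ]


def fetch_node_list() -> dict:
    """Fetch the node list as JSON (assumes kubectl targets the test cluster)."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling `cordoned_nodes(fetch_node_list())` after the UAT run finishes should return an empty list if every node was uncordoned; any name left in the list is a node stuck in the state described above.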
Component
Fault Management
Steps to Reproduce
- Run the UAT test script on GKE at least 3 times; the failure is intermittent and may show up in only one of the runs
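Because the failure is intermittent, repeating the run in a loop and counting failures makes it easier to reproduce. This is only a sketch: `run_repeatedly` is a hypothetical helper, and the command passed to it stands in for however the UAT test script is actually invoked, which this issue does not specify.

```python
import subprocess


def run_repeatedly(cmd: list, attempts: int) -> list:
    """Run cmd `attempts` times and return the exit code of each run."""
    codes = []
    for _ in range(attempts):
        # A non-zero return code is treated as a failed run.
        codes.append(subprocess.run(cmd).returncode)
    return codes


def count_failures(codes: list) -> int:
    """Count the runs that exited with a non-zero status."""
    return sum(1 for code in codes if code != 0)
```

For example, `count_failures(run_repeatedly(cmd, 5))` reports how many of five runs failed, where `cmd` is the argument list that launches the UAT script in your checkout.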
Environment
- NVSentinel version: latest main
- Kubernetes version: 1.33
- Deployment method: helm
Logs/Output
@lalitadithya has the logs from the UAT tests; please reach out via Slack DM.
Metadata
Labels
bug: Something isn't working