Releases · NVIDIA/NVSentinel

NVSentinel v0.3.0 adds new capabilities for GPU infrastructure monitoring, more automation features, and reliability improvements. The focus is on making it easier to understand your GPU environment and giving you more control over how NVSentinel responds to issues.

🎯 Major New Features

GPU Metadata Collection

NVSentinel can now automatically collect detailed information about your GPU hardware, including GPU topology, NVSwitch information, and selected hardware specifications. This information helps with troubleshooting and gives better visibility into your GPU infrastructure.

Enhanced Health Event Data

Health events now include rich contextual information about your nodes, including cloud provider details, availability zones, instance types, and CUDA driver versions. This automatic enrichment helps correlate issues across your infrastructure and speeds up root cause analysis.

Intelligent Pattern Detection

The health events analyzer can now detect when multiple issues occur on the same node within a time window. For example, if a node requires remediation multiple times in a short period, NVSentinel can automatically escalate this to support.
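
As a rough illustration of this kind of rule, the sketch below counts events per node in a sliding time window and signals escalation once a threshold is reached. It is not NVSentinel's implementation: the class name RepeatedRemediationDetector, the threshold and window parameters, and the node name are all hypothetical, and the sketch assumes events arrive in time order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class RepeatedRemediationDetector:
    """Hypothetical helper: flags a node with `threshold` events inside `window`."""

    def __init__(self, threshold: int, window: timedelta) -> None:
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)  # node name -> event timestamps

    def record(self, node: str, when: datetime) -> bool:
        """Store one event and return True if the node should be escalated."""
        events = self._events[node]
        events.append(when)
        # Drop events that fell out of the window relative to the newest event.
        while events and when - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


detector = RepeatedRemediationDetector(threshold=3, window=timedelta(hours=1))
start = datetime(2025, 1, 1, 12, 0)
results = [
    detector.record("gpu-node-1", start + timedelta(minutes=m))
    for m in (0, 10, 20, 120)
]
print(results)  # → [False, False, True, False]
```

The third event lands within an hour of the first two, so it triggers escalation. The fourth arrives after the earlier ones have aged out, so the count starts over.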

Manual Override Capability

You can now manually uncordon a quarantined node, which will automatically cancel the entire automated remediation pipeline for that node. This gives operators direct control when they need to intervene.
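
One way to picture the override is as a reconciliation check: if automation cordoned a node and the node is later observed as schedulable again, the pipeline for that node is cancelled. The sketch below assumes this model and is not taken from the NVSentinel source. Stage, NodeState, and reconcile are hypothetical names. The only Kubernetes fact it relies on is that cordoning a node sets spec.unschedulable to true.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Stage(Enum):
    """Hypothetical pipeline stages for a node under automated handling."""
    QUARANTINED = "quarantined"
    DRAINING = "draining"
    REMEDIATING = "remediating"
    CANCELLED = "cancelled"


@dataclass(frozen=True)
class NodeState:
    name: str
    stage: Stage
    cordoned_by_automation: bool = True


def reconcile(node: NodeState, unschedulable: bool) -> NodeState:
    """Cancel the pipeline when an automation-cordoned node is schedulable again."""
    if node.stage is Stage.CANCELLED:
        return node
    if node.cordoned_by_automation and not unschedulable:
        # Someone uncordoned the node by hand: stop all remaining automated steps.
        return replace(node, stage=Stage.CANCELLED, cordoned_by_automation=False)
    return node


draining = NodeState(name="gpu-node-1", stage=Stage.DRAINING)
print(reconcile(draining, unschedulable=True).stage)   # node is still cordoned
print(reconcile(draining, unschedulable=False).stage)  # node was manually uncordoned
```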

Advanced Log Collection

The log collector now automatically gathers SOS reports (sosreport) from AWS-hosted GPU nodes, in addition to the existing NVIDIA bug reports and GPU Operator logs. This provides more complete diagnostic information for those nodes.

🔧 Configuration & Usability Improvements

Comprehensive Configuration Documentation

The Helm chart now includes extensive inline documentation for all configuration options, making it easier to customize NVSentinel for your environment. A new values-full.yaml reference file provides detailed examples.

Unified Configuration Management

All modules now use a standardized configuration system, making it more consistent and predictable to configure different parts of NVSentinel.

Kata Container Auto-Detection

NVSentinel can now automatically detect when running in Kata containers and adjust its monitoring approach accordingly.

🐛 Bug Fixes & Reliability Improvements

Fault Quarantine Improvements

Fixed: Unnecessary events are no longer propagated to node drainer and fault remediation modules, reducing noise in the system
Fixed: Taints are no longer applied in dry-run mode, allowing you to safely test configurations
Fixed: Race conditions in node monitoring that could cause inconsistent state

Health Monitoring Fixes

Fixed: Health events are now properly sent even when DCGM connectivity temporarily fails
Fixed: GPU falling off the bus is now detected even without specific XID error codes
Fixed: Resource cleanup and connection handling after DCGM failures are more robust
Fixed: Raw journal messages are now fully stored in health events for better debugging

End-to-End Testing

Fixed: Node drainer restarts properly in end-to-end test environments
Fixed: Multiple test flakes and race conditions resolved
Fixed: Log collector configuration paths corrected

Data Flow Optimizations

Fixed: MongoDB change streams now properly handle error conditions
Fixed: Platform connectors fail fast when health events cannot be published, preventing data loss
Fixed: Improved error handling throughout the event processing pipeline
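
To illustrate the fail-fast idea from the list above, the sketch below stops at the first event that cannot be published and raises an error instead of logging and moving on, so a caller can retry or restart without silently skipping events. This is a generic pattern under assumed names (PublishError, publish_all, flaky_publish), not the platform connectors' actual code.

```python
from typing import Callable, Iterable


class PublishError(RuntimeError):
    """Hypothetical error raised when an event cannot be published."""


def publish_all(events: Iterable[dict], publish: Callable[[dict], None]) -> int:
    """Publish events in order and stop at the first failure."""
    sent = 0
    for event in events:
        try:
            publish(event)
        except Exception as exc:
            # Fail fast: surface the failure rather than dropping the event.
            raise PublishError(f"publish failed after {sent} event(s)") from exc
        sent += 1
    return sent


def flaky_publish(event: dict) -> None:
    """Stand-in publisher that rejects one specific event."""
    if event["id"] == 2:
        raise ConnectionError("event store unreachable")


try:
    publish_all([{"id": 1}, {"id": 2}, {"id": 3}], flaky_publish)
except PublishError as err:
    print(err)  # → publish failed after 1 event(s)
```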

🔒 Security & Compliance Enhancements

SLSA Build Provenance

All container images now include SLSA (Supply chain Levels for Software Artifacts) attestations and Software Bill of Materials (SBOM). Sigstore Policy Controller integration enables verification of build provenance.

Security Scanning

Daily vulnerability scanning implemented for all container images
Security validation now excludes test directories for more focused results

📊 Monitoring & Observability

Improved Metrics

Comprehensive audit and documentation of all Prometheus metrics
Better labeling and organization of metrics across modules
New metrics for manual uncordon operations and pattern detection

Enhanced Logging

Structured logging implemented across all modules for consistency
Reduced log verbosity while maintaining useful information
Better error messages and debugging context

🏗️ Infrastructure & Development

Build System Improvements

Images can now be built with either Docker or ko (Kubernetes-optimized builder)
ARM64 architecture support across all container images
Optimized build times and smaller image sizes
Improved GitHub Actions workflows for faster CI/CD

Dependency Updates

Upgraded to golangci-lint v2 for better code quality checking
Updated multiple cloud provider SDKs (AWS, GCP, Azure)
Updated various Go and Python dependencies to latest stable versions
Updated CUDA base images

📚 Documentation

New Design Documents

GPU metadata retrieval design
Data flow through NVSentinel (from detection through remediation)
Overview documentation explaining what NVSentinel is and why it's important
Integration guides

Updated Guides

All documentation updated to reflect current repository structure
Development guide improvements
Contributing guidelines clarification
Roadmap published showing planned features

🔄 Breaking Changes & Migration Notes

Generic Maintenance Resources

The fault remediation module now uses generic maintenance resources instead of reboot-specific resources. If you're using custom remediation integrations, you may need to update your configurations.

Configuration Schema Changes

Some configuration parameters have been renamed or restructured for consistency. Review the updated values-full.yaml for the latest schema.

📈 Quality Improvements

Testing Infrastructure

Added comprehensive end-to-end tests for all modules
UAT (User Acceptance Testing) framework for AWS environments
Improved test coverage reporting
Better test isolation and reliability

Code Quality

Streamlined Makefiles to reduce duplication and cognitive load
Improved linting rules and enforcement
Better code organization and module boundaries
Reduced technical debt across the codebase

🙏 Acknowledgments

This release includes over 140 commits from 10 contributors, improving virtually every aspect of NVSentinel.

Thank you to everyone who contributed code, documentation, testing, and feedback!

📦 What's Included

Container Images (14 components)

gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
syslog-health-monitor
csp-health-monitor
metadata-collector
platform-connectors
health-events-analyzer
fault-quarantine
labeler
node-drainer
fault-remediation
janitor
log-collector
file-server-cleanup

🔗 Resources

GitHub Repository: https://github.com/NVIDIA/NVSentinel
Container Registry: ghcr.io/nvidia/nvsentinel
Documentation: See /docs directory in repository
Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
Discussions: https://github.com/NVIDIA/NVSentinel/discussions

⚠️ Known Limitations

This is an experimental/preview release - use caution in production
Some features are disabled by default and must be explicitly enabled
Manual intervention may still be required for certain failure scenarios

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.3.0 \
  --namespace nvsentinel \
  --create-namespace

For detailed installation and configuration instructions, see the README.
