Releases: NVIDIA/NVSentinel
Release v0.3.0
This release introduces significant new capabilities for GPU infrastructure monitoring, enhanced automation features, and improved reliability. We've focused on making it easier to understand your GPU environment and giving you more control over how NVSentinel responds to issues.
🎯 Major New Features
GPU Metadata Collection
NVSentinel can now automatically collect detailed information about your GPU hardware, including GPU topology, NVSwitch information, and selected hardware specifications. This information helps with troubleshooting and provides better visibility into your GPU infrastructure.
Enhanced Health Event Data
Health events now include rich contextual information about your nodes, including cloud provider details, availability zones, instance types, and CUDA driver versions. This automatic enrichment helps correlate issues across your infrastructure and speeds up root cause analysis.
Intelligent Pattern Detection
The health events analyzer can now detect when multiple issues occur on the same node within a time window. For example, if a node requires remediation multiple times in a short period, NVSentinel can automatically escalate this to support.
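The idea of counting repeated issues on one node inside a time window can be sketched as follows. This is an illustration only, not NVSentinel's implementation: the function name `detect_repeats`, the input format (one `<epoch_seconds> <node_name>` pair per line, sorted by time), and the window and threshold values are all assumptions made for the example.

```shell
# detect_repeats WINDOW_SECONDS THRESHOLD < events
# Hypothetical helper: reads "<epoch_seconds> <node_name>" lines sorted by time
# and prints a node name once it has THRESHOLD events within WINDOW_SECONDS.
detect_repeats() {
  awk -v window="$1" -v threshold="$2" '
    {
      ts = $1; node = $2
      n = ++count[node]              # number of events seen for this node
      times[node, n] = ts            # remember the time of each event
      if (!(node in start)) start[node] = 1
      # slide the start of the window past events that are too old
      while (ts - times[node, start[node]] > window) start[node]++
      # events still inside the window: n - start + 1
      if (n - start[node] + 1 >= threshold && !(node in flagged)) {
        flagged[node] = 1            # report each node only once
        print node
      }
    }'
}

# Usage (made-up data): three events on node-a within 600 seconds
# printf "%s\n" "0 node-a" "200 node-a" "300 node-a" | detect_repeats 600 3
```

A sliding window is used here so that old events age out; a node that fails once a day would not be flagged by a ten-minute window.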
Manual Override Capability
You can now manually uncordon a quarantined node, which will automatically cancel the entire automated remediation pipeline for that node. This gives operators direct control when they need to intervene.
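As a rough sketch, and assuming quarantine is implemented as a standard Kubernetes cordon, a manual override could look like the helper below. The function name `release_node` and the node name in the usage line are hypothetical; `kubectl uncordon` and the `jsonpath` query are standard kubectl features.

```shell
# Hypothetical helper: uncordon a node and print its unschedulable flag.
release_node() {
  node="$1"
  # Mark the node schedulable again
  kubectl uncordon "$node"
  # Empty or "false" output means the node is schedulable
  kubectl get node "$node" -o jsonpath='{.spec.unschedulable}'
}

# Usage (placeholder node name):
# release_node gpu-node-1
```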
Advanced Log Collection
The log collector now automatically gathers AWS SOS reports (sosreport) in addition to existing NVIDIA bug reports and GPU Operator logs. This provides comprehensive diagnostic information for AWS-hosted GPU nodes.
🔧 Configuration & Usability Improvements
Comprehensive Configuration Documentation
The Helm chart now includes extensive inline documentation for all configuration options, making it easier to customize NVSentinel for your environment. A new values-full.yaml reference file provides detailed examples.
Unified Configuration Management
All modules now use a standardized configuration system, making it more consistent and predictable to configure different parts of NVSentinel.
Kata Container Auto-Detection
NVSentinel can now automatically detect when running in Kata containers and adjust its monitoring approach accordingly.
🐛 Bug Fixes & Reliability Improvements
Fault Quarantine Improvements
- Fixed: Unnecessary events are no longer propagated to node drainer and fault remediation modules, reducing noise in the system
- Fixed: Taints are no longer applied in dry-run mode, allowing you to safely test configurations
- Fixed: Race conditions in node monitoring that could cause inconsistent state
Health Monitoring Fixes
- Fixed: Health events are now properly sent even when DCGM connectivity temporarily fails
- Fixed: GPU falling off the bus is now detected even without specific XID error codes
- Fixed: Resource cleanup and connection handling after DCGM failures are more robust
- Fixed: Raw journal messages are now fully stored in health events for better debugging
End-to-End Testing
- Fixed: Node drainer restarts properly in end-to-end test environments
- Fixed: Multiple test flakes and race conditions resolved
- Fixed: Log collector configuration paths corrected
Data Flow Optimizations
- Fixed: MongoDB change streams now properly handle error conditions
- Fixed: Platform connectors fail fast when health events cannot be published, preventing data loss
- Fixed: Improved error handling throughout the event processing pipeline
🔒 Security & Compliance Enhancements
SLSA Build Provenance
All container images now include SLSA (Supply chain Levels for Software Artifacts) attestations and Software Bill of Materials (SBOM). Sigstore Policy Controller integration enables verification of build provenance.
Security Scanning
- Daily vulnerability scanning implemented for all container images
- Security validation now excludes test directories for more focused results
📊 Monitoring & Observability
Improved Metrics
- Comprehensive audit and documentation of all Prometheus metrics
- Better labeling and organization of metrics across modules
- New metrics for manual uncordon operations and pattern detection
Enhanced Logging
- Structured logging implemented across all modules for consistency
- Reduced log verbosity while maintaining useful information
- Better error messages and debugging context
🏗️ Infrastructure & Development
Build System Improvements
- Images can now be built with either Docker or ko (Kubernetes-optimized builder)
- ARM64 architecture support across all container images
- Optimized build times and smaller image sizes
- Improved GitHub Actions workflows for faster CI/CD
Dependency Updates
- Upgraded to golangci-lint v2 for better code quality checking
- Updated multiple cloud provider SDKs (AWS, GCP, Azure)
- Updated various Go and Python dependencies to latest stable versions
- Updated CUDA base images
📚 Documentation
New Design Documents
- GPU metadata retrieval design
- Data flow through NVSentinel (from detection through remediation)
- Overview documentation explaining what NVSentinel is and why it's important
- Integration guides
Updated Guides
- All documentation updated to reflect current repository structure
- Development guide improvements
- Contributing guidelines clarification
- Roadmap published showing planned features
🔄 Breaking Changes & Migration Notes
Generic Maintenance Resources
The fault remediation module now uses generic maintenance resources instead of reboot-specific resources. If you're using custom remediation integrations, you may need to update your configurations.
Configuration Schema Changes
Some configuration parameters have been renamed or restructured for consistency. Review the updated values-full.yaml for the latest schema.
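One way to spot renamed or restructured parameters is to compare the chart defaults of two versions. The sketch below assumes both chart versions are published at the OCI reference used in the install commands of these notes; the helper names and output file names are placeholders. Note that `helm show values` prints the chart's default values file, which may not be identical to values-full.yaml.

```shell
# Sketch only: chart reference taken from the install commands in these notes.
CHART="oci://ghcr.io/nvidia/nvsentinel"

# Write the default values of one chart version to a file.
dump_defaults() {
  helm show values "$CHART" --version "$1" > "$2"
}

# Show how the defaults changed between two chart versions.
diff_defaults() {
  dump_defaults "$1" "values-$1.yaml"
  dump_defaults "$2" "values-$2.yaml"
  # diff exits non-zero when the files differ, which is the expected case here
  diff -u "values-$1.yaml" "values-$2.yaml" || true
}

# Usage:
# diff_defaults v0.2.0 v0.3.0
```

Any key that appears only on the removed side of the diff and is also set in your own overrides is a candidate for migration.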
📈 Quality Improvements
Testing Infrastructure
- Added comprehensive end-to-end tests for all modules
- UAT (User Acceptance Testing) framework for AWS environments
- Improved test coverage reporting
- Better test isolation and reliability
Code Quality
- Streamlined Makefiles to reduce duplication and cognitive load
- Improved linting rules and enforcement
- Better code organization and module boundaries
- Reduced technical debt across the codebase
🙏 Acknowledgments
This release includes contributions from 10 contributors, with over 140 commits improving virtually every aspect of NVSentinel:
- @lalitadithya
- @mchmarny
- @dims
- @XRFXLP
- @KaivalyaMDabhadkar
- @rupalis-nv
- @Gyan172004
- @nitz2407
- @tabern
- @tanishagoyal2
Thank you to everyone who contributed code, documentation, testing, and feedback!
📦 What's Included
Container Images (14 components)
- gpu-health-monitor-dcgm3
- gpu-health-monitor-dcgm4
- syslog-health-monitor
- csp-health-monitor
- metadata-collector
- platform-connectors
- health-events-analyzer
- fault-quarantine
- labeler
- node-drainer
- fault-remediation
- janitor
- log-collector
- file-server-cleanup
🔗 Resources
- GitHub Repository: https://github.com/NVIDIA/NVSentinel
- Container Registry: ghcr.io/nvidia/nvsentinel
- Documentation: See /docs directory in repository
- Issue Tracker: https://github.com/NVIDIA/NVSentinel/issues
- Discussions: https://github.com/NVIDIA/NVSentinel/discussions
⚠️ Known Limitations
- This is an experimental/preview release - use caution in production
- Some features are disabled by default and must be explicitly enabled
- Manual intervention may still be required for certain failure scenarios
🚀 Getting Started
To install this release:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.3.0 \
  --namespace nvsentinel \
  --create-namespace
For detailed installation and configuration instructions, see the README.
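After installing, a quick sanity check might look like the following sketch. The release name and namespace are the ones used in the install command above; the helper name `check_install` is hypothetical, and which pods appear depends on the components you have enabled.

```shell
# Release name and namespace as used in the install command.
RELEASE="nvsentinel"
NAMESPACE="nvsentinel"

# Hypothetical helper: show the Helm release status and the pods it created.
check_install() {
  # Status of the release as recorded by Helm
  helm status "$RELEASE" --namespace "$NAMESPACE"
  # Pods in the namespace, to confirm they reach Running/Ready
  kubectl get pods --namespace "$NAMESPACE"
}

# Usage:
# check_install
```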
Release v0.2.0
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.2.0
Release v0.1.0
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.1.0