Releases: NVIDIA/NVSentinel

Release v0.3.0

07 Nov 13:31
v0.3.0
315fd8f

This release introduces significant new capabilities for GPU infrastructure monitoring, enhanced automation features, and improved reliability. We've focused on making it easier to understand your GPU environment and giving you more control over how NVSentinel responds to issues.

🎯 Major New Features

GPU Metadata Collection

NVSentinel can now automatically collect detailed information about your GPU hardware, including GPU topology, NVSwitch information, and selected hardware specifications. This information helps with troubleshooting and provides better visibility into your GPU infrastructure.

Enhanced Health Event Data

Health events now include rich contextual information about your nodes, including cloud provider details, availability zones, instance types, and CUDA driver versions. This automatic enrichment helps correlate issues across your infrastructure and speeds up root cause analysis.

Intelligent Pattern Detection

The health events analyzer can now detect when multiple issues occur on the same node within a time window. For example, if a node requires remediation multiple times in a short period, NVSentinel can automatically escalate this to support.
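
As a rough illustration of the windowed check described above, the following shell sketch counts events per node inside a fixed time window and prints the nodes that reach a threshold. The input format (one `<epoch_seconds> <node_name>` pair per line), the function name `nodes_to_escalate`, and its arguments are assumptions made for this example; the analyzer's actual rules and implementation are not described in these notes.

```shell
#!/bin/sh
# Hypothetical sketch: list nodes with repeated events inside a time window.
# stdin:     one event per line, formatted as "<epoch_seconds> <node_name>"
# arguments: <now_epoch> <window_seconds> <threshold>
nodes_to_escalate() {
  now="$1"
  window="$2"
  threshold="$3"
  awk -v now="$now" -v window="$window" -v threshold="$threshold" '
    # Count only events that fall inside [now - window, now].
    $1 >= now - window && $1 <= now { count[$2]++ }
    END {
      # Print every node whose in-window event count reaches the threshold.
      for (node in count)
        if (count[node] >= threshold) print node
    }
  ' | sort
}
```

For instance, piping events stamped 1000, 1300, and 1500 for a single node into `nodes_to_escalate 1600 900 3` prints that node's name, because all three events fall within the 900-second window ending at 1600.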

Manual Override Capability

You can now manually uncordon a quarantined node, which will automatically cancel the entire automated remediation pipeline for that node. This gives operators direct control when they need to intervene.
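
A minimal sketch of the operator-side step, assuming the manual action is the standard `kubectl uncordon` command. The helper name `uncordon_node` is hypothetical, and the sketch does not show how NVSentinel notices the uncordon or cancels its pipeline.

```shell
#!/bin/sh
# Hypothetical helper: manually uncordon a node by name.
# Relies only on the standard "kubectl uncordon <node>" command.
uncordon_node() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: uncordon_node <node-name>" >&2
    return 2
  fi
  kubectl uncordon "$node"
}

# Example (the node name is a placeholder):
#   uncordon_node gpu-node-1
```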

Advanced Log Collection

The log collector now automatically gathers AWS SOS reports (sosreport) in addition to existing NVIDIA bug reports and GPU Operator logs. This provides comprehensive diagnostic information for AWS-hosted GPU nodes.

🔧 Configuration & Usability Improvements

Comprehensive Configuration Documentation

The Helm chart now includes extensive inline documentation for all configuration options, making it easier to customize NVSentinel for your environment. A new values-full.yaml reference file provides detailed examples.

Unified Configuration Management

All modules now use a standardized configuration system, so configuring different parts of NVSentinel is more consistent and predictable.

Kata Container Auto-Detection

NVSentinel can now automatically detect when running in Kata containers and adjust its monitoring approach accordingly.

🐛 Bug Fixes & Reliability Improvements

Fault Quarantine Improvements

  • Fixed: Unnecessary events are no longer propagated to node drainer and fault remediation modules, reducing noise in the system
  • Fixed: Taints are no longer applied in dry-run mode, allowing you to safely test configurations
  • Fixed: Race conditions in node monitoring that could cause inconsistent state

Health Monitoring Fixes

  • Fixed: Health events are now properly sent even when DCGM connectivity temporarily fails
  • Fixed: GPU falling off the bus is now detected even without specific XID error codes
  • Fixed: Resource cleanup and connection handling after DCGM failures are more robust
  • Fixed: Raw journal messages are now fully stored in health events for better debugging

End-to-End Testing

  • Fixed: Node drainer restarts properly in end-to-end test environments
  • Fixed: Multiple test flakes and race conditions resolved
  • Fixed: Log collector configuration paths corrected

Data Flow Optimizations

  • Fixed: MongoDB change streams now properly handle error conditions
  • Fixed: Platform connectors fail fast when health events cannot be published, preventing data loss
  • Fixed: Improved error handling throughout the event processing pipeline

🔒 Security & Compliance Enhancements

SLSA Build Provenance

All container images now include SLSA (Supply chain Levels for Software Artifacts) attestations and Software Bill of Materials (SBOM). Sigstore Policy Controller integration enables verification of build provenance.

Security Scanning

  • Daily vulnerability scanning implemented for all container images
  • Security validation now excludes test directories for more focused results

📊 Monitoring & Observability

Improved Metrics

  • Comprehensive audit and documentation of all Prometheus metrics
  • Better labeling and organization of metrics across modules
  • New metrics for manual uncordon operations and pattern detection

Enhanced Logging

  • Structured logging implemented across all modules for consistency
  • Reduced log verbosity while maintaining useful information
  • Better error messages and debugging context

🏗️ Infrastructure & Development

Build System Improvements

  • Images can now be built with either Docker or ko (Kubernetes-optimized builder)
  • ARM64 architecture support across all container images
  • Optimized build times and smaller image sizes
  • Improved GitHub Actions workflows for faster CI/CD

Dependency Updates

  • Upgraded to golangci-lint v2 for better code quality checking
  • Updated multiple cloud provider SDKs (AWS, GCP, Azure)
  • Updated various Go and Python dependencies to latest stable versions
  • Updated CUDA base images

📚 Documentation

New Design Documents

  • GPU metadata retrieval design
  • Data flow through NVSentinel (from detection through remediation)
  • Overview documentation explaining what NVSentinel is and why it's important
  • Integration guides

Updated Guides

  • All documentation updated to reflect current repository structure
  • Development guide improvements
  • Contributing guidelines clarification
  • Roadmap published showing planned features

🔄 Breaking Changes & Migration Notes

Generic Maintenance Resources

The fault remediation module now uses generic maintenance resources instead of reboot-specific resources. If you're using custom remediation integrations, you may need to update your configurations.

Configuration Schema Changes

Some configuration parameters have been renamed or restructured for consistency. Review the updated values-full.yaml for the latest schema.
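
One way to spot renamed or restructured parameters before upgrading is to diff the chart's default values between two releases. The sketch below assumes Helm 3 with OCI support and uses the chart reference from the install command in these notes; the function name `diff_chart_values` is hypothetical. Note that `helm show values` prints the chart's default `values.yaml`, which may not match the `values-full.yaml` reference file line for line.

```shell
#!/bin/sh
# Chart reference taken from the v0.3.0 install command in these notes.
NVSENTINEL_CHART="oci://ghcr.io/nvidia/nvsentinel"

# Hypothetical helper: print a unified diff of default chart values
# between two chart versions. Returns diff's exit status (1 if they differ).
diff_chart_values() {
  old="$1"
  new="$2"
  tmpdir="$(mktemp -d)" || return 1
  # Fetch the default values for each version into temporary files.
  if ! helm show values "$NVSENTINEL_CHART" --version "$old" > "$tmpdir/old.yaml" ||
     ! helm show values "$NVSENTINEL_CHART" --version "$new" > "$tmpdir/new.yaml"; then
    rm -rf "$tmpdir"
    return 1
  fi
  rc=0
  diff -u "$tmpdir/old.yaml" "$tmpdir/new.yaml" || rc=$?
  rm -rf "$tmpdir"
  return "$rc"
}

# Example:
#   diff_chart_values v0.2.0 v0.3.0
```

Lines prefixed with `-` exist only in the older defaults and lines prefixed with `+` only in the newer ones, which gives a starting point for updating a custom values file.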

📈 Quality Improvements

Testing Infrastructure

  • Added comprehensive end-to-end tests for all modules
  • UAT (User Acceptance Testing) framework for AWS environments
  • Improved test coverage reporting
  • Better test isolation and reliability

Code Quality

  • Streamlined Makefiles to reduce duplication and cognitive load
  • Improved linting rules and enforcement
  • Better code organization and module boundaries
  • Reduced technical debt across the codebase

🙏 Acknowledgments

This release includes more than 140 commits from 10 contributors, improving virtually every aspect of NVSentinel.

Thank you to everyone who contributed code, documentation, testing, and feedback!

📦 What's Included

Container Images (14 components)

  • gpu-health-monitor-dcgm3 / gpu-health-monitor-dcgm4
  • syslog-health-monitor
  • csp-health-monitor
  • metadata-collector
  • platform-connectors
  • health-events-analyzer
  • fault-quarantine
  • labeler
  • node-drainer
  • fault-remediation
  • janitor
  • log-collector
  • file-server-cleanup

⚠️ Known Limitations

  • This is an experimental/preview release - use caution in production
  • Some features are disabled by default and must be explicitly enabled
  • Manual intervention may still be required for certain failure scenarios

🚀 Getting Started

To install this release:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v0.3.0 \
  --namespace nvsentinel \
  --create-namespace

For detailed installation and configuration instructions, see the README.

Release v0.2.0

17 Oct 17:22
v0.2.0
d670f87

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.2.0

Release v0.1.0

17 Oct 15:38

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.1.0