
@nitz2407
Contributor

Summary

Clean up older DCGM handles that are no longer used in gpu-health-monitor. Whenever a new handle is created, the older handle is not garbage collected, so DCGM still treats those handles as valid.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Health Monitors
  • Core Services
  • Fault Management
  • Documentation/CI
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

The fix works: I ran 100 iterations of deleting the gpu-health-monitor pods, and every time the dgcm handle was created successfully.

Screenshot 2025-10-17 at 10 07 12 PM

@copy-pr-bot

copy-pr-bot bot commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dims
Collaborator

dims commented Oct 17, 2025

/ok to test 9006aff

@lalitadithya
Collaborator

please rebase

@nitz2407
Contributor Author

please rebase

Done

@lalitadithya
Collaborator

The fix works: I ran 100 iterations of deleting the gpu-health-monitor pods, and every time the dgcm handle was created successfully.

Can you try deleting DCGM, waiting for the connectivity check to fail and then reenable DCGM? I think that is another way in which this issue can be reproduced.

@nitz2407
Contributor Author

issue

I tried a couple of iterations of deleting DCGM, and the DCGM handle was created successfully once the DCGM pod was restored.

Screenshot 2025-10-24 at 5 45 43 PM

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. The issue occurred when the gpu-health-monitor pod restarted or experienced connectivity failures—old DCGM handles were not properly released, causing DCGM to maintain stale references that blocked new handle creation. The fix adds an explicit dcgm_handle.Shutdown() call before deleting handles (ensuring the DCGM library releases internal resources) and resets all related state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after initialization or connectivity failures. This ensures clean retry attempts by preventing the code from operating with partially initialized state. The change integrates with the existing retry loop in the _setup_dcgm method, which checks if dcgm_handle is None to decide whether reinitialization is needed.
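
The cleanup order described in this summary can be sketched as below. This is an illustration only, not the code from dcgm.py: `FakeHandle` and `cleanup_dcgm_resources` are hypothetical stand-ins, and the empty list and dict used for `gpu_ids` and `gpu_serials` are assumed initial values. Only the `Shutdown()` method name and the four state variable names come from the review.

```python
class FakeHandle:
    """Hypothetical stand-in for a DCGM handle object."""

    def __init__(self):
        self.is_shut_down = False

    def Shutdown(self):
        # The real call would ask DCGM to release the connection.
        self.is_shut_down = True


def cleanup_dcgm_resources(dcgm_handle):
    """Shut the handle down (if there is one) and return freshly reset state."""
    if dcgm_handle is not None:
        dcgm_handle.Shutdown()
    # Order: dcgm_handle, dcgm_group, gpu_ids, gpu_serials (assumed initial values)
    return None, None, [], {}


handle = FakeHandle()
dcgm_handle, dcgm_group, gpu_ids, gpu_serials = cleanup_dcgm_resources(handle)
```

Returning the reset values from one place means a caller cannot release the handle and forget to clear the group or the GPU metadata.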

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added Shutdown() call before handle deletion and full state reset after initialization/connectivity failures to prevent DCGM handle leaks

Confidence score: 4/5

  • This PR is safe to merge with low risk of production issues
  • Score reflects thorough manual testing (100 iterations) and a straightforward fix to a well-isolated resource leak, though the lack of automated tests for this specific failure path and the addition of multiple state resets in error handling paths introduce minor risk
  • The _cleanup_dcgm_resources method (lines 263-264) requires close attention to ensure the shutdown order is correct and that no exceptions during shutdown could leave partial state
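
The missing automated coverage noted above could be approached with a mock-based unit test along these lines. This is a hedged sketch: `cleanup` is a hypothetical helper written for the example, not the project's `_cleanup_dcgm_resources`, and the test only checks that `Shutdown()` is invoked exactly once.

```python
from unittest import mock


def cleanup(handle):
    """Hypothetical cleanup helper: shut down first, then drop the reference."""
    if handle is not None:
        handle.Shutdown()
    return None


def test_cleanup_calls_shutdown_once():
    handle = mock.MagicMock()
    result = cleanup(handle)
    # Fails if Shutdown() was skipped or called more than once.
    handle.Shutdown.assert_called_once_with()
    assert result is None


def test_cleanup_tolerates_missing_handle():
    # No handle was ever created, so there is nothing to shut down.
    assert cleanup(None) is None


test_cleanup_calls_shutdown_once()
test_cleanup_tolerates_missing_handle()
```

A mock handle keeps the test independent of a running DCGM host engine, which is what makes this failure path hard to exercise otherwise.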

Sequence Diagram

sequenceDiagram
    participant User
    participant DCGMWatcher
    participant ThreadPoolExecutor
    participant DCGM as "DCGM (pydcgm)"
    participant Callbacks as "Callback Functions"
    
    User->>DCGMWatcher: "start(fields_to_monitor, exit_event)"
    
    loop Until exit event is set
        DCGMWatcher->>DCGMWatcher: "Check if dcgm_handle is None"
        
        alt dcgm_handle is None
            DCGMWatcher->>DCGM: "_get_dcgm_handle()"
            alt Handle creation successful
                DCGM-->>DCGMWatcher: "Return dcgm_handle"
                DCGMWatcher->>DCGM: "_initialize_dcgm_monitoring()"
                DCGM->>DCGM: "GetEntityGroupEntities(GPU)"
                DCGM->>DCGM: "GetEntityGroupEntities(SWITCH)"
                DCGM->>DCGM: "Create DcgmGroup"
                DCGM->>DCGM: "Set health watches"
                DCGM->>DCGM: "Get GPU serial numbers"
                DCGM-->>DCGMWatcher: "Return dcgm_group, gpu_ids, gpu_serials"
            else Handle creation failed
                DCGM-->>DCGMWatcher: "Raise exception"
                DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
                DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(dcgm_connectivity_failed)"
                ThreadPoolExecutor->>Callbacks: "dcgm_connectivity_failed()"
                DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
            end
        else dcgm_handle exists
            DCGMWatcher->>DCGM: "_perform_health_check(dcgm_group)"
            alt Health check successful
                DCGM->>DCGM: "dcgm_group.health.Check()"
                DCGM-->>DCGMWatcher: "Return health_details, connectivity_success=True"
                DCGMWatcher->>DCGMWatcher: "Process incidents and accumulate failures"
                DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(health_event_occurred)"
                ThreadPoolExecutor->>Callbacks: "health_event_occurred(health_status, gpu_ids, gpu_serials)"
            else Health check failed (timeout/error)
                DCGM-->>DCGMWatcher: "Return empty health_status, connectivity_success=False"
                DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
                DCGMWatcher->>DCGM: "Shutdown and delete handle"
                DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
            end
        end
        
        DCGMWatcher->>DCGMWatcher: "Wait for poll_interval_seconds"
    end
    
    DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
    DCGMWatcher->>ThreadPoolExecutor: "shutdown(cancel_futures=True)"
    DCGMWatcher-->>User: "Exit monitoring loop"
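
The control flow in the diagram can be approximated in Python as follows. This is a simplified sketch under stated assumptions: `run_monitor`, `connect`, `check`, and `cleanup` are hypothetical names, callbacks and the thread pool are omitted, and the short poll interval is chosen only so the example finishes quickly.

```python
import threading


def run_monitor(connect, check, cleanup, exit_event, poll_interval_seconds=0.01):
    """Loop until exit_event is set; reconnect whenever the handle is None."""
    handle = None
    while not exit_event.is_set():
        if handle is None:
            try:
                handle = connect()
            except Exception:
                handle = None  # stay in the "needs initialization" state
        elif not check(handle):
            cleanup(handle)  # shut down the stale handle
            handle = None    # forces re-initialization on the next iteration
        exit_event.wait(poll_interval_seconds)
    if handle is not None:
        cleanup(handle)  # final cleanup when the loop exits


events = []
exit_event = threading.Event()
checks = iter([False, True])  # first health check fails, second succeeds


def connect():
    events.append("connect")
    return object()


def check(handle):
    ok = next(checks)
    events.append("check ok" if ok else "check failed")
    if ok:
        exit_event.set()  # end the demo after one healthy check
    return ok


def cleanup(handle):
    events.append("cleanup")


run_monitor(connect, check, cleanup, exit_event)
```

In this run the recorded events show one failed check followed by a cleanup and a fresh connect, which is the recovery path the PR is meant to make reliable.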

1 file reviewed, no comments

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR fixes a DCGM handle leak in the gpu-health-monitor component that occurred when monitor pods were restarted. The core issue was that DCGM handle objects were being deleted in Python without calling the underlying Shutdown() method, causing DCGM's internal state to retain references to stale handles. The fix adds explicit dcgm_handle.Shutdown() calls before deletion and ensures all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are consistently reset after cleanup on both initialization and connectivity failure paths. This integrates with the existing _cleanup_dcgm_resources() helper method in the DCGM watcher module, which manages the lifecycle of DCGM group objects.
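
A short sketch of why deleting a Python name is not the same as releasing the underlying resource. The classes here are hypothetical stand-ins, not the pydcgm classes, and the idea that a group object keeps its own reference to the handle is an assumption made for the example.

```python
import weakref


class FakeHandle:
    """Hypothetical stand-in for a DCGM handle; not the pydcgm class."""

    open_count = 0  # what the "server" believes is still connected

    def __init__(self):
        FakeHandle.open_count += 1
        self.closed = False

    def Shutdown(self):
        if not self.closed:
            self.closed = True
            FakeHandle.open_count -= 1


class FakeGroup:
    """Keeps its own reference to the handle, as a group object might."""

    def __init__(self, handle):
        self.handle = handle


handle = FakeHandle()
group = FakeGroup(handle)
probe = weakref.ref(handle)

del handle  # removes one name; the object is still reachable via `group`
still_alive_after_del = probe() is not None
open_after_del = FakeHandle.open_count

group.handle.Shutdown()  # explicit release does not depend on the collector
open_after_shutdown = FakeHandle.open_count
```

`del` only unbinds a name, so any other live reference keeps the object, and whatever it holds open, in place. Calling `Shutdown()` explicitly avoids relying on when, or whether, the object is collected.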

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit DCGM handle shutdown and state reset to fix resource leak on pod restart

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes follow a well-established cleanup pattern, add proper resource lifecycle management with explicit Shutdown() calls, and include state resets on all failure paths; the fix has been validated through 100 iterations of pod deletion/recreation
  • No files require special attention

1 file reviewed, no comments

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. When the monitor encountered failures and attempted to recreate DCGM connections, old handles were not being properly released, causing DCGM to retain stale references that prevented new handle creation. The fix adds explicit Shutdown() calls on handles before deletion and resets all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after cleanup. This ensures clean recovery from transient failures during both initialization and connectivity checks. The change integrates with the existing _cleanup_dcgm_resources helper method and follows the established error-handling patterns in health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py.

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds DCGM handle cleanup via Shutdown() and state variable resets on initialization and connectivity failures

Confidence score: 5/5

  • This PR is safe to merge with minimal risk and fixes a critical reliability issue in production.
  • Score reflects a targeted fix with clear before/after behavior, successful manual validation (100 pod deletion iterations), and no breaking changes—the fix only adds proper cleanup where it was missing.
  • No files require special attention; the change is well-scoped to error-handling paths that were previously incomplete.

1 file reviewed, no comments


@nitz2407
Contributor Author

/ok to test 6d40f23

@XRFXLP
Member

XRFXLP commented Oct 27, 2025

/ok to test 6d40f23

@github-actions

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/logger 0.00% (ø)
github.com/nvidia/nvsentinel/commons/pkg/server 0.00% (ø)
github.com/nvidia/nvsentinel/data-models/pkg/model 0.00% (ø)
github.com/nvidia/nvsentinel/data-models/pkg/protos 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker 25.90% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/common 1.27% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator 25.55% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation 37.67% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer 38.99% (+0.22%) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/nodeinfo 48.86% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler 18.21% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common 56.52% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus 65.03% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler 28.16% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/protos 0.00% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher 35.48% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 56.77% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/cmd/csp-health-monitor 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/cmd/maintenance-notifier 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/gcp 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/datastore 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/protos 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/lsnvlink 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/types 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser 0.00% (ø)
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 51.85% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 75.23% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/csp 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 80.00% (ø)
github.com/nvidia/nvsentinel/labeler-module 0.00% (ø)
github.com/nvidia/nvsentinel/labeler-module/pkg/labeler 71.93% (ø)
github.com/nvidia/nvsentinel/node-drainer-module 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/evaluator 44.89% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/informers 32.79% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/mongodb 12.05% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/queue 69.01% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler 75.83% (ø)
github.com/nvidia/nvsentinel/platform-connectors 0.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 84.19% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store 48.92% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/protos 0.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer 100.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/server 0.00% (ø)
github.com/nvidia/nvsentinel/statemanager 90.91% (ø)
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher 75.28% (-0.74%) 👎
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)
github.com/nvidia/nvsentinel/tilt/simple-health-client 0.00% (ø)
github.com/nvidia/nvsentinel/tilt/simple-health-client/protos 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/logger/logger.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/commons/pkg/server/server.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/data-models/pkg/model/maintenance_event.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/data-models/pkg/protos/health_event.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/data-models/pkg/protos/health_event_grpc.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/main.go 0.00% (ø) 630 0 630
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/breaker.go 25.60% (ø) 1004 257 747
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/common/healthEventsBuffer.go 1.27% (ø) 79 1 78
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator.go 24.03% (ø) 670 161 509
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go 26.77% (+0.51%) 198 53 (+1) 145 (-1) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_all.go 32.65% (ø) 49 16 33
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go 30.51% (-3.39%) 59 18 (-2) 41 (+2) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_iface.go 33.33% (+5.56%) 18 6 (+1) 12 (-1) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation/healthEventsAnnotationMap.go 37.67% (ø) 507 191 316
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer.go 38.99% (+0.22%) 908 354 (+2) 554 (-2) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/nodeinfo/nodeinfo.go 48.86% (ø) 219 107 112
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/node_quarantine.go 18.59% (ø) 1065 198 867
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler.go 18.10% (ø) 3371 610 2761
github.com/nvidia/nvsentinel/fault-remediation-module/main.go 0.00% (ø) 349 0 349
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common/equivalence_groups.go 56.52% (ø) 46 26 20
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus/factory.go 80.00% (ø) 15 12 3
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus/reboot.go 63.69% (ø) 168 107 61
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/annotation.go 31.28% (ø) 211 66 145
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler.go 37.01% (ø) 481 178 303
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/remediation.go 17.01% (ø) 441 75 366
github.com/nvidia/nvsentinel/health-events-analyzer/main.go 0.00% (ø) 177 0 177
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/config/rules.go 0.00% (ø) 7 0 7
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/protos/platformconnector.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/publisher.go 35.48% (ø) 62 22 40
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 56.77% (ø) 155 88 67
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/config/config.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws/aws.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/gcp/gcp.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/datastore/datastore.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/aws_normalizer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/gcp_normalizer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/processor.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common/common.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/protos/platformconnector_grpc.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/lsnvlink/topology.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/types/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/csv.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/factory.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/sidecar.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/groupversion_info.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/rebootnode_types.go 70.79% (ø) 89 63 26
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 35.00% (ø) 100 35 65
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 63 0 63
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 77.67% (ø) 103 80 23
github.com/nvidia/nvsentinel/janitor/pkg/controller/test_utils.go 33.33% (ø) 6 2 4
github.com/nvidia/nvsentinel/janitor/pkg/csp/aws.go 0.00% (ø) 84 0 84
github.com/nvidia/nvsentinel/janitor/pkg/csp/azure.go 0.00% (ø) 194 0 194
github.com/nvidia/nvsentinel/janitor/pkg/csp/client.go 0.00% (ø) 14 0 14
github.com/nvidia/nvsentinel/janitor/pkg/csp/gcp.go 0.00% (ø) 126 0 126
github.com/nvidia/nvsentinel/janitor/pkg/csp/kind.go 0.00% (ø) 66 0 66
github.com/nvidia/nvsentinel/janitor/pkg/csp/oci.go 0.00% (ø) 66 0 66
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 80.00% (ø) 10 8 2
github.com/nvidia/nvsentinel/labeler-module/main.go 0.00% (ø) 54 0 54
github.com/nvidia/nvsentinel/labeler-module/pkg/labeler/labeler.go 71.93% (ø) 171 123 48
github.com/nvidia/nvsentinel/node-drainer-module/main.go 0.00% (ø) 101 0 101
github.com/nvidia/nvsentinel/node-drainer-module/pkg/config/config.go 0.00% (ø) 135 0 135
github.com/nvidia/nvsentinel/node-drainer-module/pkg/evaluator/evaluator.go 44.30% (ø) 158 70 88
github.com/nvidia/nvsentinel/node-drainer-module/pkg/informers/informers.go 32.79% (ø) 613 201 412
github.com/nvidia/nvsentinel/node-drainer-module/pkg/initializer/init.go 0.00% (ø) 77 0 77
github.com/nvidia/nvsentinel/node-drainer-module/pkg/mongodb/event_watcher.go 0.00% (ø) 125 0 125
github.com/nvidia/nvsentinel/node-drainer-module/pkg/queue/queue.go 64.71% (ø) 34 22 12
github.com/nvidia/nvsentinel/node-drainer-module/pkg/queue/worker.go 72.97% (ø) 37 27 10
github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler/reconciler.go 75.83% (ø) 120 91 29
github.com/nvidia/nvsentinel/platform-connectors/main.go 0.00% (ø) 118 0 118
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_connector_impl.go 5.00% (ø) 20 1 19
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/process_node_events.go 92.31% (ø) 195 180 15
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/model.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/storeConnectorImpl.go 48.92% (ø) 139 68 71
github.com/nvidia/nvsentinel/platform-connectors/pkg/protos/platformconnector.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/protos/platformconnector_grpc.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer_impl.go 100.00% (ø) 17 17 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/server/platform_connector_server.go 0.00% (ø) 6 0 6
github.com/nvidia/nvsentinel/statemanager/statemanager.go 92.59% (ø) 54 50 4
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher/watchStore.go 72.69% (-0.93%) 216 157 (-2) 59 (+2) 👎
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tilt/simple-health-client/main.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tilt/simple-health-client/protos/platformconnector.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tilt/simple-health-client/protos/platformconnector_grpc.pb.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/logger/logger_test.go
  • github.com/nvidia/nvsentinel/commons/pkg/server/server_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation/healthEventsAnnotationMap_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common/equivalence_groups_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws/aws_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/gcp_normalizer_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common/common_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/csv_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/suite_test.go
  • github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler/reconciler_integration_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/storeConnectorImpl_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR addresses a resource leak in the GPU health monitor's DCGM integration by ensuring proper handle cleanup on failure paths. Previously, when the monitor encountered connectivity issues or pod restarts, DCGM handles were deleted via Python's garbage collector without explicitly calling Shutdown(), causing DCGM's internal tracking to retain them as active resources. The fix adds explicit Shutdown() calls before handle deletion and resets all monitoring state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after failures. This ensures the monitoring loop can cleanly re-initialize DCGM resources on the next iteration rather than accumulating zombie handles across restarts.

PR Description Notes:

  • Minor typo: "dgcm handle" should be "DCGM handle" in the testing description

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added explicit Shutdown() call before handle deletion and state-reset logic after connectivity failures to prevent DCGM handle leaks

Confidence score: 4/5

  • This PR is safe to merge with low risk; it fixes a clear resource leak with a straightforward solution
  • Score reflects that the fix follows best practices for resource cleanup and the manual testing demonstrates effectiveness (100 iterations), though the absence of automated tests for this specific failure path and lack of error handling around the Shutdown() call itself prevent a perfect score
  • Pay close attention to health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py lines 263-264 and 284-299 to verify the cleanup is comprehensive for all failure scenarios
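
One way to address the missing error handling around `Shutdown()` mentioned above is to wrap the call so that the state reset still happens when shutdown raises. This is a sketch of that idea, not the PR's code: `safe_cleanup`, the dict-based state, and `BrokenHandle` are all hypothetical.

```python
import logging

log = logging.getLogger("dcgm_cleanup_sketch")


def safe_cleanup(state):
    """Shut the handle down if possible, and always reset state afterwards."""
    handle = state.get("dcgm_handle")
    try:
        if handle is not None:
            handle.Shutdown()
    except Exception:
        # Log and continue: a failed shutdown must not block the reset.
        log.exception("Shutdown() raised during cleanup")
    finally:
        state.update(dcgm_handle=None, dcgm_group=None, gpu_ids=[], gpu_serials={})
    return state


class BrokenHandle:
    """Hypothetical handle whose Shutdown() always fails."""

    def Shutdown(self):
        raise RuntimeError("connection already lost")


state = safe_cleanup({"dcgm_handle": BrokenHandle(), "dcgm_group": object(),
                      "gpu_ids": [0, 1], "gpu_serials": {0: "A"}})
```

The `finally` clause guarantees the loop sees `dcgm_handle is None` on the next iteration even when the connection is already gone and `Shutdown()` cannot complete.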

1 file reviewed, no comments


@XRFXLP
Copy link
Member

XRFXLP commented Oct 27, 2025

/ok to test b5fdae2

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR fixes a critical DCGM handle leak in the GPU health monitor by adding explicit Shutdown() calls before deleting DCGM handles and consistently resetting state variables after cleanup. Previously, when the monitor encountered initialization or connectivity failures, it would create new DCGM handles without properly releasing old ones, causing DCGM's internal handle tracking to accumulate orphaned handles. The fix integrates with the existing retry logic (the while dcgm_handle is None loop on line 277) by ensuring all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are reset to their initial values, which is critical because the loop condition checks dcgm_handle is None to determine whether re-initialization is needed.
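
Because the retry condition depends on `dcgm_handle is None`, a partial reset can leave the loop in an inconsistent state. One possible way to make the reset atomic is to group the four variables, sketched here with a hypothetical `WatcherState` dataclass. The field types and the serial-number values are assumptions, and the actual code keeps these as separate variables.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class WatcherState:
    """Hypothetical container for the four variables named in the review."""

    dcgm_handle: Optional[Any] = None
    dcgm_group: Optional[Any] = None
    gpu_ids: List[int] = field(default_factory=list)
    gpu_serials: Dict[int, str] = field(default_factory=dict)

    def needs_init(self) -> bool:
        # Mirrors the `dcgm_handle is None` check used by the retry loop.
        return self.dcgm_handle is None

    def reset(self) -> None:
        """Return every field to its initial value in one step."""
        fresh = WatcherState()
        self.__dict__.update(fresh.__dict__)


state = WatcherState(dcgm_handle=object(), dcgm_group=object(),
                     gpu_ids=[0, 1], gpu_serials={0: "SN0", 1: "SN1"})
state.reset()
```

After `reset()`, the state compares equal to a newly constructed `WatcherState`, so no stale group or GPU list can survive a failed attempt.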

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit Shutdown() calls before deleting DCGM handles and resets all state variables to fix resource leaks in error paths

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix addresses a well-understood resource leak pattern in the DCGM Python bindings where explicit Shutdown() is required before handle deletion; the state variable resets ensure the retry loop behaves correctly, and the manual testing with 100 pod deletion iterations provides strong empirical evidence of effectiveness
  • No files require special attention

1 file reviewed, no comments


@nitz2407
Contributor Author

/ok to test a4fa913

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR addresses a DCGM handle leak in the GPU health monitor by adding proper resource cleanup on shutdown and failure paths. Previously, when pods were restarted or connectivity failed, DCGM handles were abandoned without explicit release, causing DCGM to retain references and eventually exhaust available handles. The fix adds Shutdown() calls before handle deletion and comprehensive state resets, ensuring clean re-initialization on subsequent attempts. This integrates with the existing error handling pattern where dcgm_handle is None signals the need for re-initialization, and aligns with the monitor's retry logic that attempts reconnection every 30 seconds.
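
The exhaustion effect described here can be illustrated with a toy model. `FakeHostEngine` is a hypothetical server with a fixed number of connection slots; the limit of 4 is arbitrary and says nothing about DCGM's real limits.

```python
class FakeHostEngine:
    """Hypothetical server with a fixed number of connection slots."""

    def __init__(self, max_handles):
        self.max_handles = max_handles
        self.active = set()
        self.next_id = 0

    def connect(self):
        if len(self.active) >= self.max_handles:
            raise RuntimeError("no free handles")
        self.next_id += 1
        self.active.add(self.next_id)
        return self.next_id

    def shutdown(self, handle_id):
        self.active.discard(handle_id)


def reconnect_many(engine, attempts, release_old):
    """Reconnect repeatedly; return how many attempts succeeded."""
    current = None
    succeeded = 0
    for _ in range(attempts):
        if release_old and current is not None:
            engine.shutdown(current)  # release before asking for a new handle
        try:
            current = engine.connect()
            succeeded += 1
        except RuntimeError:
            break  # the server has no slots left
    return succeeded


leaky = reconnect_many(FakeHostEngine(max_handles=4), attempts=100, release_old=False)
clean = reconnect_many(FakeHostEngine(max_handles=4), attempts=100, release_old=True)
```

Without releasing the old handle, the model stops after 4 successful connections; with the release step, all 100 reconnects succeed, which matches the shape of the manual test reported in the PR.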

Important Files Changed

Filename | Score | Overview
health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup function and state reset after initialization/connectivity failures

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects straightforward resource cleanup logic with clear ownership semantics, validated through 100 pod restart iterations showing consistent handle recreation without exhaustion
  • No files require special attention

1 file reviewed, no comments

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers the entire PR from start to finish.

This PR fixes a critical resource leak in the GPU health monitor's DCGM (NVIDIA Data Center GPU Manager) integration. Previously, when the monitor encountered failures during initialization or runtime, DCGM handles were not properly released, causing handle accumulation that prevented new handle creation. The fix adds three key changes: (1) explicit Shutdown() calls before deleting DCGM handles in _cleanup_dcgm_resources(), (2) state variable resets (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after cleanup in the initialization error path, and (3) a missing _cleanup_dcgm_resources() call when connectivity failures are detected during health checks. This ensures consistent cleanup behavior across all failure paths—the monitor's main loop relies on dcgm_handle is None checks to trigger re-initialization, so both DCGM-level cleanup and Python variable resets are necessary. The changes integrate cleanly with the existing retry logic in the health monitor's main loop, which already handles re-initialization when handles are null.
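
The control flow described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the monitor's real loop: `FakeHandle`, `run_monitor`, `create_handle`, and `check_connectivity` are hypothetical names, the loop is bounded by `cycles` so it terminates, and the wait between retries is omitted.

```python
class FakeHandle:
    """Hypothetical stand-in for a DCGM handle."""

    def __init__(self, name):
        self.name = name
        self.shut_down = False

    def Shutdown(self):
        self.shut_down = True


def run_monitor(create_handle, check_connectivity, cycles):
    """Run `cycles` health-check iterations.

    `create_handle` and `check_connectivity` are hypothetical hooks standing
    in for DCGM initialization and the connectivity check.
    Returns the current handle (or None) and every handle created.
    """
    dcgm_handle = None
    created = []
    for _ in range(cycles):
        # Re-initialize only when no handle is held.
        if dcgm_handle is None:
            try:
                dcgm_handle = create_handle()
            except ConnectionError:
                # Initialization failed: state stays reset, retry next cycle.
                continue
            created.append(dcgm_handle)

        # Connectivity loss: release the handle explicitly, then clear the
        # variable so the next cycle creates a fresh one.
        if not check_connectivity(dcgm_handle):
            dcgm_handle.Shutdown()
            dcgm_handle = None
    return dcgm_handle, created
```

The point of the sketch is the ordering on the connectivity-failure path: `Shutdown()` runs before the variable is set to `None`, so no handle is abandoned while still registered.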

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup method and state resets in two failure paths (initialization error and connectivity loss) to prevent DCGM handle accumulation |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk—the changes address a well-defined resource leak with a straightforward fix pattern applied consistently across failure paths.
  • Score reflects the focused nature of the fix (three small additions), clear validation (100 pod deletion cycles), and low risk of regression—the changes only affect error paths and add proper cleanup that was missing.
  • No files require special attention—the single file changed contains defensive additions that improve resource management without altering the happy path logic.

1 file reviewed, no comments

@lalitadithya
Collaborator

@coderabbitai full review

@coderabbitai
coderabbitai bot commented Oct 28, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
coderabbitai bot commented Oct 28, 2025

Warning

Rate limit exceeded

@lalitadithya has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 8 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between 1e17aea and 1daa2af.

📒 Files selected for processing (1)
  • health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py (2 hunks)

@lalitadithya added this to the v0.3.0 milestone Oct 29, 2025
@lalitadithya enabled auto-merge (squash) October 29, 2025 13:06
@lalitadithya
Collaborator

/ok to test e78bc54

@github-actions
Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager 40.00% (ø)
github.com/nvidia/nvsentinel/commons/pkg/statemanager 39.91% (ø)
github.com/nvidia/nvsentinel/data-models/pkg/model 0.00% (ø)
github.com/nvidia/nvsentinel/data-models/pkg/protos 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker 30.55% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/common 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator 41.95% (-0.23%) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation 44.89% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer 38.52% (-0.08%) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/metrics 47.37% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/mongodb 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/nodeinfo 0.00% (ø)
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler 26.82% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module 0.00% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common 56.52% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus 65.03% (ø)
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler 28.16% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher 35.48% (ø)
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler 56.77% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/patterns 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/types 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid 0.00% (ø)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser 0.00% (ø)
github.com/nvidia/nvsentinel/janitor 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 31.67% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/controller 57.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/metrics 60.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1 48.36% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/config 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/evaluator 44.89% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/informers 32.79% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/initializer 0.00% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/mongodb 11.90% (ø)
github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler 75.83% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 84.19% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store 48.92% (-1.44%) 👎
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer 100.00% (ø)
github.com/nvidia/nvsentinel/platform-connectors/pkg/server 0.00% (ø)
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher 63.75% (ø)
github.com/nvidia/nvsentinel/tests 0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ø)
github.com/nvidia/nvsentinel/tilt/simple-health-client 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager/env.go 39.92% (ø) 253 101 152
github.com/nvidia/nvsentinel/commons/pkg/configmanager/loader.go 41.67% (ø) 12 5 7
github.com/nvidia/nvsentinel/commons/pkg/statemanager/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager.go 40.65% (ø) 214 87 127
github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanagermock.go 0.00% (ø) 4 0 4
github.com/nvidia/nvsentinel/data-models/pkg/model/health_event_extentions.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/data-models/pkg/protos/health_event.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/data-models/pkg/protos/health_event_grpc.pb.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/main.go 0.00% (ø) 223 0 223
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/breaker.go 30.55% (ø) 825 252 573
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/common/common.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/common/healthEventsBuffer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/config/config.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator.go 40.23% (ø) 614 247 367
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go 44.24% (-1.82%) 165 73 (-3) 92 (+3) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go 47.92% (ø) 48 23 25
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go 44.89% (ø) 421 189 232
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client.go 42.24% (ø) 831 351 480
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer.go 31.08% (-0.24%) 415 129 (-1) 286 (+1) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/initializer/init.go 0.00% (ø) 442 0 442
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/metrics/metrics.go 47.37% (ø) 19 9 10
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/mongodb/event_watcher.go 0.00% (ø) 440 0 440
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/nodeinfo/nodeinfo.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/node_quarantine.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler.go 26.82% (ø) 2189 587 1602
github.com/nvidia/nvsentinel/fault-remediation-module/main.go 0.00% (ø) 349 0 349
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common/equivalence_groups.go 56.52% (ø) 46 26 20
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus/factory.go 80.00% (ø) 15 12 3
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/fault_remediation_client_interface.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler.go 37.01% (ø) 481 178 303
github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/remediation.go 17.01% (ø) 441 75 366
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/publisher/publisher.go 35.48% (ø) 62 22 40
github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler.go 56.77% (ø) 155 88 67
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws/aws.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/gcp_normalizer.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common/common.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/metrics.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/patterns/xid.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/types/types.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/csv.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go 1.75% (ø) 57 1 56
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/terminatenode_types.go 55.65% (ø) 115 64 51
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go 24.03% (ø) 516 124 392
github.com/nvidia/nvsentinel/janitor/main.go 0.00% (ø) 277 0 277
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 0.00% (ø) 43 0 43
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 56.85% (ø) 197 112 85
github.com/nvidia/nvsentinel/janitor/pkg/metrics/metrics.go 60.00% (ø) 20 12 8
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook.go 48.36% (ø) 275 133 142
github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/rebootnode_webhook.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/node-drainer-module/pkg/config/config.go 0.00% (ø) 135 0 135
github.com/nvidia/nvsentinel/node-drainer-module/pkg/evaluator/evaluator.go 44.30% (ø) 158 70 88
github.com/nvidia/nvsentinel/node-drainer-module/pkg/evaluator/types.go 50.00% (ø) 18 9 9
github.com/nvidia/nvsentinel/node-drainer-module/pkg/informers/informers.go 32.79% (ø) 613 201 412
github.com/nvidia/nvsentinel/node-drainer-module/pkg/initializer/init.go 0.00% (ø) 77 0 77
github.com/nvidia/nvsentinel/node-drainer-module/pkg/mongodb/event_watcher.go 0.00% (ø) 127 0 127
github.com/nvidia/nvsentinel/node-drainer-module/pkg/mongodb/helpers.go 48.78% (ø) 41 20 21
github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler/reconciler.go 75.83% (ø) 120 91 29
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_connector.go 5.00% (ø) 20 1 19
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/process_node_events.go 92.31% (ø) 195 180 15
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/store_connector.go 48.92% (-1.44%) 139 68 (-2) 71 (+2) 👎
github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer.go 100.00% (ø) 17 17 0
github.com/nvidia/nvsentinel/platform-connectors/pkg/server/platform_connector_server.go 0.00% (ø) 6 0 6
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher/watch_store.go 72.69% (ø) 216 157 59
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher/watch_store_mock.go 0.00% (ø) 49 0 49
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ø) 0 0 0
github.com/nvidia/nvsentinel/tilt/simple-health-client/main.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/configmanager/env_test.go
  • github.com/nvidia/nvsentinel/commons/pkg/configmanager/loader_test.go
  • github.com/nvidia/nvsentinel/commons/pkg/statemanager/statemanager_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/breaker_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/node_quarantine_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/common/equivalence_groups_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/crstatus/crstatus_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/fault-remediation-module/pkg/reconciler/remediation_test.go
  • github.com/nvidia/nvsentinel/health-events-analyzer/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/csp/aws/aws_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/event/gcp_normalizer_test.go
  • github.com/nvidia/nvsentinel/health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/common/common_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/parser/csv_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/suite_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/janitor_webhook_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/webhook/v1alpha1/rebootnode_webhook_test.go
  • github.com/nvidia/nvsentinel/node-drainer-module/pkg/reconciler/reconciler_integration_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_connector_envtest_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/store/store_connector_test.go
  • github.com/nvidia/nvsentinel/platform-connectors/pkg/ringbuffer/ring_buffer_test.go
  • github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher/watch_store_test.go
  • github.com/nvidia/nvsentinel/tests/smoke_test.go

@lalitadithya merged commit 1cc57eb into NVIDIA:main Oct 29, 2025
42 checks passed
5 participants