fix: resource cleanup and state reset for dcgm handle after failures … #76
Conversation
/ok to test 9006aff
9006aff to 57dd3c7 (Compare)
57dd3c7 to 8dc47f4 (Compare)
please rebase
8dc47f4 to 87b0b34 (Compare)
Done
Can you try deleting DCGM, waiting for the connectivity check to fail and then reenable DCGM? I think that is another way in which this issue can be reproduced.
…or connectivity issues
Signed-off-by: Nitin Jain <[email protected]>
87b0b34 to fd9f319 (Compare)
Greptile Overview
Greptile Summary
This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. The issue occurred when the gpu-health-monitor pod restarted or experienced connectivity failures—old DCGM handles were not properly released, causing DCGM to maintain stale references that blocked new handle creation. The fix adds an explicit dcgm_handle.Shutdown() call before deleting handles (ensuring the DCGM library releases internal resources) and resets all related state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after initialization or connectivity failures. This ensures clean retry attempts by preventing the code from operating with partially initialized state. The change integrates with the existing retry loop in the _setup_dcgm method, which checks if dcgm_handle is None to decide whether reinitialization is needed.
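As a concrete illustration of that pattern, here is a minimal sketch; the attribute and method names (`dcgm_handle`, `dcgm_group`, `gpu_ids`, `gpu_serials`, `_cleanup_dcgm_resources`) follow the summary above, but the bodies are assumptions rather than the PR's actual code:

```python
class DcgmWatcher:
    """Illustrative stand-in for the watcher class described above."""

    def __init__(self):
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}

    def _cleanup_dcgm_resources(self):
        # Drop the group first, then shut the handle down so the DCGM
        # library releases its internal reference before the Python
        # object is deleted: the core of the fix described above.
        if self.dcgm_group is not None:
            del self.dcgm_group
        if self.dcgm_handle is not None:
            self.dcgm_handle.Shutdown()
            del self.dcgm_handle
        # Reset every state variable so the retry loop's
        # `dcgm_handle is None` check triggers re-initialization.
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}
```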
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added Shutdown() call before handle deletion and full state reset after initialization/connectivity failures to prevent DCGM handle leaks |
Confidence score: 4/5
- This PR is safe to merge with low risk of production issues
- Score reflects thorough manual testing (100 iterations) and a straightforward fix to a well-isolated resource leak, though the lack of automated tests for this specific failure path and the addition of multiple state resets in error handling paths introduce minor risk
- The `_cleanup_dcgm_resources` method (lines 263-264) requires close attention to ensure the shutdown order is correct and that no exceptions during shutdown could leave partial state; one possible mitigation is sketched below
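A hedged sketch of that mitigation, assuming the attribute names from the PR; this is not the PR's actual code. Wrapping the teardown in `try`/`finally` guarantees the state reset even if `Shutdown()` raises:

```python
def _cleanup_dcgm_resources(self):
    # Hypothetical hardening of the cleanup path: even if Shutdown()
    # raises, the finally block still clears the stale references, so
    # a failed teardown cannot leave the watcher with partial state.
    try:
        if self.dcgm_group is not None:
            del self.dcgm_group
        if self.dcgm_handle is not None:
            self.dcgm_handle.Shutdown()
    finally:
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}
```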
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant DCGMWatcher
participant ThreadPoolExecutor
participant DCGM as "DCGM (pydcgm)"
participant Callbacks as "Callback Functions"
User->>DCGMWatcher: "start(fields_to_monitor, exit_event)"
loop Until exit event is set
DCGMWatcher->>DCGMWatcher: "Check if dcgm_handle is None"
alt dcgm_handle is None
DCGMWatcher->>DCGM: "_get_dcgm_handle()"
alt Handle creation successful
DCGM-->>DCGMWatcher: "Return dcgm_handle"
DCGMWatcher->>DCGM: "_initialize_dcgm_monitoring()"
DCGM->>DCGM: "GetEntityGroupEntities(GPU)"
DCGM->>DCGM: "GetEntityGroupEntities(SWITCH)"
DCGM->>DCGM: "Create DcgmGroup"
DCGM->>DCGM: "Set health watches"
DCGM->>DCGM: "Get GPU serial numbers"
DCGM-->>DCGMWatcher: "Return dcgm_group, gpu_ids, gpu_serials"
else Handle creation failed
DCGM-->>DCGMWatcher: "Raise exception"
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(dcgm_connectivity_failed)"
ThreadPoolExecutor->>Callbacks: "dcgm_connectivity_failed()"
DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
end
else dcgm_handle exists
DCGMWatcher->>DCGM: "_perform_health_check(dcgm_group)"
alt Health check successful
DCGM->>DCGM: "dcgm_group.health.Check()"
DCGM-->>DCGMWatcher: "Return health_details, connectivity_success=True"
DCGMWatcher->>DCGMWatcher: "Process incidents and accumulate failures"
DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(health_event_occurred)"
ThreadPoolExecutor->>Callbacks: "health_event_occurred(health_status, gpu_ids, gpu_serials)"
else Health check failed (timeout/error)
DCGM-->>DCGMWatcher: "Return empty health_status, connectivity_success=False"
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>DCGM: "Shutdown and delete handle"
DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
end
end
DCGMWatcher->>DCGMWatcher: "Wait for poll_interval_seconds"
end
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>ThreadPoolExecutor: "shutdown(cancel_futures=True)"
DCGMWatcher-->>User: "Exit monitoring loop"
```
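As a rough Python sketch, the loop in the diagram could be structured like this; the helper names are taken from the reviews and the diagram, while the signatures and the callback argument are assumptions:

```python
import threading

def monitor_loop(watcher, exit_event: threading.Event,
                 poll_interval_seconds: float = 30.0):
    # Illustrative version of the loop in the diagram above; helper
    # names follow the reviews, signatures are assumed for the sketch.
    while not exit_event.is_set():
        if watcher.dcgm_handle is None:
            try:
                # Create handle, group, health watches, serial lookup.
                watcher._setup_dcgm()
            except Exception:
                # Initialization failed: shut down, reset state, and
                # notify subscribers so the next iteration retries.
                watcher._cleanup_dcgm_resources()
                watcher._fire_callback_funcs("dcgm_connectivity_failed")
        else:
            health_details, connectivity_ok = watcher._perform_health_check()
            if connectivity_ok:
                watcher._fire_callback_funcs("health_event_occurred")
            else:
                # Connectivity lost: the cleanup added by this PR makes
                # `dcgm_handle is None` true again for re-initialization.
                watcher._cleanup_dcgm_resources()
        exit_event.wait(poll_interval_seconds)
    # Final cleanup on exit, matching the end of the diagram.
    watcher._cleanup_dcgm_resources()
```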
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This PR fixes a DCGM handle leak in the gpu-health-monitor component that occurred when monitor pods were restarted. The core issue was that DCGM handle objects were being deleted in Python without calling the underlying Shutdown() method, causing DCGM's internal state to retain references to stale handles. The fix adds explicit dcgm_handle.Shutdown() calls before deletion and ensures all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are consistently reset after cleanup on both initialization and connectivity failure paths. This integrates with the existing _cleanup_dcgm_resources() helper method in the DCGM watcher module, which manages the lifecycle of DCGM group objects.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit DCGM handle shutdown and state reset to fix resource leak on pod restart |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- The changes follow a well-established cleanup pattern, add proper resource lifecycle management with explicit `Shutdown()` calls, and include state resets on all failure paths; the fix has been validated through 100 iterations of pod deletion/recreation
- No files require special attention
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. When the monitor encountered failures and attempted to recreate DCGM connections, old handles were not being properly released, causing DCGM to retain stale references that prevented new handle creation. The fix adds explicit Shutdown() calls on handles before deletion and resets all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after cleanup. This ensures clean recovery from transient failures during both initialization and connectivity checks. The change integrates with the existing _cleanup_dcgm_resources helper method and follows the established error-handling patterns in health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds DCGM handle cleanup via Shutdown() and state variable resets on initialization and connectivity failures |
Confidence score: 5/5
- This PR is safe to merge with minimal risk and fixes a critical reliability issue in production.
- Score reflects a targeted fix with clear before/after behavior, successful manual validation (100 pod deletion iterations), and no breaking changes—the fix only adds proper cleanup where it was missing.
- No files require special attention; the change is well-scoped to error-handling paths that were previously incomplete.
1 file reviewed, no comments
/ok to test 6d40f23
Merging this branch changes the coverage (1 decrease, 1 increase)
Coverage by file: changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Greptile Overview
Greptile Summary
This PR addresses a resource leak in the GPU health monitor's DCGM integration by ensuring proper handle cleanup on failure paths. Previously, when the monitor encountered connectivity issues or pod restarts, DCGM handles were deleted via Python's garbage collector without explicitly calling Shutdown(), causing DCGM's internal tracking to retain them as active resources. The fix adds explicit Shutdown() calls before handle deletion and resets all monitoring state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after failures. This ensures the monitoring loop can cleanly re-initialize DCGM resources on the next iteration rather than accumulating zombie handles across restarts.
PR Description Notes:
- Minor typo: "dgcm handle" should be "DCGM handle" in the testing description
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added explicit Shutdown() call before handle deletion and state-reset logic after connectivity failures to prevent DCGM handle leaks |
Confidence score: 4/5
- This PR is safe to merge with low risk; it fixes a clear resource leak with a straightforward solution
- Score reflects that the fix follows best practices for resource cleanup and the manual testing demonstrates effectiveness (100 iterations), though the absence of automated tests for this specific failure path and the lack of error handling around the `Shutdown()` call itself prevent a perfect score
- Pay close attention to health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py lines 263-264 and 284-299 to verify the cleanup is comprehensive for all failure scenarios
1 file reviewed, no comments
/ok to test b5fdae2
Greptile Overview
Greptile Summary
This PR fixes a critical DCGM handle leak in the GPU health monitor by adding explicit Shutdown() calls before deleting DCGM handles and consistently resetting state variables after cleanup. Previously, when the monitor encountered initialization or connectivity failures, it would create new DCGM handles without properly releasing old ones, causing DCGM's internal handle tracking to accumulate orphaned handles. The fix integrates with the existing retry logic (the `while dcgm_handle is None` loop on line 277) by ensuring all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are reset to their initial values, which is critical because the loop condition checks `dcgm_handle is None` to determine whether re-initialization is needed.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit Shutdown() calls before deleting DCGM handles and resets all state variables to fix resource leaks in error paths |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- The fix addresses a well-understood resource leak pattern in the DCGM Python bindings where explicit `Shutdown()` is required before handle deletion; the state variable resets ensure the retry loop behaves correctly, and the manual testing with 100 pod deletion iterations provides strong empirical evidence of effectiveness
- No files require special attention
1 file reviewed, no comments
/ok to test a4fa913
Greptile Overview
Greptile Summary
This PR addresses a DCGM handle leak in the GPU health monitor by adding proper resource cleanup on shutdown and failure paths. Previously, when pods were restarted or connectivity failed, DCGM handles were abandoned without explicit release, causing DCGM to retain references and eventually exhaust available handles. The fix adds Shutdown() calls before handle deletion and comprehensive state resets, ensuring clean re-initialization on subsequent attempts. This integrates with the existing error handling pattern where dcgm_handle is None signals the need for re-initialization, and aligns with the monitor's retry logic that attempts reconnection every 30 seconds.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup function and state reset after initialization/connectivity failures |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- Score reflects straightforward resource cleanup logic with clear ownership semantics, validated through 100 pod restart iterations showing consistent handle recreation without exhaustion
- No files require special attention
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This review covers the entire PR from start to finish.
This PR fixes a critical resource leak in the GPU health monitor's DCGM (NVIDIA Data Center GPU Manager) integration. Previously, when the monitor encountered failures during initialization or runtime, DCGM handles were not properly released, causing handle accumulation that prevented new handle creation. The fix adds three key changes: (1) explicit Shutdown() calls before deleting DCGM handles in _cleanup_dcgm_resources(), (2) state variable resets (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after cleanup in the initialization error path, and (3) a missing _cleanup_dcgm_resources() call when connectivity failures are detected during health checks. This ensures consistent cleanup behavior across all failure paths—the monitor's main loop relies on dcgm_handle is None checks to trigger re-initialization, so both DCGM-level cleanup and Python variable resets are necessary. The changes integrate cleanly with the existing retry logic in the health monitor's main loop, which already handles re-initialization when handles are null.
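Given that earlier reviews note the lack of automated tests for this failure path, a regression test along the following lines could cover it. This is a hypothetical sketch: it assumes the `DcgmWatcher` class name and a no-argument constructor, which may not match dcgm.py exactly:

```python
from unittest import mock

def test_cleanup_shuts_down_handle_and_resets_state():
    # Hypothetical test for the previously untested failure path:
    # cleanup must call Shutdown() on the handle and reset all state.
    watcher = DcgmWatcher()  # assumed constructor signature
    watcher.dcgm_handle = mock.Mock()
    watcher.dcgm_group = mock.Mock()
    watcher.gpu_ids = [0, 1]
    watcher.gpu_serials = {0: "SN0", 1: "SN1"}

    handle = watcher.dcgm_handle
    watcher._cleanup_dcgm_resources()

    handle.Shutdown.assert_called_once()
    assert watcher.dcgm_handle is None
    assert watcher.dcgm_group is None
    assert watcher.gpu_ids == []
    assert watcher.gpu_serials == {}
```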
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup method and state resets in two failure paths (initialization error and connectivity loss) to prevent DCGM handle accumulation |
Confidence score: 5/5
- This PR is safe to merge with minimal risk—the changes address a well-defined resource leak with a straightforward fix pattern applied consistently across failure paths.
- Score reflects the focused nature of the fix (three small additions), clear validation (100 pod deletion cycles), and low risk of regression—the changes only affect error paths and add proper cleanup that was missing.
- No files require special attention—the single file changed contains defensive additions that improve resource management without altering the happy path logic.
1 file reviewed, no comments
@coderabbitai full review
✅ Actions performed: full review triggered.
/ok to test e78bc54
Merging this branch will decrease overall coverage
Coverage by file: changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Summary
Clean up older DCGM handles that are no longer used in gpu-health-monitor: whenever a new handle is created, the older handle is not garbage collected, so DCGM still considers those handles valid.
Type of Change
Component(s) Affected
Testing
Checklist
The fix works as expected: across 100 iterations of deleting the gpu-health-monitor pod, the DCGM handle was recreated successfully every time.