fix: resource cleanup and state reset for dcgm handle after failures … #76
Conversation
/ok to test 9006aff
9006aff to 57dd3c7 (Compare)
57dd3c7 to 8dc47f4 (Compare)
please rebase
8dc47f4 to 87b0b34 (Compare)
Done
Can you try deleting DCGM, waiting for the connectivity check to fail and then reenable DCGM? I think that is another way in which this issue can be reproduced.
…or connectivity issues
Signed-off-by: Nitin Jain <[email protected]>
87b0b34 to fd9f319 (Compare)
Greptile Overview
Greptile Summary
This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. The issue occurred when the gpu-health-monitor pod restarted or experienced connectivity failures—old DCGM handles were not properly released, causing DCGM to maintain stale references that blocked new handle creation. The fix adds an explicit dcgm_handle.Shutdown() call before deleting handles (ensuring the DCGM library releases internal resources) and resets all related state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after initialization or connectivity failures. This ensures clean retry attempts by preventing the code from operating with partially initialized state. The change integrates with the existing retry loop in the _setup_dcgm method, which checks if dcgm_handle is None to decide whether reinitialization is needed.
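As a concrete illustration of that pattern, here is a minimal sketch; the attribute and method names (`dcgm_handle`, `dcgm_group`, `gpu_ids`, `gpu_serials`, `_cleanup_dcgm_resources`) follow the summary above, but the bodies are assumptions rather than the PR's actual code:

```python
class DcgmWatcher:
    """Illustrative stand-in for the watcher class described above."""

    def __init__(self):
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}

    def _cleanup_dcgm_resources(self):
        # Drop the group first, then shut the handle down so the DCGM
        # library releases its internal reference before the Python
        # object is deleted: the core of the fix described above.
        if self.dcgm_group is not None:
            del self.dcgm_group
        if self.dcgm_handle is not None:
            self.dcgm_handle.Shutdown()
            del self.dcgm_handle
        # Reset every state variable so the retry loop's
        # `dcgm_handle is None` check triggers re-initialization.
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}
```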
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added Shutdown() call before handle deletion and full state reset after initialization/connectivity failures to prevent DCGM handle leaks |
Confidence score: 4/5
- This PR is safe to merge with low risk of production issues
- Score reflects thorough manual testing (100 iterations) and a straightforward fix to a well-isolated resource leak, though the lack of automated tests for this specific failure path and the addition of multiple state resets in error handling paths introduce minor risk
- The `_cleanup_dcgm_resources` method (lines 263-264) requires close attention to ensure the shutdown order is correct and that no exceptions during shutdown could leave partial state; one possible mitigation is sketched below
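A hedged sketch of that mitigation, assuming the attribute names from the PR; this is not the PR's actual code. Wrapping the teardown in `try`/`finally` guarantees the state reset even if `Shutdown()` raises:

```python
def _cleanup_dcgm_resources(self):
    # Hypothetical hardening of the cleanup path: even if Shutdown()
    # raises, the finally block still clears the stale references, so
    # a failed teardown cannot leave the watcher with partial state.
    try:
        if self.dcgm_group is not None:
            del self.dcgm_group
        if self.dcgm_handle is not None:
            self.dcgm_handle.Shutdown()
    finally:
        self.dcgm_handle = None
        self.dcgm_group = None
        self.gpu_ids = []
        self.gpu_serials = {}
```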
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant DCGMWatcher
participant ThreadPoolExecutor
participant DCGM as "DCGM (pydcgm)"
participant Callbacks as "Callback Functions"
User->>DCGMWatcher: "start(fields_to_monitor, exit_event)"
loop Until exit event is set
DCGMWatcher->>DCGMWatcher: "Check if dcgm_handle is None"
alt dcgm_handle is None
DCGMWatcher->>DCGM: "_get_dcgm_handle()"
alt Handle creation successful
DCGM-->>DCGMWatcher: "Return dcgm_handle"
DCGMWatcher->>DCGM: "_initialize_dcgm_monitoring()"
DCGM->>DCGM: "GetEntityGroupEntities(GPU)"
DCGM->>DCGM: "GetEntityGroupEntities(SWITCH)"
DCGM->>DCGM: "Create DcgmGroup"
DCGM->>DCGM: "Set health watches"
DCGM->>DCGM: "Get GPU serial numbers"
DCGM-->>DCGMWatcher: "Return dcgm_group, gpu_ids, gpu_serials"
else Handle creation failed
DCGM-->>DCGMWatcher: "Raise exception"
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(dcgm_connectivity_failed)"
ThreadPoolExecutor->>Callbacks: "dcgm_connectivity_failed()"
DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
end
else dcgm_handle exists
DCGMWatcher->>DCGM: "_perform_health_check(dcgm_group)"
alt Health check successful
DCGM->>DCGM: "dcgm_group.health.Check()"
DCGM-->>DCGMWatcher: "Return health_details, connectivity_success=True"
DCGMWatcher->>DCGMWatcher: "Process incidents and accumulate failures"
DCGMWatcher->>ThreadPoolExecutor: "_fire_callback_funcs(health_event_occurred)"
ThreadPoolExecutor->>Callbacks: "health_event_occurred(health_status, gpu_ids, gpu_serials)"
else Health check failed (timeout/error)
DCGM-->>DCGMWatcher: "Return empty health_status, connectivity_success=False"
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>DCGM: "Shutdown and delete handle"
DCGMWatcher->>DCGMWatcher: "Reset state (handle, group, ids, serials)"
end
end
DCGMWatcher->>DCGMWatcher: "Wait for poll_interval_seconds"
end
DCGMWatcher->>DCGMWatcher: "_cleanup_dcgm_resources()"
DCGMWatcher->>ThreadPoolExecutor: "shutdown(cancel_futures=True)"
DCGMWatcher-->>User: "Exit monitoring loop"
```
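As a rough Python sketch, the loop in the diagram could be structured like this; the helper names are taken from the reviews and the diagram, while the signatures and the callback argument are assumptions:

```python
import threading

def monitor_loop(watcher, exit_event: threading.Event,
                 poll_interval_seconds: float = 30.0):
    # Illustrative version of the loop in the diagram above; helper
    # names follow the reviews, signatures are assumed for the sketch.
    while not exit_event.is_set():
        if watcher.dcgm_handle is None:
            try:
                # Create handle, group, health watches, serial lookup.
                watcher._setup_dcgm()
            except Exception:
                # Initialization failed: shut down, reset state, and
                # notify subscribers so the next iteration retries.
                watcher._cleanup_dcgm_resources()
                watcher._fire_callback_funcs("dcgm_connectivity_failed")
        else:
            health_details, connectivity_ok = watcher._perform_health_check()
            if connectivity_ok:
                watcher._fire_callback_funcs("health_event_occurred")
            else:
                # Connectivity lost: the cleanup added by this PR makes
                # `dcgm_handle is None` true again for re-initialization.
                watcher._cleanup_dcgm_resources()
        exit_event.wait(poll_interval_seconds)
    # Final cleanup on exit, matching the end of the diagram.
    watcher._cleanup_dcgm_resources()
```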
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This PR fixes a DCGM handle leak in the gpu-health-monitor component that occurred when monitor pods were restarted. The core issue was that DCGM handle objects were being deleted in Python without calling the underlying Shutdown() method, causing DCGM's internal state to retain references to stale handles. The fix adds explicit dcgm_handle.Shutdown() calls before deletion and ensures all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are consistently reset after cleanup on both initialization and connectivity failure paths. This integrates with the existing _cleanup_dcgm_resources() helper method in the DCGM watcher module, which manages the lifecycle of DCGM group objects.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit DCGM handle shutdown and state reset to fix resource leak on pod restart |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- The changes follow a well-established cleanup pattern, add proper resource lifecycle management with explicit `Shutdown()` calls, and include state resets on all failure paths; the fix has been validated through 100 iterations of pod deletion/recreation
- No files require special attention
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. This PR fixes a critical resource leak in the GPU health monitor's DCGM (Data Center GPU Manager) handle management. When the monitor encountered failures and attempted to recreate DCGM connections, old handles were not being properly released, causing DCGM to retain stale references that prevented new handle creation. The fix adds explicit Shutdown() calls on handles before deletion and resets all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after cleanup. This ensures clean recovery from transient failures during both initialization and connectivity checks. The change integrates with the existing _cleanup_dcgm_resources helper method and follows the established error-handling patterns in health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds DCGM handle cleanup via Shutdown() and state variable resets on initialization and connectivity failures |
Confidence score: 5/5
- This PR is safe to merge with minimal risk and fixes a critical reliability issue in production.
- Score reflects a targeted fix with clear before/after behavior, successful manual validation (100 pod deletion iterations), and no breaking changes—the fix only adds proper cleanup where it was missing.
- No files require special attention; the change is well-scoped to error-handling paths that were previously incomplete.
1 file reviewed, no comments
/ok to test 6d40f23
Merging this branch changes the coverage (1 decrease, 1 increase)
Coverage by file: changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Greptile Overview
Greptile Summary
This PR addresses a resource leak in the GPU health monitor's DCGM integration by ensuring proper handle cleanup on failure paths. Previously, when the monitor encountered connectivity issues or pod restarts, DCGM handles were deleted via Python's garbage collector without explicitly calling Shutdown(), causing DCGM's internal tracking to retain them as active resources. The fix adds explicit Shutdown() calls before handle deletion and resets all monitoring state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) to None/empty after failures. This ensures the monitoring loop can cleanly re-initialize DCGM resources on the next iteration rather than accumulating zombie handles across restarts.
PR Description Notes:
- Minor typo: "dgcm handle" should be "DCGM handle" in the testing description
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 4/5 | Added explicit Shutdown() call before handle deletion and state-reset logic after connectivity failures to prevent DCGM handle leaks |
Confidence score: 4/5
- This PR is safe to merge with low risk; it fixes a clear resource leak with a straightforward solution
- Score reflects that the fix follows best practices for resource cleanup and the manual testing demonstrates effectiveness (100 iterations), though the absence of automated tests for this specific failure path and the lack of error handling around the `Shutdown()` call itself prevent a perfect score
- Pay close attention to health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py lines 263-264 and 284-299 to verify the cleanup is comprehensive for all failure scenarios
1 file reviewed, no comments
/ok to test b5fdae2
Greptile Overview
Greptile Summary
This PR fixes a critical DCGM handle leak in the GPU health monitor by adding explicit Shutdown() calls before deleting DCGM handles and consistently resetting state variables after cleanup. Previously, when the monitor encountered initialization or connectivity failures, it would create new DCGM handles without properly releasing old ones, causing DCGM's internal handle tracking to accumulate orphaned handles. The fix integrates with the existing retry logic (the `while dcgm_handle is None` loop on line 277) by ensuring all state variables (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) are reset to their initial values, which is critical because the loop condition checks `dcgm_handle is None` to determine whether re-initialization is needed.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Adds explicit Shutdown() calls before deleting DCGM handles and resets all state variables to fix resource leaks in error paths |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- The fix addresses a well-understood resource leak pattern in the DCGM Python bindings where explicit `Shutdown()` is required before handle deletion; the state variable resets ensure the retry loop behaves correctly, and the manual testing with 100 pod deletion iterations provides strong empirical evidence of effectiveness
- No files require special attention
1 file reviewed, no comments
/ok to test a4fa913
Greptile Overview
Greptile Summary
This PR addresses a DCGM handle leak in the GPU health monitor by adding proper resource cleanup on shutdown and failure paths. Previously, when pods were restarted or connectivity failed, DCGM handles were abandoned without explicit release, causing DCGM to retain references and eventually exhaust available handles. The fix adds Shutdown() calls before handle deletion and comprehensive state resets, ensuring clean re-initialization on subsequent attempts. This integrates with the existing error handling pattern where dcgm_handle is None signals the need for re-initialization, and aligns with the monitor's retry logic that attempts reconnection every 30 seconds.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup function and state reset after initialization/connectivity failures |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- Score reflects straightforward resource cleanup logic with clear ownership semantics, validated through 100 pod restart iterations showing consistent handle recreation without exhaustion
- No files require special attention
1 file reviewed, no comments
Greptile Overview
Greptile Summary
This review covers the entire PR from start to finish.
This PR fixes a critical resource leak in the GPU health monitor's DCGM (NVIDIA Data Center GPU Manager) integration. Previously, when the monitor encountered failures during initialization or runtime, DCGM handles were not properly released, causing handle accumulation that prevented new handle creation. The fix adds three key changes: (1) explicit Shutdown() calls before deleting DCGM handles in _cleanup_dcgm_resources(), (2) state variable resets (dcgm_handle, dcgm_group, gpu_ids, gpu_serials) after cleanup in the initialization error path, and (3) a missing _cleanup_dcgm_resources() call when connectivity failures are detected during health checks. This ensures consistent cleanup behavior across all failure paths—the monitor's main loop relies on dcgm_handle is None checks to trigger re-initialization, so both DCGM-level cleanup and Python variable resets are necessary. The changes integrate cleanly with the existing retry logic in the health monitor's main loop, which already handles re-initialization when handles are null.
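Given that earlier reviews note the lack of automated tests for this failure path, a regression test along the following lines could cover it. This is a hypothetical sketch: it assumes the `DcgmWatcher` class name and a no-argument constructor, which may not match dcgm.py exactly:

```python
from unittest import mock

def test_cleanup_shuts_down_handle_and_resets_state():
    # Hypothetical test for the previously untested failure path:
    # cleanup must call Shutdown() on the handle and reset all state.
    watcher = DcgmWatcher()  # assumed constructor signature
    watcher.dcgm_handle = mock.Mock()
    watcher.dcgm_group = mock.Mock()
    watcher.gpu_ids = [0, 1]
    watcher.gpu_serials = {0: "SN0", 1: "SN1"}

    handle = watcher.dcgm_handle
    watcher._cleanup_dcgm_resources()

    handle.Shutdown.assert_called_once()
    assert watcher.dcgm_handle is None
    assert watcher.dcgm_group is None
    assert watcher.gpu_ids == []
    assert watcher.gpu_serials == {}
```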
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| health-monitors/gpu-health-monitor/gpu_health_monitor/dcgm_watcher/dcgm.py | 5/5 | Added explicit Shutdown() call in cleanup method and state resets in two failure paths (initialization error and connectivity loss) to prevent DCGM handle accumulation |
Confidence score: 5/5
- This PR is safe to merge with minimal risk—the changes address a well-defined resource leak with a straightforward fix pattern applied consistently across failure paths.
- Score reflects the focused nature of the fix (three small additions), clear validation (100 pod deletion cycles), and low risk of regression—the changes only affect error paths and add proper cleanup that was missing.
- No files require special attention—the single file changed contains defensive additions that improve resource management without altering the happy path logic.
1 file reviewed, no comments
@coderabbitai full review
✅ Actions performed: full review triggered.
/ok to test e78bc54
Merging this branch will decrease overall coverage
Coverage by file: changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Summary
Clean up older DCGM handles that are no longer used in gpu-health-monitor: whenever a new handle is created, the older handle is not garbage collected, so DCGM still considers those handles valid.
Type of Change
Component(s) Affected
Testing
Checklist
The fix works as expected: across 100 iterations of deleting the gpu-health-monitor pod, the DCGM handle was recreated successfully every time.