@ArangoGutierrez
Collaborator

@ArangoGutierrez ArangoGutierrez commented Nov 18, 2025

This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.

@ArangoGutierrez ArangoGutierrez self-assigned this Nov 18, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the gtg branch 2 times, most recently from 541c6cb to 44d450f Compare November 18, 2025 18:13
@ArangoGutierrez ArangoGutierrez added the feature issue/PR that proposes a new feature or functionality label Nov 18, 2025
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review November 18, 2025 18:45
}()

// Start recovery worker to detect when unhealthy devices become healthy
go plugin.runRecoveryWorker()
Member

Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Collaborator Author

Sounds like a good idea, even more so given your other comment, #1508 (review).
I wanted a refactor, but that interface is a different conversation. I'm going to work on splitting this PR.

Member

@elezar elezar left a comment

In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated and we would develop against this interface.

As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:

type DeviceHealthCheckProvider interface {
   Start(context.Context) error
   Stop()
   Health() <-chan Device
}

(alternatively one could split the Health channel into Healthy() and Unhealthy()).

Extract device health checking logic into a dedicated HealthProvider
interface with proper lifecycle management using WaitGroups and context.

- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices

This refactoring improves code modularity and testability without
changing existing behavior. Prepares foundation for future device
recovery features.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>