Refactor checkHealth function #1508
base: main
541c6cb to
44d450f
internal/plugin/server.go
	}()

	// Start recovery worker to detect when unhealthy devices become healthy
	go plugin.runRecoveryWorker()
Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?
Sounds like a good idea, even more so given your other comment #1508 (review).
I wanted a refactor, but that interface is a different conversation. I'm going to work on splitting this PR.
elezar
left a comment
In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated, and we would develop against this interface.
As I discussed in NVIDIA/k8s-dra-driver-gpu#689, I would expect this interface to look something like:
type DeviceHealthCheckProvider interface {
	Start(context.Context) error
	Stop()
	Health() <-chan Device
}
(alternatively one could split the Health channel into Healthy() and Unhealthy()).
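To make the Healthy()/Unhealthy() alternative concrete, here is a rough sketch of how a single health channel could be fanned out into two. Everything in it is an assumption for illustration: the `Device` struct with a `Healthy` field and the `splitHealth` helper are hypothetical names and do not come from this PR or from k8s-dra-driver-gpu.

```go
package main

import "fmt"

// Device is a hypothetical stand-in for the real device type.
// The Healthy field is an assumption made for this sketch.
type Device struct {
	ID      string
	Healthy bool
}

// splitHealth (hypothetical helper) routes each update from a single
// health channel onto either a healthy or an unhealthy channel.
// Both output channels are closed once the input channel is closed.
func splitHealth(in <-chan Device) (<-chan Device, <-chan Device) {
	healthy := make(chan Device)
	unhealthy := make(chan Device)
	go func() {
		defer close(healthy)
		defer close(unhealthy)
		for d := range in {
			if d.Healthy {
				healthy <- d
			} else {
				unhealthy <- d
			}
		}
	}()
	return healthy, unhealthy
}

func main() {
	in := make(chan Device)
	go func() {
		defer close(in)
		in <- Device{ID: "gpu-0", Healthy: true}
		in <- Device{ID: "gpu-1", Healthy: false}
	}()

	healthy, unhealthy := splitHealth(in)
	// Drain both channels; a nil channel is never selected, so setting a
	// closed channel to nil removes it from the select.
	for healthy != nil || unhealthy != nil {
		select {
		case d, ok := <-healthy:
			if !ok {
				healthy = nil
				continue
			}
			fmt.Println("healthy:", d.ID)
		case d, ok := <-unhealthy:
			if !ok {
				unhealthy = nil
				continue
			}
			fmt.Println("unhealthy:", d.ID)
		}
	}
}
```

Because the output channels are unbuffered, a consumer has to read from both of them (as the select loop does), otherwise the routing goroutine stalls. That trade-off is one reason a single Health() channel may be the simpler contract.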
Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.

- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices

This refactoring improves code modularity and testability without changing existing behavior. Prepares foundation for future device recovery features.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.