Description
Originally worded by @klueska:
There are a few places in the k8s-dra-driver where stale state needs to be cleaned up when claims are force-deleted:
1. Allocated GPUs tracked in the checkpoint file need to be removed and added back to the pool of allocatable GPUs.
2. Any MPS daemons associated with the claim need to be stopped and their state cleaned up.
For the second item (MPS daemons), it may be possible to add the claim itself as an owner reference to the deployment used to launch a given MPS server. That way, the deployment will be deleted automatically as soon as the claim is deleted. However, the additional state in the pipe, log, and shm directories of the MPS server will still need to be cleaned up somehow.
I imagine we can just start a reconciliation loop that ensures that every claim referenced in the checkpoint file or in the state for an MPS server still exists. If a claim no longer exists, we trigger the respective cleanup operations.
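To make the idea concrete, here is a rough sketch of what such a reconciliation pass could look like. Everything in it is an assumption for illustration: the `State` type, its fields, `reconcileOnce`, and `runLoop` are hypothetical names, not the driver's actual types, and the in-memory maps merely stand in for the checkpoint file and the MPS server state.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// State is a hypothetical in-memory stand-in for the checkpoint file and
// the per-claim MPS server state. The real driver's types will differ.
type State struct {
	AllocatedGPUs map[string][]string // claim UID -> allocated GPU IDs
	MPSDaemons    map[string]bool     // claim UIDs that have MPS daemon state
	FreeGPUs      []string            // pool of allocatable GPUs
}

// reconcileOnce removes state for every claim that is referenced locally
// but missing from the set of existing claims. It returns the sorted UIDs
// of the claims it cleaned up.
func (s *State) reconcileOnce(existing map[string]bool) []string {
	stale := map[string]bool{}
	for uid, gpus := range s.AllocatedGPUs {
		if existing[uid] {
			continue
		}
		// Return the GPUs to the allocatable pool and drop the entry.
		s.FreeGPUs = append(s.FreeGPUs, gpus...)
		delete(s.AllocatedGPUs, uid)
		stale[uid] = true
	}
	for uid := range s.MPSDaemons {
		if existing[uid] {
			continue
		}
		// A real driver would stop the daemon here and remove its
		// pipe, log, and shm directories.
		delete(s.MPSDaemons, uid)
		stale[uid] = true
	}
	cleaned := make([]string, 0, len(stale))
	for uid := range stale {
		cleaned = append(cleaned, uid)
	}
	sort.Strings(cleaned)
	return cleaned
}

// runLoop calls reconcileOnce on a fixed interval until ctx is cancelled.
// list is a hypothetical callback that returns the UIDs of claims that
// still exist. A real driver would also need to lock State against
// concurrent prepare/unprepare calls; that is omitted here.
func runLoop(ctx context.Context, interval time.Duration, s *State, list func() (map[string]bool, error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			existing, err := list()
			if err != nil {
				continue // never clean up based on a failed listing
			}
			s.reconcileOnce(existing)
		}
	}
}

func main() {
	s := &State{
		AllocatedGPUs: map[string][]string{"claim-a": {"gpu-0"}, "claim-b": {"gpu-1"}},
		MPSDaemons:    map[string]bool{"claim-b": true, "claim-c": true},
	}
	// Only claim-a still exists; claim-b and claim-c are stale.
	cleaned := s.reconcileOnce(map[string]bool{"claim-a": true})
	fmt.Println(cleaned, s.FreeGPUs)
}
```

Skipping the pass when the listing fails is deliberate: treating an incomplete list as authoritative would tear down state for claims that are still live.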
internal ref: cnt/issues/89