GPUs: remove state associated with deleted ResourceClaims

Originally worded by @klueska:


> There are a few places in the k8s-dra-driver where stale state needs to be cleaned up when claims are force-deleted:
>
> 1. Allocated GPUs tracked in the checkpoint file need to be removed from it and returned to the pool of allocatable GPUs.
> 2. Any MPS daemons associated with the claim need to be stopped and their state cleaned up.
>
> For 2, it may be possible to add the claim itself as an owner reference to the deployment used to launch a given MPS server. That way, the deployment will be automatically deleted as soon as the claim gets deleted. However, the additional state in the pipe, log, and shm directories of the MPS server will still need to be cleaned up somehow.
> 
> I imagine we can just start a reconciliation loop that ensures that all claims referenced in the checkpoint file or in the MPS server state still exist. If they do not, we trigger the respective cleanup operations.
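
To make the proposal concrete, here is a minimal sketch of one reconciliation pass. It is not the driver's actual code: `State`, `ClaimUID`, `Reconcile`, and the `claimExists` and `stopMPS` callbacks are hypothetical names, and the checkpoint and MPS state are modelled as plain in-memory maps. In a real driver, `claimExists` would presumably query the API server or an informer cache, and `stopMPS` would stop the daemon and remove its pipe, log, and shm directories.

```go
package main

import (
	"fmt"
	"sort"
)

// ClaimUID identifies a ResourceClaim (hypothetical type for this sketch).
type ClaimUID string

// State is a hypothetical in-memory stand-in for the node-local driver state.
type State struct {
	// Checkpoint maps a claim to the GPUs allocated to it.
	Checkpoint map[ClaimUID][]string
	// MPSDaemons maps a claim to the root of its MPS pipe/log/shm directories.
	MPSDaemons map[ClaimUID]string
	// FreeGPUs is the pool of allocatable GPUs.
	FreeGPUs []string
}

// Reconcile runs one pass: every claim referenced in the checkpoint or in the
// MPS state is checked with claimExists, and state for missing claims is
// cleaned up. It returns the cleaned-up claims in sorted order.
func Reconcile(
	s *State,
	claimExists func(ClaimUID) bool,
	stopMPS func(uid ClaimUID, dir string) error,
) ([]ClaimUID, error) {
	// Collect stale claims from both sources of state.
	stale := map[ClaimUID]struct{}{}
	for uid := range s.Checkpoint {
		if !claimExists(uid) {
			stale[uid] = struct{}{}
		}
	}
	for uid := range s.MPSDaemons {
		if !claimExists(uid) {
			stale[uid] = struct{}{}
		}
	}

	ordered := make([]ClaimUID, 0, len(stale))
	for uid := range stale {
		ordered = append(ordered, uid)
	}
	sort.Slice(ordered, func(i, j int) bool { return ordered[i] < ordered[j] })

	for i, uid := range ordered {
		// Stop the MPS daemon first so nothing still uses the GPUs.
		if dir, ok := s.MPSDaemons[uid]; ok {
			if err := stopMPS(uid, dir); err != nil {
				return ordered[:i], fmt.Errorf("stop MPS daemon for claim %s: %w", uid, err)
			}
			delete(s.MPSDaemons, uid)
		}
		// Return the claim's GPUs to the allocatable pool.
		if gpus, ok := s.Checkpoint[uid]; ok {
			s.FreeGPUs = append(s.FreeGPUs, gpus...)
			delete(s.Checkpoint, uid)
		}
	}
	return ordered, nil
}

func main() {
	s := &State{
		Checkpoint: map[ClaimUID][]string{"claim-a": {"GPU-0"}, "claim-b": {"GPU-1"}},
		MPSDaemons: map[ClaimUID]string{"claim-b": "/tmp/mps/claim-b"},
	}
	live := map[ClaimUID]bool{"claim-a": true} // claim-b was force-deleted

	cleaned, err := Reconcile(s,
		func(uid ClaimUID) bool { return live[uid] },
		func(ClaimUID, string) error { return nil }, // no-op stand-in
	)
	fmt.Println(cleaned, err, s.FreeGPUs) // prints: [claim-b] <nil> [GPU-1]
}
```

A loop would call `Reconcile` periodically, for example from a `time.Ticker`. Because each pass only removes state for claims that no longer exist, repeating it is harmless, and a pass that fails partway can simply be retried on the next tick.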

internal ref: cnt/issues/89

GPUs: remove state associated with deleted ResourceClaims #365