GPUs: remove state associated with deleted ResourceClaims #365

@jgehrcke

Originally worded by @klueska:

There are a few places in the k8s-dra-driver where stale state needs to be cleaned up when claims are force-deleted:

  1. Allocated GPUs tracked in the checkpoint file need to be removed and added back to the pool of allocatable GPUs.
  2. Any MPS daemons associated with the claim need to be stopped and their state cleaned up.

For 2, it may be possible to add the claim itself as an owner reference to the deployment used to launch a given MPS server. That way, the deployment will be deleted automatically as soon as the claim gets deleted. However, the additional state in the pipe, log, and shm directories of the MPS server will still need to be cleaned up somehow.

I imagine we can just start a reconciliation loop that ensures that every claim referenced in the checkpoint file or in the MPS server state still exists. If a claim no longer exists, we trigger the respective cleanup operations.

internal ref: cnt/issues/89

Labels: robustness (issue/pr: edge cases & fault tolerance)

Status: Backlog