
gpu-operator with MIG won't work if GPU Node is deleted from cluster, reprovisioned, and then re-joined with the same name #873


Description

@rpardini

It seems the operator stores state, either in memory or in the ClusterPolicy CRD, in such a way that deleting a GPU Node and re-joining it with the same name doesn't work. One has to manually label the Node with nvidia.com/gpu.deploy.mig-manager=true (see the sketch below), otherwise the operator appears to skip MIG configuration entirely and things fail.
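For reference, a minimal sketch of the manual workaround, assuming a hypothetical node name my-gpu-node, the default gpu-operator namespace, and an example MIG profile name (all-1g.10gb) from the default mig-parted config:

```bash
# Re-apply the deploy label the operator normally manages itself, so that
# mig-manager gets scheduled onto the re-joined node again:
kubectl label node my-gpu-node nvidia.com/gpu.deploy.mig-manager=true --overwrite

# The desired MIG layout may also need to be re-stated; "all-1g.10gb" is only
# an example profile name, adjust to whatever the node used before:
kubectl label node my-gpu-node nvidia.com/mig.config=all-1g.10gb --overwrite

# Watch the mig-manager and validator pods to confirm MIG configuration resumes
# (namespace is an assumption; use whichever the operator is installed in):
kubectl -n gpu-operator get pods -o wide | grep -E 'mig-manager|validator'
```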

The main symptom is that the validation pod fails with Failed to allocate device vector A (error code initialization error) -- which is a red herring.

What am I missing? Or is this just an unsupported scenario?

Thanks
