
gpu-operator with MIG won't work if GPU Node is deleted from cluster, reprovisioned, and then re-joined with the same name #873


Description

@rpardini

It seems the operator stores state, either in memory or in the ClusterPolicy CRD, in such a way that deleting a GPU Node and re-joining it with the same name doesn't work. One has to manually label the Node with nvidia.com/gpu.deploy.mig-manager=true (see the sketch below), otherwise the operator appears to skip MIG configuration entirely and things fail.
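For reference, a minimal sketch of the manual workaround, assuming a hypothetical node name my-gpu-node, the default gpu-operator namespace, and an example MIG profile name (all-1g.10gb) from the default mig-parted config:

```bash
# Re-apply the deploy label the operator normally manages itself, so that
# mig-manager gets scheduled onto the re-joined node again:
kubectl label node my-gpu-node nvidia.com/gpu.deploy.mig-manager=true --overwrite

# The desired MIG layout may also need to be re-stated; "all-1g.10gb" is only
# an example profile name, adjust to whatever the node used before:
kubectl label node my-gpu-node nvidia.com/mig.config=all-1g.10gb --overwrite

# Watch the mig-manager and validator pods to confirm MIG configuration resumes
# (namespace is an assumption; use whichever the operator is installed in):
kubectl -n gpu-operator get pods -o wide | grep -E 'mig-manager|validator'
```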

The main symptom is that the validation pod fails with Failed to allocate device vector A (error code initialization error) -- which is a red herring.

What am I missing? Or is this just an unsupported scenario?

Thanks
