It seems the operator stores state, either in memory or in the ClusterPolicy CRD, in such a way that deleting a Node and re-joining it doesn't work. One has to manually label the Node with nvidia.com/gpu.deploy.mig-manager=true; otherwise the operator appears to skip the MIG configuration completely and things fail.
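
For anyone hitting the same thing, here is a rough sketch of the manual labeling step as a small Python wrapper around kubectl. It assumes the label key is nvidia.com/gpu.deploy.mig-manager, that kubectl is on PATH with a working kubeconfig, and that the node is called gpu-node-1 (a placeholder, not my real node name). The helper names are made up for illustration.

```python
import subprocess

# Label key as described in this post (assumed spelling); adjust if yours differs.
MIG_MANAGER_LABEL = "nvidia.com/gpu.deploy.mig-manager"


def build_label_command(node_name, value="true"):
    """Return the kubectl argv that sets the MIG manager label on a node."""
    return [
        "kubectl", "label", "node", node_name,
        f"{MIG_MANAGER_LABEL}={value}",
        "--overwrite",  # replace the label if it is already set
    ]


def label_node(node_name):
    """Apply the label; needs kubectl on PATH and a working kubeconfig."""
    subprocess.run(build_label_command(node_name), check=True)


if __name__ == "__main__":
    # "gpu-node-1" is a hypothetical node name; print the command instead of running it.
    print(" ".join(build_label_command("gpu-node-1")))
```

Calling label_node("gpu-node-1") after the node re-joins is what I mean by labeling it manually; whether the operator is supposed to do this on its own is exactly the question.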
The main symptom is that the validation pod fails with "Failed to allocate device vector A (error code initialization error)", which is a red herring.
What am I missing? Or is this just an unsupported scenario?
Thanks