Regression on MPS setups following changes in clusterrole #229

@dekonnection

Description

Hello,

After deploying an up-to-date version of the driver through Helm, I encountered this failure when trying to start a pod that uses an MPS claim:

  Warning  FailedPrepareDynamicResources  4s    kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim lab/mps-gpu-7bbd549f7b-z2vgr-mps-gpus-k6kf5: error preparing devices for claim 74689d61-7a6d-43d4-aa60-29c93c7ab7ea: prepare devices failed: error applying GPU config: error starting MPS control daemon: error checking if control daemon already started: failed to get deployment: deployments.apps "mps-control-daemon-74689d61-7a6d-43d4-aa60-29c93c7ab7ea-44f48" is forbidden: User "system:serviceaccount:nvidia-dra:nvidia-dra-k8s-dra-driver-service-account" cannot get resource "deployments" in API group "apps" in the namespace "nvidia-dra"
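For anyone who wants to confirm the same denial without starting a pod, a quick impersonation check against the API server (using the namespace and ServiceAccount names from the event above) should reproduce it:

```shell
# Ask the API server whether the driver's ServiceAccount may "get" Deployments
# in its own namespace; on an affected cluster this prints "no".
kubectl auth can-i get deployments.apps \
  --as=system:serviceaccount:nvidia-dra:nvidia-dra-k8s-dra-driver-service-account \
  -n nvidia-dra
```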

I did not experience this a few weeks ago with an identical setup, so after checking the most recent changes in the Helm templates, I found that the ClusterRole was modified by 4253b44 (part of #219) in a way that prevents the ServiceAccount from managing Deployments.

If I revert this change and update the ClusterRole, everything works as it did before.
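Until the chart is fixed, something like the following should also work as a stop-gap: grant the missing Deployment permissions to the ServiceAccount out of band, which is roughly what restoring the old ClusterRole rule does. This is only a sketch; the role and binding names below are made up for illustration, and the exact verb list needed by the MPS control daemon lifecycle is an assumption.

```shell
# Hypothetical stop-gap: re-grant Deployment permissions to the driver's
# ServiceAccount via a separate ClusterRole and ClusterRoleBinding.
# Names and verbs are assumptions, not taken from the chart.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nvidia-dra-mps-deployments-fix
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nvidia-dra-mps-deployments-fix
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nvidia-dra-mps-deployments-fix
subjects:
  - kind: ServiceAccount
    name: nvidia-dra-k8s-dra-driver-service-account
    namespace: nvidia-dra
EOF
```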
