@lokielse

Summary

This PR improves security by restricting the GPU Operator's ClusterRole permissions to only the specific ClusterRoles and ClusterRoleBindings it manages, following the principle of least privilege.

Problem

Previously, the GPU Operator had unrestricted permissions to create, read, update, and delete any ClusterRole or ClusterRoleBinding in the entire Kubernetes cluster:

```yaml
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
```

This violates the principle of least privilege and poses security risks:

  • The operator could modify critical RBAC resources it doesn't own
  • If compromised, the operator could escalate privileges or tamper with cluster security
  • Unnecessarily broad permissions increase the blast radius of potential security incidents

Solution

The permissions have been split into two RBAC rules:

  1. Rule 1: Allows creating new ClusterRoles/ClusterRoleBindings (without a resourceNames restriction, since Kubernetes cannot enforce resourceNames on the create verb: the new object's name may not be known at authorization time)
  2. Rule 2: Restricts get, update, patch, and delete operations to only the 14 specific resources managed by the GPU Operator using the resourceNames field
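A sketch of what the two rules might look like in clusterrole.yaml, assuming list and watch stay in the first rule (resourceNames cannot constrain collection reads either); the actual template in the PR may differ in layout:

```yaml
# Rule 1: create and collection reads -- cannot be scoped by resourceNames
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  verbs:
  - create
  - list
  - watch
# Rule 2: object-level verbs restricted to the operator's managed resources
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterroles
  - clusterrolebindings
  resourceNames:
  - nvidia-cc-manager
  - nvidia-device-plugin
  - nvidia-device-plugin-mps-control-daemon
  - nvidia-driver
  - nvidia-gpu-feature-discovery
  - nvidia-kata-manager
  - nvidia-mig-manager
  - nvidia-node-status-exporter
  - nvidia-operator-validator
  - nvidia-sandbox-device-plugin
  - nvidia-sandbox-validator
  - nvidia-vfio-manager
  - nvidia-vgpu-device-manager
  - nvidia-vgpu-manager
  verbs:
  - get
  - update
  - patch
  - delete
```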

Resources managed by GPU Operator:

  • nvidia-cc-manager
  • nvidia-device-plugin
  • nvidia-device-plugin-mps-control-daemon
  • nvidia-driver
  • nvidia-gpu-feature-discovery
  • nvidia-kata-manager
  • nvidia-mig-manager
  • nvidia-node-status-exporter
  • nvidia-operator-validator
  • nvidia-sandbox-device-plugin
  • nvidia-sandbox-validator
  • nvidia-vfio-manager
  • nvidia-vgpu-device-manager
  • nvidia-vgpu-manager

Changes

File: deployments/gpu-operator/templates/clusterrole.yaml

  • Split the RBAC rule for ClusterRoles and ClusterRoleBindings into two separate rules
  • Added resourceNames constraint to get, update, patch, and delete verbs
  • Added comments explaining the security improvement and the split-rule pattern

Security Benefits

  1. Prevents privilege escalation: The operator can no longer modify existing ClusterRoles/ClusterRoleBindings it doesn't own
  2. Limits blast radius: Reduces the impact if the operator is compromised
  3. Follows least privilege: Operator only has permissions for resources it actually manages
  4. Maintains functionality: The operator can still perform all necessary operations on its managed resources

Testing

  • YAML syntax validated
  • Verified all managed resource names are included in the resourceNames list
  • Code analysis confirms the operator only manages the listed resources

Implementation Notes

The permission split (create in one rule, modify operations in another with resourceNames) is a standard Kubernetes RBAC pattern because:

  • Kubernetes cannot restrict the create verb by resourceNames, because the new object's name may not be known when the request is authorized
  • This approach still provides significant security improvement by restricting modification of existing resources
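The invariants behind this pattern can be expressed as a small sanity check. The sketch below is hypothetical (it is not part of the PR or the repository); the rule contents mirror the PR text, and `check_split` is an illustrative name:

```python
# Hypothetical sanity check for the split-rule RBAC pattern described above.
# The rule contents mirror the PR text; the script itself is an illustrative
# sketch, not code from the gpu-operator repository.

MANAGED_NAMES = {
    "nvidia-cc-manager",
    "nvidia-device-plugin",
    "nvidia-device-plugin-mps-control-daemon",
    "nvidia-driver",
    "nvidia-gpu-feature-discovery",
    "nvidia-kata-manager",
    "nvidia-mig-manager",
    "nvidia-node-status-exporter",
    "nvidia-operator-validator",
    "nvidia-sandbox-device-plugin",
    "nvidia-sandbox-validator",
    "nvidia-vfio-manager",
    "nvidia-vgpu-device-manager",
    "nvidia-vgpu-manager",
}

# Rule 1: create plus collection reads -- no resourceNames possible here.
create_rule = {
    "apiGroups": ["rbac.authorization.k8s.io"],
    "resources": ["clusterroles", "clusterrolebindings"],
    "verbs": ["create", "list", "watch"],
}

# Rule 2: object-level verbs scoped to the operator's own resources.
restricted_rule = {
    "apiGroups": ["rbac.authorization.k8s.io"],
    "resources": ["clusterroles", "clusterrolebindings"],
    "resourceNames": sorted(MANAGED_NAMES),
    "verbs": ["get", "update", "patch", "delete"],
}

def check_split(create_rule, restricted_rule):
    """Return True if the two rules honor the split-rule pattern."""
    # resourceNames on a create rule is never enforced, so it must be absent.
    assert "resourceNames" not in create_rule
    # No object-level verb on existing resources may appear unrestricted.
    assert not {"get", "update", "patch", "delete"} & set(create_rule["verbs"])
    # The scoped rule must name exactly the 14 managed resources.
    assert set(restricted_rule["resourceNames"]) == MANAGED_NAMES
    assert {"get", "update", "patch", "delete"} <= set(restricted_rule["verbs"])
    return True

print(check_split(create_rule, restricted_rule))  # True
```

A check like this could run in CI against the rendered Helm template so that a new managed resource name cannot be added to the operator without also being added to the restricted rule.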

References

Code locations that manage ClusterRoles/ClusterRoleBindings:

  • controllers/resource_manager.go:133-140 - Loads resources from YAML manifests
  • controllers/object_controls.go:421-505 - Creates/updates/deletes the RBAC resources

@copy-pr-bot

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@lokielse force-pushed the restrict-clusterrole-clusterrolebinding-permission branch 4 times, most recently from aba8e22 to c691251 on November 17, 2025 at 06:06
@lokielse force-pushed the restrict-clusterrole-clusterrolebinding-permission branch from c691251 to 7d21ad2 on November 17, 2025 at 07:53