
Unexpected GPU Allocation with NVIDIA_VISIBLE_DEVICES in Kubernetes #951

@qiangyupei

Description


1. Quick Debug Information

  • Kubernetes Version: v1.28
  • GPU Operator Version: v24.6.1

2. Issue Description

The Kubernetes cluster has two worker nodes, each with four A100 GPUs. When deploying a pod, I use the NVIDIA_VISIBLE_DEVICES environment variable to specify which GPU it should use (e.g., "3"), following the instructions in the link. However, when I run kubectl exec -it [pod_name] -- nvidia-smi, it sometimes shows only the specified GPU, but at other times it displays an additional GPU alongside the specified one. The picture below illustrates the result. This is causing problems for me, and I'm wondering whether there might be an issue.

[Screenshot: nvidia-smi output inside the pod, showing an extra GPU in addition to the specified one]
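For context, a minimal sketch of the kind of pod spec I mean is below. The pod name and image tag are illustrative placeholders, not my exact manifest, and it assumes the nodes' NVIDIA container toolkit is configured to honor NVIDIA_VISIBLE_DEVICES for pods that do not request nvidia.com/gpu resources, as described in the linked instructions:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-env-test                 # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Any CUDA base image works here; this tag is just an example.
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "3"                     # expose only GPU index 3 to the container
EOF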

I deploy the GPU Operator with the following command:

helm install gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set mig.strategy=mixed \
    -f gpu-operator-values.yaml \
    --set dcgmExporter.config.name=custom-dcgm-metrics

All of the GPU Operator pods are in the Running state:

[Screenshot: all GPU Operator pods in the Running state]
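The pod status in the screenshot can be checked with a command along these lines (namespace taken from the install command above):

kubectl get pods -n gpu-operator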
