
NVIDIA devices not visible under /dev on AKS GPU nodes with GPU Operator (Pre-compiled driver image) #1741

@schdhry

Description


Environment
• Cluster: AKS 1.31.10
• OS Image: Ubuntu 22.04 (kernel 5.15.0-1092-azure)
• GPU Operator version: v25.3.3 (Helm deployment)
• NVIDIA driver: precompiled driver image (nvidia-driver-570-5.15.0-1092-azure-ubuntu22.04)
• NVIDIA container toolkit: installed by GPU Operator (/usr/local/nvidia/toolkit)

Steps to Reproduce
1. Deploy an AKS cluster with GPU nodepool (Ubuntu 22.04, kernel 5.15.0-1092-azure).
2. Install NVIDIA GPU Operator via Helm (v25.3.3).
3. Confirm the driver DaemonSet installs the precompiled driver package.
4. Deploy a GPU workload (e.g., Triton Inference Server with resources.limits["nvidia.com/gpu"]=1); a minimal reproduction is sketched below.
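
For reference, a minimal sketch of the install and the reproduction workload. The Helm values (driver.usePrecompiled, driver.version) follow the documented precompiled-driver setup; the CUDA image tag and driver branch are placeholders, adjust to the actual deployment:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v25.3.3 \
  --set driver.usePrecompiled=true \
  --set driver.version=570

# Minimal GPU pod standing in for the Triton deployment
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF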

Actual Behavior
• Device nodes only appear under:

/run/nvidia/driver/dev/nvidia0
/run/nvidia/driver/dev/nvidiactl
/run/nvidia/driver/dev/nvidia-uvm
/run/nvidia/driver/dev/nvidia-modeset

• No /dev/nvidia* entries are created on the host.
• Pod creation fails with:

failed to generate spec: lstat /dev/nvidiactl: no such file or directory
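
The layout above was verified from the node itself; a minimal sketch of the check, assuming node access via kubectl debug (the node name is a placeholder):

# Open a shell on the affected GPU node
kubectl debug node/<gpu-node-name> -it --image=ubuntu:22.04 -- chroot /host bash

# Device nodes exist only under the driver install dir ...
ls -l /run/nvidia/driver/dev/nvidia*
# ... and are absent from the host /dev
ls -l /dev/nvidia*    # "No such file or directory"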

Expected Behavior
• NVIDIA device nodes should be created under /dev on the host, and the NVIDIA container runtime should detect them so that GPU workloads can run.

Debugging Performed
• ✅ Precompiled driver 570.158.01 installed by DaemonSet without error.
• ✅ lsmod | grep nvidia confirms modules loaded.
• ✅ /run/nvidia/driver/dev/ has device nodes.
• ❌ /dev/nvidia* missing on host.
• ✅ NVIDIA container toolkit installed under /usr/local/nvidia/toolkit.
• containerd runtime config sets BinaryName → /usr/local/toolkit/nvidia-container-runtime.
• ❌ which nvidia-container-runtime → not found.
• ✅ Toolkit DaemonSet logs confirm runtime binaries and config installed.
• ✅ Tried adding NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=all.
• ❌ Triton pod still fails (runtime error occurs before container start).
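
The toolkit checks above were run roughly as follows (sketch from a node shell; note that /usr/local/nvidia/toolkit is not on the default PATH, so a failing which nvidia-container-runtime does not by itself prove the binary is missing; the containerd config path is the AKS default and may differ):

lsmod | grep nvidia                                  # driver modules loaded
ls /run/nvidia/driver/dev/                           # device nodes in the driver root
ls /usr/local/nvidia/toolkit/                        # runtime binaries installed by the toolkit DaemonSet
grep -A3 BinaryName /etc/containerd/config.toml      # runtime handler wiring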

ClusterPolicy Snippet

spec:
  hostPaths:
    driverInstallDir: /run/nvidia/driver
    rootFS: /
  devicePlugin:
    enabled: true
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
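
For completeness, the overall operator state can be checked via the ClusterPolicy status (sketch; resource name and status field as exposed by recent GPU Operator releases, namespace as used in the install sketch above):

kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'   # expected: ready
kubectl get pods -n gpu-operator                                                 # driver, toolkit, device-plugin pods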

Questions
• On AKS, should /dev/nvidia* device nodes be created directly on the host, or is it expected that they remain only under /run/nvidia/driver/dev/?
• Is GPU Operator supposed to configure NVIDIA container runtime differently when using precompiled driver images?
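
Related to the first question: when the operator runs the driver in a container, the toolkit config is expected to point the runtime at the driver root rather than at the host /dev. A sketch of what could be inspected to confirm this, with the config location assumed from the operator-installed toolkit layout (exact path may differ):

# Config written by the toolkit DaemonSet (location assumed; adjust if different on the node)
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# The [nvidia-container-cli] section should carry the driver root, e.g.:
# [nvidia-container-cli]
#   root = "/run/nvidia/driver"
#   path = "/usr/local/nvidia/toolkit/nvidia-container-cli"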
