
Rootless Docker CDI Injection: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all: unknown. #434

@LukasIAO

Description

Hello everyone,

We recently set up a rootless Docker instance alongside the existing rootful Docker on one of our servers, but ran into issues mounting host GPUs into the rootless containers. A workaround was presented in issue #85 (toggling no-cgroups to switch between rootful and rootless), with a mention that a better solution, NVIDIA CDI, was coming as an experimental feature in Docker 25.

After updating to the newest Docker release and setting up CDI, our regular (rootful) Docker instance behaves as we expected from the documentation, but the rootless instance still runs into issues.
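
For context, the Docker 25+ CDI support mentioned above is gated behind an opt-in feature flag in daemon.json (/etc/docker/daemon.json for the rootful daemon); a minimal sketch:

{
  "features": {
    "cdi": true
  }
}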

Setup to reproduce:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

NVIDIA Container Toolkit CLI version 1.14.6
commit: 5605d191332dcfeea802c4497360d60a65c7887e

rootless: containerd github.com/containerd/containerd v1.7.13 7c3aca7a610df76212171d200ca3811ff6096eb8
rootful: containerd containerd.io 1.6.28 ae07eda36dd25f8a1b98dfbf587313b99c0190bb
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:01:00.0 Off |                    0 |
| N/A   40C    P0              61W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:47:00.0 Off |                    0 |
| N/A   39C    P0              55W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   39C    P0              57W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA DGX Display             On  | 00000000:C1:00.0 Off |                  N/A |
| 34%   41C    P8              N/A /  50W |      1MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:C2:00.0 Off |                    0 |
| N/A   39C    P0              58W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
config.toml:
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
  • sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia.yaml:
cdiVersion: 0.5.0
containerEdits:
  deviceNodes:
  - path: /dev/nvidia-modeset
  - path: /dev/nvidia-uvm
  - path: /dev/nvidia-uvm-tools
  - path: /dev/nvidiactl
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - create-symlinks
    - --link
    - libglxserver_nvidia.so.535.161.07::/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /lib/x86_64-linux-gnu
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  mounts:
  - containerPath: /lib/x86_64-linux-gnu/libEGL_nvidia.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libEGL_nvidia.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libGLESv2_nvidia.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libGLESv2_nvidia.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libGLX_nvidia.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libGLX_nvidia.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libcuda.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libcuda.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libcudadebugger.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libcudadebugger.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvcuvid.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvcuvid.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-allocator.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-allocator.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-cfg.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-cfg.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.0
    hostPath: /lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.0
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-encode.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-encode.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-fbc.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-fbc.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-glcore.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-glcore.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-glsi.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-glsi.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-ngx.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-ngx.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-nscq.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-nscq.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-opencl.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-opencl.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-rtcore.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-rtcore.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-tls.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-tls.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/libnvoptix.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/libnvoptix.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /run/nvidia-persistenced/socket
    hostPath: /run/nvidia-persistenced/socket
    options:
    - ro
    - nosuid
    - nodev
    - bind
    - noexec
  - containerPath: /usr/bin/nvidia-cuda-mps-control
    hostPath: /usr/bin/nvidia-cuda-mps-control
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-cuda-mps-server
    hostPath: /usr/bin/nvidia-cuda-mps-server
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-debugdump
    hostPath: /usr/bin/nvidia-debugdump
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-persistenced
    hostPath: /usr/bin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-smi
    hostPath: /usr/bin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/nvidia/nvoptix.bin
    hostPath: /usr/share/nvidia/nvoptix.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/535.161.07/gsp_ga10x.bin
    hostPath: /lib/firmware/nvidia/535.161.07/gsp_ga10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/535.161.07/gsp_tu10x.bin
    hostPath: /lib/firmware/nvidia/535.161.07/gsp_tu10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.535.161.07
    hostPath: /lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.535.161.07
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
    hostPath: /lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/X11/xorg.conf.d/10-nvidia.conf
    hostPath: /usr/share/X11/xorg.conf.d/10-nvidia.conf
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    hostPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    hostPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    hostPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    hostPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia4
    - path: /dev/dri/card5
    - path: /dev/dri/renderD132
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card5::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD132::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia3
    - path: /dev/dri/card4
    - path: /dev/dri/renderD131
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card4::/dev/dri/by-path/pci-0000:47:00.0-card
      - --link
      - ../renderD131::/dev/dri/by-path/pci-0000:47:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: "1"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia2
    - path: /dev/dri/card3
    - path: /dev/dri/renderD130
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card3::/dev/dri/by-path/pci-0000:81:00.0-card
      - --link
      - ../renderD130::/dev/dri/by-path/pci-0000:81:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: "2"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia1
    - path: /dev/dri/card2
    - path: /dev/dri/renderD129
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card2::/dev/dri/by-path/pci-0000:c2:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:c2:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: "4"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia1
    - path: /dev/nvidia2
    - path: /dev/nvidia3
    - path: /dev/nvidia4
    - path: /dev/dri/card2
    - path: /dev/dri/card3
    - path: /dev/dri/card4
    - path: /dev/dri/card5
    - path: /dev/dri/renderD129
    - path: /dev/dri/renderD130
    - path: /dev/dri/renderD131
    - path: /dev/dri/renderD132
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card5::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD132::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card4::/dev/dri/by-path/pci-0000:47:00.0-card
      - --link
      - ../renderD131::/dev/dri/by-path/pci-0000:47:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card3::/dev/dri/by-path/pci-0000:81:00.0-card
      - --link
      - ../renderD130::/dev/dri/by-path/pci-0000:81:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card2::/dev/dri/by-path/pci-0000:c2:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:c2:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: all
kind: nvidia.com/gpu
$ nvidia-ctk cdi list
INFO[0000] Found 5 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=2
nvidia.com/gpu=4
nvidia.com/gpu=all
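
A sanity check worth doing here (with <rootless-user> as a placeholder for whichever account runs the rootless daemon) is to confirm the specs also resolve for that unprivileged user:

$ sudo -u <rootless-user> nvidia-ctk cdi list
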
  • Rootful Docker version 26.0.0, build 2ae903e
  • Rootless Docker version 26.0.0, build 2ae903e (install script)
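    (for reference, the standard rootless install paths are the upstream script, curl -fsSL https://get.docker.com/rootless | sh, or dockerd-rootless-setuptool.sh install on hosts where the Docker packages are already present)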

The issue:
With no-cgroups = false, CDI injection works fine for the regular Docker instance:

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)

but produces the following errors for the rootless version:

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all: unknown.
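
One thing I cannot rule out, so I am flagging it as an assumption: the rootless daemon may not be reading the same CDI spec directories as the rootful one. Docker takes both the feature flag and the spec directories from daemon.json, which for rootless lives at ~/.config/docker/daemon.json; a minimal sketch using the documented defaults:

{
  "features": {
    "cdi": true
  },
  "cdi-spec-dirs": ["/etc/cdi", "/var/run/cdi"]
}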

Running docker run --rm --gpus all ubuntu nvidia-smi under rootless (i.e., the legacy path, without CDI) fails in the same way as before the update. This seems to be consistent across all the variations listed on the Specialized Configurations for Docker page:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
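
As far as I understand it, this bpf_prog_query failure is the same limitation that #85 works around: in legacy mode, nvidia-container-cli tries to manage the device cgroup via eBPF, which an unprivileged daemon is not permitted to do. The toggle itself is just an edit to /etc/nvidia-container-runtime/config.toml; against the config shown above it amounts to something like:

$ sudo sed -i 's/^#no-cgroups = true/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml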

Interestingly, setting no-cgroups = true disables the regular use of GPUs with rootful Docker:

$ docker run --rm --gpus all ubuntu nvidia-smi
Failed to initialize NVML: Unknown Error

but still allows for CDI injections:

$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)

With control groups disabled, the rootless daemon is able to use exposed GPUs as outlined in the Docker docs:

$ docker run -it --rm --gpus '"device=0,2"' ubuntu nvidia-smi
Mon Apr  1 16:33:52 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:01:00.0 Off |                    0 |
| N/A   37C    P0              60W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:81:00.0 Off |                    0 |
| N/A   36C    P0              56W / 275W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

TL;DR
Disabling cgroups allows rootless containers to use exposed GPUs via the regular docker run --gpus flag, but in turn breaks GPU access for rootful containers. Leaving cgroups enabled reverses the effect, as outlined in #85.

With cgroups disabled and NVIDIA CDI in use, rootful Docker can still inject GPUs via CDI even though its regular --gpus access is broken, while the rootless containers can use the exposed GPUs. CDI injection under rootless fails in both cases, however.

This seems like a definite improvement, but I'm not sure it's intended behavior. That CDI injection fails with rootless regardless of the cgroup setting leads me to believe this is unintended, unless rootless is not yet supported by NVIDIA CDI.
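
One experiment that might isolate the problem, offered as a guess rather than a known fix: pin the runtime mode instead of relying on auto-detection, since the rootless stderr above shows "Auto-detected mode as 'legacy'". In /etc/nvidia-container-runtime/config.toml:

[nvidia-container-runtime]
mode = "cdi"

With the mode pinned, the legacy nvidia-container-cli path should be skipped entirely, which would at least show whether the remaining rootless failure is purely one of CDI spec resolution.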

Any insights would be greatly appreciated!
