how to get Vulkan working on Docker swarm #1047

@papaj-na-wrotkach

Description

I am trying to deploy a Docker Swarm stack with an app that needs GPU Vulkan acceleration. It works in standalone Docker when I run it like this:

$ docker run --rm \
    --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
    papajnawrotkach/kurp:latest

My swarm has two manager nodes: the first is my PC with the GPU, the other is a VPS. Only on my PC, which will be running the container, did I install the NVIDIA Container Toolkit and tweak the configuration based on information I found on the Internet:

// /etc/docker/daemon.json

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
// added the following lines
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-5f2feb73"
  ]
}
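
The value in node-generic-resources has to identify a GPU on that node; the GPU-5f2feb73 above is the first segment of the full GPU UUID. Assuming nvidia-smi is installed on the host, the UUIDs can be listed with:

# print the UUID and name of every GPU on this node
$ nvidia-smi --query-gpu=uuid,name --format=csv,noheader
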
# /etc/nvidia-container-runtime/config.toml

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
swarm-resource = "DOCKER_RESOURCE_GPU" # uncommented this line

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
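
After editing daemon.json and config.toml the Docker daemon on the GPU node needs a restart for default-runtime, node-generic-resources and swarm-resource to take effect. A quick sanity check (assuming systemd, and mypc as the node's hostname) looks like this:

# restart the daemon and confirm the nvidia runtime is listed as the default
$ sudo systemctl restart docker
$ docker info | grep -i runtime
# confirm the swarm node now advertises the generic resource
$ docker node inspect mypc --format '{{json .Description.Resources.GenericResources}}'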

My docker-compose.yml for the stack looks like this:

services:
  kurp:
    image: docker.io/papajnawrotkach/kurp:latest
    hostname: kurp
    user: ${PUID}:${PGID}
    deploy:
      mode: global
      placement:
        constraints:
          - node.hostname == mypc
      restart_policy:
        condition: any
      labels:
        traefik.enable: "true"

        traefik.http.routers.kurp.entrypoints: https
        traefik.http.routers.kurp.rule: "Host(`kurp.${DOMAIN}`)"

        traefik.http.services.kurp.loadbalancer.server.port: $KURP_PORT
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec: { kind: "NVIDIA-GPU", value: 0 }
    configs:
      - source: kurp_config
        target: /config/config.yml
    environment:
      NVIDIA_DRIVER_CAPABILITIES: all
      VK_ICD_FILENAMES: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
      KURP_PORT:
      KURP_UPSTREAM_URL: http://komga:${KOMGA_PORT}
      RUST_LOG: info
      RUST_BACKTRACE: 1

configs:
  kurp_config:
    file: ./kurp.yml

networks:
  default:
    name: public
    external: true

The environment variables:

PUID=1000
PGID=1000
DOMAIN=mydomain.net
KURP_PORT=25632
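
As a sanity check independent of the stack, a throwaway service can show whether swarm schedules the generic resource at all and whether the runtime injects the GPU; the service name, image tag and resource count below are only for illustration:

# one-shot service pinned to the GPU node; its log should show nvidia-smi output if scheduling works
$ docker service create --name gpu-test \
    --restart-condition none \
    --constraint node.hostname==mypc \
    --generic-resource "NVIDIA-GPU=1" \
    nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
$ docker service logs gpu-test
$ docker service rm gpu-test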

When I deploy the stack, the app fails with:

vkCreateInstance failed -9

thread 'tokio-runtime-worker' panicked at realcugan-ncnn-vulkan-rs/src/realcugan.rs:114:17:
invalid gpu device
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: realcugan_ncnn_vulkan_rs::realcugan::RealCugan::new
   3: kurp::upscaler::upscaler::RealCuganUpscaler::new
   4: <kurp::upscaler::upscale_actor::UpscaleActor as ractor::actor::Actor>::pre_start::{{closure}}
   5: ractor::actor::Actor::spawn_linked::{{closure}}
   6: <kurp::upscaler::upscale_actor::UpscaleSupervisorActor as ractor::actor::Actor>::handle::{{closure}}
   7: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   8: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
   9: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  10: ractor::actor::ActorRuntime<TActor>::start::{{closure}}::{{closure}}
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::harness::Harness<T,S>::poll
  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  14: tokio::runtime::context::scoped::Scoped<T>::set
  15: tokio::runtime::context::runtime::enter_runtime
  16: tokio::runtime::scheduler::multi_thread::worker::run
  17: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  18: tokio::runtime::task::core::Core<T,S>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I believe this is caused by the missing /usr/share/glvnd/egl_vendor.d/10_nvidia.json file. The file is present when the container is run with docker run ... but not with docker stack deploy .... What is the correct way to set up the NVIDIA Container Toolkit for Docker Swarm?
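
To double-check that the missing file really is the difference, the running task's container can be inspected directly; <container-id> below is a placeholder for whatever docker ps reports:

# find the task container started by the stack and look for the ICD file inside it
$ docker ps --filter name=kurp --format '{{.ID}}  {{.Names}}'
$ docker exec <container-id> ls -l /usr/share/glvnd/egl_vendor.d/
# check whether the driver libraries were injected at all
$ docker exec <container-id> nvidia-smi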
