how to get Vulkan working on Docker swarm #1047

@papaj-na-wrotkach

Description

I am trying to deploy a Docker Swarm stack with an app that needs GPU Vulkan acceleration. It works in standalone Docker when I run it like this:

$ docker run --rm \
    --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
    papajnawrotkach/kurp:latest

My swarm has two manager nodes: the first is my PC with the GPU, the other is a VPS. Only on my PC, which will be running the container, did I install the NVIDIA Container Toolkit and tweak the configuration based on information I found on the Internet:

// /etc/docker/daemon.json

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
// added the following lines
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-5f2feb73"
  ]
}
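
The value in node-generic-resources has to identify a GPU on that node; the GPU-5f2feb73 above is the first segment of the full GPU UUID. Assuming nvidia-smi is installed on the host, the UUIDs can be listed with:

# print the UUID and name of every GPU on this node
$ nvidia-smi --query-gpu=uuid,name --format=csv,noheader
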
# /etc/nvidia-container-runtime/config.toml

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
swarm-resource = "DOCKER_RESOURCE_GPU" # uncommented this line

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"
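
After editing daemon.json and config.toml the Docker daemon on the GPU node needs a restart for default-runtime, node-generic-resources and swarm-resource to take effect. A quick sanity check (assuming systemd, and mypc as the node's hostname) looks like this:

# restart the daemon and confirm the nvidia runtime is listed as the default
$ sudo systemctl restart docker
$ docker info | grep -i runtime
# confirm the swarm node now advertises the generic resource
$ docker node inspect mypc --format '{{json .Description.Resources.GenericResources}}'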

My docker-compose.yml for the stack looks like this:

services:
  kurp:
    image: docker.io/papajnawrotkach/kurp:latest
    hostname: kurp
    user: ${PUID}:${PGID}
    deploy:
      mode: global
      placement:
        constraints:
          - node.hostname == mypc
      restart_policy:
        condition: any
      labels:
        traefik.enable: "true"

        traefik.http.routers.kurp.entrypoints: https
        traefik.http.routers.kurp.rule: "Host(`kurp.${DOMAIN}`)"

        traefik.http.services.kurp.loadbalancer.server.port: $KURP_PORT
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec: { kind: "NVIDIA-GPU", value: 0 }
    configs:
      - source: kurp_config
        target: /config/config.yml
    environment:
      NVIDIA_DRIVER_CAPABILITIES: all
      VK_ICD_FILENAMES: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
      KURP_PORT:
      KURP_UPSTREAM_URL: http://komga:${KOMGA_PORT}
      RUST_LOG: info
      RUST_BACKTRACE: 1

configs:
  kurp_config:
    file: ./kurp.yml

networks:
  default:
    name: public
    external: true

The environment variables:

PUID=1000
PGID=1000
DOMAIN=mydomain.net
KURP_PORT=25632
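
As a sanity check independent of the stack, a throwaway service can show whether swarm schedules the generic resource at all and whether the runtime injects the GPU; the service name, image tag and resource count below are only for illustration:

# one-shot service pinned to the GPU node; its log should show nvidia-smi output if scheduling works
$ docker service create --name gpu-test \
    --restart-condition none \
    --constraint node.hostname==mypc \
    --generic-resource "NVIDIA-GPU=1" \
    nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
$ docker service logs gpu-test
$ docker service rm gpu-test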

When I deploy the stack, the app fails with:

vkCreateInstance failed -9

thread 'tokio-runtime-worker' panicked at realcugan-ncnn-vulkan-rs/src/realcugan.rs:114:17:
invalid gpu device
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: realcugan_ncnn_vulkan_rs::realcugan::RealCugan::new
   3: kurp::upscaler::upscaler::RealCuganUpscaler::new
   4: <kurp::upscaler::upscale_actor::UpscaleActor as ractor::actor::Actor>::pre_start::{{closure}}
   5: ractor::actor::Actor::spawn_linked::{{closure}}
   6: <kurp::upscaler::upscale_actor::UpscaleSupervisorActor as ractor::actor::Actor>::handle::{{closure}}
   7: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   8: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
   9: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  10: ractor::actor::ActorRuntime<TActor>::start::{{closure}}::{{closure}}
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::harness::Harness<T,S>::poll
  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  14: tokio::runtime::context::scoped::Scoped<T>::set
  15: tokio::runtime::context::runtime::enter_runtime
  16: tokio::runtime::scheduler::multi_thread::worker::run
  17: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  18: tokio::runtime::task::core::Core<T,S>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I believe this is caused by the missing /usr/share/glvnd/egl_vendor.d/10_nvidia.json file. The file is present when the container is run with docker run ... but not with docker stack deploy .... What is the correct way to set up the NVIDIA Container Toolkit for Docker Swarm?
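
To double-check that the missing file really is the difference, the running task's container can be inspected directly; <container-id> below is a placeholder for whatever docker ps reports:

# find the task container started by the stack and look for the ICD file inside it
$ docker ps --filter name=kurp --format '{{.ID}}  {{.Names}}'
$ docker exec <container-id> ls -l /usr/share/glvnd/egl_vendor.d/
# check whether the driver libraries were injected at all
$ docker exec <container-id> nvidia-smi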
