I am trying to deploy a Docker Swarm stack with an app that needs GPU Vulkan acceleration. It works in standalone Docker when I run it like this:
$ docker run --rm \
--gpus all \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
papajnawrotkach/kurp:latest
My swarm has two manager nodes: the first is my PC with the GPU, the other is a VPS. Swarm services do not support the --gpus flag, so, based on information I found on the Internet, I installed the NVIDIA Container Toolkit only on my PC (the node that will run the container) and tweaked its configuration:
# /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
swarm-resource = "DOCKER_RESOURCE_GPU" # uncommented this line
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"My docker-compose.yml for the stack looks like this:
My docker-compose.yml for the stack looks like this:

services:
  kurp:
    image: docker.io/papajnawrotkach/kurp:latest
    hostname: kurp
    user: ${PUID}:${PGID}
    deploy:
      mode: global
      placement:
        constraints:
          - node.hostname == mypc
      restart_policy:
        condition: any
      labels:
        traefik.enable: "true"
        traefik.http.routers.kurp.entrypoints: https
        traefik.http.routers.kurp.rule: "Host(`kurp.${DOMAIN}`)"
        traefik.http.services.kurp.loadbalancer.server.port: $KURP_PORT
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec: { kind: "NVIDIA-GPU", value: 0 }
    configs:
      - source: kurp_config
        target: /config/config.yml
    environment:
      NVIDIA_DRIVER_CAPABILITIES: all
      VK_ICD_FILENAMES: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
      KURP_PORT:
      KURP_UPSTREAM_URL: http://komga:${KOMGA_PORT}
      RUST_LOG: info
      RUST_BACKTRACE: 1

configs:
  kurp_config:
    file: ./kurp.yml

networks:
  default:
    name: public
    external: true
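
If I read the Compose deploy reference correctly, value is the number of resource units to reserve, so the value: 0 above may not actually claim the GPU for the task. A variant I have been meaning to try (a sketch, assuming a single GPU advertised as NVIDIA-GPU on the node):

      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "NVIDIA-GPU"
                value: 1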

The environment variables:

PUID=1000
PGID=1000
DOMAIN=mydomain.net
KURP_PORT=25632

When I deploy the stack:
vkCreateInstance failed -9
thread 'tokio-runtime-worker' panicked at realcugan-ncnn-vulkan-rs/src/realcugan.rs:114:17:
invalid gpu device
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: realcugan_ncnn_vulkan_rs::realcugan::RealCugan::new
3: kurp::upscaler::upscaler::RealCuganUpscaler::new
4: <kurp::upscaler::upscale_actor::UpscaleActor as ractor::actor::Actor>::pre_start::{{closure}}
5: ractor::actor::Actor::spawn_linked::{{closure}}
6: <kurp::upscaler::upscale_actor::UpscaleSupervisorActor as ractor::actor::Actor>::handle::{{closure}}
7: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
8: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
9: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
10: ractor::actor::ActorRuntime<TActor>::start::{{closure}}::{{closure}}
11: tokio::runtime::task::core::Core<T,S>::poll
12: tokio::runtime::task::harness::Harness<T,S>::poll
13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
14: tokio::runtime::context::scoped::Scoped<T>::set
15: tokio::runtime::context::runtime::enter_runtime
16: tokio::runtime::scheduler::multi_thread::worker::run
17: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
18: tokio::runtime::task::core::Core<T,S>::poll
19: tokio::runtime::task::harness::Harness<T,S>::poll
20: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I believe this is caused by the missing /usr/share/glvnd/egl_vendor.d/10_nvidia.json file: it is present in the container when started with docker run ... but not with docker stack deploy .... What is the correct way to set up the NVIDIA Container Toolkit for Docker Swarm?
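
For completeness, this is one way to compare the two cases (a sketch: it assumes the image ships ls and that the stack container's name contains kurp):

$ docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
    --entrypoint ls papajnawrotkach/kurp:latest -l /usr/share/glvnd/egl_vendor.d/
$ docker exec $(docker ps -qf name=kurp) ls -l /usr/share/glvnd/egl_vendor.d/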