Description
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed
Machine Specs & Environment Setup
OS : Ubuntu 24.04 LTS
NVIDIA Driver : 550.90.07
CUDA Version : 12.4
GPU Model : A6000
Docker Version : 24.0.7, build afdd53b
Steps for Setting Up the NVIDIA Container Toolkit
I followed the steps mentioned in this documentation: Installing the Toolkit
Installing with Apt
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
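For reference, as a sanity check after the install (not part of the documented steps, just what I would expect to work) the toolkit components can be confirmed with:
nvidia-ctk --version
nvidia-container-cli info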
Configuring Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
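As far as I understand, the configure step should have registered an nvidia runtime in /etc/docker/daemon.json (assuming the default config path); this can be double-checked with:
cat /etc/docker/daemon.json
docker info | grep -i runtime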
Issue 1:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.
Caused by:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
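My understanding is that nvidia-smi is injected into the container from the host by the toolkit, so as a possible diagnostic (just a guess at something useful) the following should show whether the host binary is visible to the runtime hook:
nvidia-container-cli list | grep nvidia-smi
which nvidia-smi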
Notes:
Tried running the above command after enabling persistence mode with sudo nvidia-smi -pm ENABLED, but that didn't resolve it. Also tried starting it as a daemon service with sudo nvidia-persistenced.
Then finally tried unloading and reloading the NVIDIA drivers:
sudo systemctl stop gdm
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm
sudo systemctl start gdm
sudo systemctl restart docker
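For completeness, after the reload I would expect the device nodes to be recreated; something like the following can confirm what is actually present (the paths are the usual defaults, not verified on every setup):
nvidia-smi
ls -l /dev/nvidia*
ls -l /dev/dri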
After reloading, I tried executing the same docker command and am now getting the different error mentioned below.
Issue 2:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/card1": no such file or directory: unknown.
Caused by:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Notes:
This issue appeared after unloading and reloading the NVIDIA drivers in the attempt to fix Issue 1 (which is still not resolved).
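One thing I suspect (not confirmed) is that the CDI spec generated before the module reload still references the old /dev/dri/card1 node; if so, regenerating it might help, e.g.:
ls /dev/dri
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo systemctl restart docker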
Special Notes
nvidia-persistenced does not start when launched with sudo nvidia-persistenced; it throws the following error message:
nvidia-persistenced failed to initialize. Check syslog for more details.
Investigating the syslog turned up these entries:
sudo cat /var/log/syslog | grep nvidia-persistenced
2024-09-03T20:57:06.322985+00:00 toro docker.nvidia-container-toolkit[1787337]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket"
2024-09-03T20:57:06.324024+00:00 toro docker.nvidia-container-toolkit[1787337]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced as /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced"
2024-09-03T20:57:06.657932+00:00 toro docker.nvidia-container-toolkit[1787496]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket"
2024-09-03T20:57:06.658053+00:00 toro docker.nvidia-container-toolkit[1787496]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced as /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced"
2024-09-06T11:39:16.670934+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode disabled.
2024-09-06T11:39:16.671113+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory offlined.
2024-09-06T11:39:24.050313+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode enabled.
2024-09-06T11:39:24.050493+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory onlined.
2024-09-06T11:59:14.012734+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:14.012804+00:00 toro nvidia-persistenced: Shutdown (1902356)
2024-09-06T11:59:19.584467+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:19.584576+00:00 toro nvidia-persistenced: Shutdown (1902361)
2024-09-06T11:59:59.116947+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:59.117104+00:00 toro nvidia-persistenced: Shutdown (1902386)
2024-09-06T12:00:43.320943+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode disabled.
2024-09-06T12:00:43.321232+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory offlined.
2024-09-06T12:00:46.881181+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T12:00:46.881291+00:00 toro nvidia-persistenced: Shutdown (1902413)
2024-09-06T12:01:13.921359+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T12:01:13.921592+00:00 toro nvidia-persistenced: Shutdown (1902426)
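The repeated "Failed to lock PID file" lines look like another nvidia-persistenced instance may already be holding the PID file; a quick check (assuming the default systemd unit name and PID file location) would be:
systemctl status nvidia-persistenced
pgrep -a nvidia-persistenced
sudo cat /var/run/nvidia-persistenced/nvidia-persistenced.pid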
I'm not sure if this is the root cause of the issues I'm experiencing; I hope this information helps with troubleshooting and finding a solution.