
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed. #679

@praxi-roshan

Description

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed

Machine Specs & Environment Setup
OS : Ubuntu 24.04 LTS
NVIDIA Driver : 550.90.07
CUDA Version : 12.4
GPU Model : A6000
Docker Version : 24.0.7, build afdd53b

Steps of Setting up the Nvidia Container Toolkit
I followed the steps mentioned in this documentation: Installing the Toolkit

Installing with Apt

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
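
For reference, a quick check that the toolkit package actually installed (version output will of course differ per system) is:

nvidia-ctk --version

dpkg -l | grep nvidia-container-toolkit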

Configuring Docker

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker
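
For reference, after this step /etc/docker/daemon.json should contain an nvidia runtime entry roughly like the sketch below, and docker info should list the runtime. I'm including this only as the expected shape, not output copied from my machine:

sudo cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

docker info | grep -i runtimes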

Issue 1:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.

Caused by:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
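
As a side note on this command: the stock ubuntu image does not ship nvidia-smi, so the binary has to be injected into the container by the NVIDIA runtime. A rough way to check whether that injection path works at all (the CUDA image tag below is only an example) is:

nvidia-container-cli info

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi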

Notes:
I tried to run the above command after enabling persistence mode with sudo nvidia-smi -pm ENABLED, but that didn't resolve it. I also tried starting it as a daemon service with sudo nvidia-persistenced.

Then I finally tried unloading and reloading the NVIDIA drivers:

sudo systemctl stop gdm

sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia

sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm

sudo systemctl start gdm

sudo systemctl restart docker
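
For reference, a quick sanity check that the reload worked on the host side is:

lsmod | grep nvidia

nvidia-smi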

After reloading, I tried executing the same docker command and I'm now getting the different error mentioned below.

Issue 2:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/card1": no such file or directory: unknown.

Caused by:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Notes:
This issue originated after unloading and reloading the NVIDIA drivers in an attempt to fix Issue 1 (which is still not resolved).
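
Just a guess on my part, not a confirmed fix: since the error refers to /dev/dri/card1, the DRM device nodes may have been renumbered by the driver reload while the toolkit/CDI state still references the old node. Comparing what exists on disk with what the toolkit expects, and regenerating the CDI spec, would look roughly like:

ls -l /dev/dri

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

nvidia-ctk cdi list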

Special Notes
nvidia-persistenced does not start when launched manually with sudo nvidia-persistenced; it throws the following error message:

nvidia-persistenced failed to initialize. Check syslog for more details.

After investigating syslog, I found these entries:

sudo cat /var/log/syslog | grep nvidia-persistenced
2024-09-03T20:57:06.322985+00:00 toro docker.nvidia-container-toolkit[1787337]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket"
2024-09-03T20:57:06.324024+00:00 toro docker.nvidia-container-toolkit[1787337]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced as /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced"
2024-09-03T20:57:06.657932+00:00 toro docker.nvidia-container-toolkit[1787496]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket"
2024-09-03T20:57:06.658053+00:00 toro docker.nvidia-container-toolkit[1787496]: time="2024-09-03T20:57:06Z" level=info msg="Selecting /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced as /var/lib/snapd/hostfs/usr/bin/nvidia-persistenced"
2024-09-06T11:39:16.670934+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode disabled.
2024-09-06T11:39:16.671113+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory offlined.
2024-09-06T11:39:24.050313+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode enabled.
2024-09-06T11:39:24.050493+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory onlined.
2024-09-06T11:59:14.012734+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:14.012804+00:00 toro nvidia-persistenced: Shutdown (1902356)
2024-09-06T11:59:19.584467+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:19.584576+00:00 toro nvidia-persistenced: Shutdown (1902361)
2024-09-06T11:59:59.116947+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T11:59:59.117104+00:00 toro nvidia-persistenced: Shutdown (1902386)
2024-09-06T12:00:43.320943+00:00 toro nvidia-persistenced: device 0000:01:00.0 - persistence mode disabled.
2024-09-06T12:00:43.321232+00:00 toro nvidia-persistenced: device 0000:01:00.0 - NUMA memory offlined.
2024-09-06T12:00:46.881181+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T12:00:46.881291+00:00 toro nvidia-persistenced: Shutdown (1902413)
2024-09-06T12:01:13.921359+00:00 toro nvidia-persistenced: Failed to lock PID file: Resource temporarily unavailable
2024-09-06T12:01:13.921592+00:00 toro nvidia-persistenced: Shutdown (1902426)

I'm not sure if this is the root cause of the issues I'm experiencing; I hope this information helps with troubleshooting and finding a solution.
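
For what it's worth, my (unverified) understanding is that "Failed to lock PID file: Resource temporarily unavailable" usually means another nvidia-persistenced instance already holds the PID file, for example one started by systemd, which would explain why a manual sudo nvidia-persistenced fails to initialize. That can be checked with:

systemctl status nvidia-persistenced

pgrep -a nvidia-persistenced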
