Hello,
We are currently running operator version: 24.6.2
The driver version we are trying to run is 550-5.15.0-1078-azure.
However, the nvidia-driver init step is failing for the daemonset nvidia-driver-daemonset.
The logs show the following:
Get:57 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 nvidia-driver-550-server amd64 550.127.08-0ubuntu0.22.04.1 [489 kB]
Ign:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
Ign:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
Ign:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
Ign:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
Err:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Err:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
404 Not Found [IP: 91.189.91.81 80]
Fetched 291 MB in 8s (35.3 MB/s)
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-stdlib_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10_3.10.12-1%7e22.04.7_amd64.deb 404 Not Found [IP: 91.189.91.81 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
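To see at a glance which packages the mirror refused to serve, the "E: Failed to fetch" lines of a log like the one above can be reduced to bare package file names. This is only a minimal sketch: the function name failed_debs is hypothetical, and it assumes the log is fed on stdin. The %7e in the URLs is the URL-encoded ~ of the version string.

```shell
# Hypothetical helper: print the .deb file name from each
# "E: Failed to fetch <url> ..." line read on stdin.
failed_debs() {
    # Field 5 of such a line is the URL. Strip everything up to the
    # last slash, then decode %7e back to "~".
    awk '/^E: Failed to fetch / { print $5 }' \
        | sed -e 's|.*/||' -e 's|%7e|~|g'
}
```

For example, with the log saved to a file (driver.log is a placeholder name), `failed_debs < driver.log` lists the four python3.10 packages that returned 404.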
It seems that it cannot fetch these packages from us.archive.ubuntu.com for whatever reason. I pulled the image locally on my desktop and can reproduce the same problem:
podman pull nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04
podman run -it --rm --entrypoint /bin/bash nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04
# now install a package or run nvidia-driver init
apt-get install vim
or
mkdir /run/nvidia
nvidia-driver init
and you will get the 404 Not Found error. Across runs of apt-get install -y vim, I have seen the following IPs listed: [91.189.91.82, 91.189.91.83, 185.125.190.82, 185.125.190.81].
However, if I run apt-get update first, I don't have this problem and the install works. I don't know how to get my gpu-operator to run the daemonset and make sure that apt-get update is run. Maybe this is just a problem with the image itself, or maybe nvidia-driver should run apt-get update before it tries to install the packages.
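The two behaviours described above can be compared non-interactively. This is a minimal sketch, assuming podman is available on the machine. The function name try_install and the PODMAN override variable are hypothetical and exist only for illustration; the image tag is the one used in the repro steps.

```shell
IMAGE=nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

# Hypothetical helper: run one shell command line inside a throwaway
# container of the driver image. PODMAN is an illustrative override
# (for example docker); it defaults to podman.
try_install() {
    "${PODMAN:-podman}" run --rm --entrypoint /bin/bash "$IMAGE" -c "$1"
}

# Reported above to fail with 404 Not Found, presumably because the
# package lists baked into the image are out of date:
#   try_install 'apt-get install -y vim'
# Reported above to work:
#   try_install 'apt-get update && apt-get install -y vim'
```

If the second call succeeds where the first fails, that points at the package lists shipped in the image rather than at network access to the mirror.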
From the nvidia-driver shell script, which is the entrypoint of the driver daemonset:
# Link and install the kernel modules from a precompiled packages
_install_driver() {
# Install necessary userspace, fabric manager and libnvidia-nscq packages
apt-get install -y --no-install-recommends nvidia-driver-${DRIVER_BRANCH}-server
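The change suggested above could look roughly like the following. It is a sketch of the idea, not the actual upstream script or fix: the function name install_driver_with_refresh and the APT_GET override variable are hypothetical, and only the apt-get install line is taken from the script quoted above.

```shell
# Hypothetical variant of the install step: refresh the package lists
# first so that apt resolves versions still present on the mirror.
install_driver_with_refresh() {
    # APT_GET is an illustrative override hook; it defaults to apt-get.
    "${APT_GET:-apt-get}" update || return 1
    "${APT_GET:-apt-get}" install -y --no-install-recommends \
        "nvidia-driver-${DRIVER_BRANCH}-server"
}
```

Stopping on a failed update keeps the install from running against stale lists, which is the situation that produces the 404 errors in the log.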