Nvidia driver daemonset does not run due to apt-cache issue. #1244

@ScottWatsonWork

Hello,

We are currently running GPU Operator version 24.6.2.

The driver version we are trying to run is 550-5.15.0-1078-azure.

However, the nvidia-driver init step is failing for the nvidia-driver-daemonset daemonset.

From the logs I see the following:

Get:57 http://us.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 nvidia-driver-550-server amd64 550.127.08-0ubuntu0.22.04.1 [489 kB]
Ign:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
Ign:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
Ign:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
Ign:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
Err:6 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-minimal amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:7 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10-minimal amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:11 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 libpython3.10-stdlib amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Err:12 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3.10 amd64 3.10.12-1~22.04.7
  404  Not Found [IP: 91.189.91.81 80]
Fetched 291 MB in 8s (35.3 MB/s)
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10-minimal_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/libpython3.10-stdlib_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Failed to fetch http://us.archive.ubuntu.com/ubuntu/pool/main/p/python3.10/python3.10_3.10.12-1%7e22.04.7_amd64.deb  404  Not Found [IP: 91.189.91.81 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

It seems that it cannot fetch some packages from us.archive.ubuntu.com, for whatever reason. I have pulled the image locally on my desktop and can reproduce the same problem:

podman pull nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

podman run -it --rm --entrypoint /bin/bash nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

# Now, inside the container, either install a package:
apt-get install vim

# or run the driver init:
mkdir /run/nvidia
nvidia-driver init

Either way you get the same "404  Not Found [IP: ...]" error. Across runs of apt-get install -y vim I have seen the following IPs listed: 91.189.91.82, 91.189.91.83, 185.125.190.82, 185.125.190.81.

However, if I run apt-get update first, I don't have this problem and the install works. I don't know how to make the gpu-operator daemonset run apt-get update before the install. Maybe this is just a problem with the image itself, or maybe nvidia-driver should run apt-get update before it tries to install the packages.
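The comparison described above can be scripted so that both cases run in a fresh container each time. This is only a sketch: try_install is a hypothetical helper name, and the outcomes noted in the comments are what I observed, not something guaranteed.

```shell
IMAGE=nvcr.io/nvidia/driver:550-5.15.0-1078-azure-ubuntu22.04

# try_install CMD: run CMD with bash in a fresh container from the driver image.
# (Hypothetical helper, not part of the image or the operator.)
try_install() {
    podman run --rm --entrypoint /bin/bash "$IMAGE" -c "$1"
}

# Usage:
#   try_install 'apt-get install -y vim'                    # package lists from the image; 404 in my runs
#   try_install 'apt-get update && apt-get install -y vim'  # refreshed package lists; install worked
```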

For reference, this is from the nvidia-driver shell script, which is the entrypoint of the driver daemonset:

# Link and install the kernel modules from a precompiled packages
_install_driver() {
    # Install necessary userspace, fabric manager and libnvidia-nscq packages
    apt-get install -y --no-install-recommends nvidia-driver-${DRIVER_BRANCH}-server
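What I have in mind is roughly the following. This is a sketch of a possible change, not the actual script: only the apt-get install line comes from the excerpt above, and the added apt-get update and the early return are my assumptions about how it could be wired in.

```shell
# Hypothetical variant of _install_driver that refreshes the package lists first,
# so apt resolves package versions that still exist on the mirror.
_install_driver() {
    # Assumed addition: refresh the package lists shipped in the image.
    apt-get update || return 1
    # Install necessary userspace, fabric manager and libnvidia-nscq packages
    apt-get install -y --no-install-recommends "nvidia-driver-${DRIVER_BRANCH}-server"
}
```

This would make the container depend on the mirror being reachable at start-up, which it already does for the install step.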

Labels: bug, needs-triage
