
ImagePullBackOff caused by redundant information from the operator #647

@uhthomas

Description



1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Talos v1.6.1
  • Kernel Version: 6.1.69
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): 1.29.0 - Talos
  • GPU Operator Version: 23.9.1

2. Issue or feature description

The operator tries to pull an invalid driver image: the tag it constructs includes redundant information (the kernel version and OS), and no image with that tag exists in the registry.

❯ k describe po nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  56s               default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd to rhode
  Normal   Pulled     18s               kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5" already present on machine
  Normal   Created    18s               kubelet            Created container k8s-driver-manager
  Normal   Started    18s               kubelet            Started container k8s-driver-manager
  Normal   BackOff    15s               kubelet            Back-off pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    4s (x2 over 17s)  kubelet            Pulling image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1"
  Warning  Failed     2s (x2 over 16s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": failed to resolve reference "nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1": nvcr.io/nvidia/driver:535.129.03-6.1.69-talos-talosv1.6.1: not found
  Warning  Failed     2s (x2 over 16s)  kubelet            Error: ErrImagePull
❯ k get po
NAME                                                     READY   STATUS             RESTARTS      AGE
gpu-feature-discovery-pgc7c                              0/1     Init:0/1           0             2m47s
nvidia-container-toolkit-daemonset-lw22k                 0/1     Init:0/1           0             2m47s
nvidia-dcgm-exporter-qg6j7                               0/1     Init:0/1           0             2m47s
nvidia-device-plugin-daemonset-m8z55                     0/1     Init:0/1           0             2m47s
nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd   0/1     ImagePullBackOff   0             3m25s
nvidia-gpu-operator-79c7dc6d5-8dhhx                      1/1     Running            7 (13m ago)   2d19h
nvidia-operator-validator-xnbhr                          0/1     Init:0/4           0             2m47s
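
The tag in the events above appears to be assembled from the driver version plus the node's kernel version and OS release, and no such tag exists under nvcr.io/nvidia/driver. A minimal workaround sketch, assuming the NVIDIA driver and container toolkit are instead provided by Talos system extensions and relying on the documented gpu-operator chart values driver.enabled and toolkit.enabled:

# Possible workaround (assumption, not verified in this issue): skip the
# operator-managed driver and toolkit containers and let Talos provide them.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false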

3. Steps to reproduce the issue

Deploy the GPU operator with the default configuration on a Talos Kubernetes cluster.
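
A minimal reproduction sketch, assuming the standard NVIDIA Helm repository and a chart version matching the operator version above:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Default configuration: no values overridden.
helm install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator --create-namespace \
  --version v23.9.1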

4. Information to attach (optional if deemed irrelevant)

  • Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi in the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
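
Filled in with the namespace and driver pod name from the output above, these commands would be, for example:

kubectl get pods -n nvidia-gpu-operator
kubectl get ds -n nvidia-gpu-operator
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-6.1.69-talos-talosv1.6.1-xgcqd --all-containers
journalctl -u containerd > containerd.log

The nvidia-smi check is omitted here because the driver container never started.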

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for the debug data it collects.

This bundle can be submitted to us via email: [email protected]
