Description
See the log below, specifically the lines containing "Current value of AUTO_UPGRADE_POLICY_ENABLED=true", "Auto eviction of GPU pods ..", and "Auto drain ...":
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/aks-gputest-50947407-vmss000001 labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-kqw2v condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy
Unloading NVIDIA driver kernel modules...
nvidia_modeset 1306624 0
nvidia_uvm 1527808 4
nvidia 56717312 143 nvidia_uvm,nvidia_modeset
drm 622592 3 drm_kms_helper,nvidia,hyperv_drm
i2c_core 90112 3 drm_kms_helper,nvidia,drm
Could not unload NVIDIA driver kernel modules, driver is in use
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy
Failed to uninstall nvidia driver components
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/aks-gputest-50947407-vmss000001 labeled
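For reference, the component-specific labels being toggled in the log can be inspected directly on the node (node name taken from the log above; this is just an inspection command, not part of the reproduction):
kubectl describe node aks-gputest-50947407-vmss000001 | grep 'nvidia.com/gpu.deploy'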
Just by reading the logic in https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager
(_is_driver_auto_upgrade_policy_enabled in particular), I believe this should not happen.
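For illustration, the kind of gating I would expect from that check is sketched below. This is a simplified, hypothetical reconstruction, not the actual script: only AUTO_UPGRADE_POLICY_ENABLED and _is_driver_auto_upgrade_policy_enabled come from the real driver-manager, while evict_gpu_pods and drain_node are placeholder names.
# Sketch only: how the upgrade-policy check is expected to short-circuit
# the script's own eviction/drain handling.
_is_driver_auto_upgrade_policy_enabled() {
  # Simplified for illustration; the real script derives this differently.
  [ "${AUTO_UPGRADE_POLICY_ENABLED}" = "true" ]
}

if _is_driver_auto_upgrade_policy_enabled; then
  # Expectation: the upgrade controller owns eviction/drain, so the script
  # should stop here and never reach its own "Auto eviction"/"Auto drain" branches.
  echo "Upgrade policy enabled; leaving eviction and drain to the upgrade controller"
else
  evict_gpu_pods   # placeholder for the script's pod-eviction handling
  drain_node       # placeholder for the script's node-drain handling
fi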
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-1066
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): AKS
- GPU Operator Version: 24.3.0
2. Issue or feature description
The auto eviction and auto drain checks are evaluated further and are not stopped by _is_driver_auto_upgrade_policy_enabled, even though AUTO_UPGRADE_POLICY_ENABLED=true.
3. Steps to reproduce the issue
Default install with Upgrade Controller enabled.
Kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running.
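Something along these lines can be used to trigger it (the gpu-operator namespace, the app=nvidia-driver-daemonset label selector, and the pod name are assumptions/placeholders for a default install; adjust them to the actual setup):
# Find the driver pod running on the node that already has GPU pods scheduled
# (assumed namespace and label selector)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide
# Delete it so the driver-manager runs again on that node (placeholder pod name)
kubectl delete pod -n gpu-operator nvidia-driver-daemonset-xxxxx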
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]