
AUTO_UPGRADE_POLICY_ENABLED set to true, but eviction and drain are "disabled by the upgrade policy" #901


Description

@futurewasfree

See the log below, in particular the lines containing Current value of AUTO_UPGRADE_POLICY_ENABLED=true, Auto eviction of GPU pods ..., and Auto drain ...:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label                                                                                                                             
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'                                                                                                                                               
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'                                                                                                                                            
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.dcgm=true'                                                                                                                                                             
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label                                                                                                                                    
Current value of 'nvidia.com/gpu.deploy.mig-manager='                                                                                                                                                          
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.nvsm='                                                                                                                                                                 
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label                                                                                                                            
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='                                                                                                                                                  
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'                                                                                                                                                             
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels                                                                                                    
node/aks-gputest-50947407-vmss000001 labeled                                                                                                                                                                   
Waiting for the operator-validator to shutdown                                                                                                                                                                 
pod/nvidia-operator-validator-kqw2v condition met                                                                                                                                                              
Waiting for the container-toolkit to shutdown                                                                                                                                                                  
Waiting for the device-plugin to shutdown                                                                                                                                                                      
Waiting for gpu-feature-discovery to shutdown                                                                                                                                                                  
Waiting for dcgm-exporter to shutdown                                                                                                                                                                          
Waiting for dcgm to shutdown                                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Unloading NVIDIA driver kernel modules...                                                                                                                                                                      
nvidia_modeset       1306624  0                                                                                                                                                                                
nvidia_uvm           1527808  4                                                                                                                                                                                
nvidia              56717312  143 nvidia_uvm,nvidia_modeset                                                                                                                                                    
drm                   622592  3 drm_kms_helper,nvidia,hyperv_drm                                                                                                                                               
i2c_core               90112  3 drm_kms_helper,nvidia,drm                                                                                                                                                      
Could not unload NVIDIA driver kernel modules, driver is in use                                                                                                                                                
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Failed to uninstall nvidia driver components                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels                                                                                                      
node/aks-gputest-50947407-vmss000001 labeled             

Just from reading the logic in https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager
(_is_driver_auto_upgrade_policy_enabled in particular), I believe this should not happen.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-1066
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): AKS
  • GPU Operator Version: 24.3.0

2. Issue or feature description

The auto-eviction and auto-drain checks are still evaluated and are not stopped by _is_driver_auto_upgrade_policy_enabled, so eviction and drain end up reported as "disabled by the upgrade policy" even though AUTO_UPGRADE_POLICY_ENABLED=true.
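
For completeness, the value can also be checked directly on the driver pod spec. A rough way to do it (the gpu-operator namespace and the app=nvidia-driver-daemonset label selector are assumptions from my install and may differ):

# Show the env var as rendered into the driver daemonset pod (names assumed)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o yaml \
  | grep -A1 AUTO_UPGRADE_POLICY_ENABLED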

3. Steps to reproduce the issue

Default install with the Upgrade Controller enabled.
Kill the nvidia-driver-daemonset pod to trigger a driver reinstall on a node that already has GPU-enabled pods running (see the example below).
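
For reference, the pod kill I used looks roughly like this; the namespace and label selector are assumptions from my install, and the node name is the one from the log above:

# Delete the driver pod on the affected node; the DaemonSet recreates it and the
# driver-manager init container runs the shutdown/uninstall sequence shown in the log.
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=aks-gputest-50947407-vmss000001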

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
