
AUTO_UPGRADE_POLICY_ENABLED set to true, but eviction and drain are "disabled by the upgrade policy" #901


Description

@futurewasfree

See the log below, in particular the lines containing Current value of AUTO_UPGRADE_POLICY_ENABLED=true, Auto eviction of GPU pods ..., and Auto drain ...:

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label                                                                                                                             
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'                                                                                                                                               
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'                                                                                                                                            
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.dcgm=true'                                                                                                                                                             
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label                                                                                                                                    
Current value of 'nvidia.com/gpu.deploy.mig-manager='                                                                                                                                                          
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.nvsm='                                                                                                                                                                 
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label                                                                                                                            
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='                                                                                                                                                  
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'                                                                                                                                                             
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels                                                                                                    
node/aks-gputest-50947407-vmss000001 labeled                                                                                                                                                                   
Waiting for the operator-validator to shutdown                                                                                                                                                                 
pod/nvidia-operator-validator-kqw2v condition met                                                                                                                                                              
Waiting for the container-toolkit to shutdown                                                                                                                                                                  
Waiting for the device-plugin to shutdown                                                                                                                                                                      
Waiting for gpu-feature-discovery to shutdown                                                                                                                                                                  
Waiting for dcgm-exporter to shutdown                                                                                                                                                                          
Waiting for dcgm to shutdown                                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Unloading NVIDIA driver kernel modules...                                                                                                                                                                      
nvidia_modeset       1306624  0                                                                                                                                                                                
nvidia_uvm           1527808  4                                                                                                                                                                                
nvidia              56717312  143 nvidia_uvm,nvidia_modeset                                                                                                                                                    
drm                   622592  3 drm_kms_helper,nvidia,hyperv_drm                                                                                                                                               
i2c_core               90112  3 drm_kms_helper,nvidia,drm                                                                                                                                                      
Could not unload NVIDIA driver kernel modules, driver is in use                                                                                                                                                
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Failed to uninstall nvidia driver components                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels                                                                                                      
node/aks-gputest-50947407-vmss000001 labeled             

Just from reading the logic in https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager
(_is_driver_auto_upgrade_policy_enabled in particular), I believe this should not happen.

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Kernel Version: 5.15.0-1066
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): AKS
  • GPU Operator Version: 24.3.0

2. Issue or feature description

The auto-eviction and auto-drain checks are still evaluated and are not stopped by _is_driver_auto_upgrade_policy_enabled, so eviction and drain end up reported as "disabled by the upgrade policy" even though AUTO_UPGRADE_POLICY_ENABLED=true.
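
For completeness, the value can also be checked directly on the driver pod spec. A rough way to do it (the gpu-operator namespace and the app=nvidia-driver-daemonset label selector are assumptions from my install and may differ):

# Show the env var as rendered into the driver daemonset pod (names assumed)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o yaml \
  | grep -A1 AUTO_UPGRADE_POLICY_ENABLED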

3. Steps to reproduce the issue

Default install with the Upgrade Controller enabled.
Kill the nvidia-driver-daemonset pod to trigger a driver reinstall on a node that already has GPU-enabled pods running (see the example below).
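
For reference, the pod kill I used looks roughly like this; the namespace and label selector are assumptions from my install, and the node name is the one from the log above:

# Delete the driver pod on the affected node; the DaemonSet recreates it and the
# driver-manager init container runs the shutdown/uninstall sequence shown in the log.
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=aks-gputest-50947407-vmss000001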

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
