1. Quick Debug Information
- GPU Operator Version: v23.9.2
2. Issue or feature description
InitContainers have a non-configurable and explicitly empty resources field (`resources: {}`).
3. Steps to reproduce the issue
Applied the following ClusterPolicy:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  ccManager:
    defaultMode: 'off'
    enabled: false
    env: []
    image: k8s-cc-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.1.1
  cdi:
    default: false
    enabled: false
  daemonsets:
    labels:
      app.kubernetes.io/managed-by: gpu-operator
      helm.sh/chart: gpu-operator-v23.9.2
    priorityClassName: system-node-critical
    rollingUpdate:
      maxUnavailable: '1'
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    updateStrategy: RollingUpdate
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.3.0-1-ubuntu22.04
  dcgmExporter:
    enabled: true
    env:
      - name: DCGM_EXPORTER_LISTEN
        value: ':9400'
      - name: DCGM_EXPORTER_KUBERNETES
        value: 'true'
      - name: DCGM_EXPORTER_COLLECTORS
        value: /etc/dcgm-exporter/dcp-metrics-included.csv
    image: dcgm-exporter
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/k8s
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    serviceMonitor:
      additionalLabels: {}
      enabled: true
      honorLabels: true
      interval: 15s
      relabelings: []
    version: 3.3.0-3.2.0-ubuntu22.04
  devicePlugin:
    config:
      name: time-slicing-config
    enabled: true
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v0.14.0
  driver:
    certConfig:
      name: ''
    enabled: true
    image: driver
    imagePullPolicy: IfNotPresent
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: true
    manager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'true'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
        - name: DRAIN_USE_FORCE
          value: 'false'
        - name: DRAIN_POD_SELECTOR_LABEL
          value: ''
        - name: DRAIN_TIMEOUT_SECONDS
          value: 0s
        - name: DRAIN_DELETE_EMPTYDIR_DATA
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.5
    rdma:
      enabled: false
      useHostMofed: false
    repoConfig:
      configMapName: ''
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    startupProbe:
      failureThreshold: 120
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 60
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    useNvidiaDriverCRD: false
    useOpenKernelModules: false
    usePrecompiled: false
    version: 550.54.14
    virtualTopology:
      config: ''
  gfd:
    enabled: true
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: GFD_FAIL_ON_INIT_ERROR
        value: 'true'
    image: gpu-feature-discovery
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v0.8.2-ubi8
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
      runtimeClasses:
        - artifacts:
            pullSecret: ''
            url: >-
              nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.54.03
          name: kata-qemu-nvidia-gpu
          nodeSelector: {}
        - artifacts:
            pullSecret: ''
            url: >-
              nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.86.10-snp
          name: kata-qemu-nvidia-gpu-snp
          nodeSelector:
            nvidia.com/cc.capable: 'true'
    enabled: false
    image: k8s-kata-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.1.2
  mig:
    strategy: single
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: false
    env:
      - name: WITH_REBOOT
        value: 'false'
    gpuClientsConfig:
      name: ''
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.0-ubuntu20.04
  nodeStatusExporter:
    enabled: false
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v23.9.2
  operator:
    defaultRuntime: containerd
    initContainer:
      image: cuda
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia
      version: 12.3.2-base-ubi8
    runtimeClass: nvidia
  psa:
    enabled: false
  psp:
    enabled: false
  sandboxDevicePlugin:
    image: kubevirt-gpu-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v1.2.4
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    image: container-toolkit
    imagePullPolicy: IfNotPresent
    installDir: /usr/local/nvidia
    repository: nvcr.io/nvidia/k8s
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v1.14.6-ubuntu20.04
  validator:
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
    repository: nvcr.io/nvidia/cloud-native
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v23.9.2
  vfioManager:
    driverManager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'false'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.2
    enabled: false
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.3.2-base-ubi8
  vgpuDeviceManager:
    config:
      default: default
      name: ''
    enabled: false
    image: vgpu-device-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.2.4
  vgpuManager:
    driverManager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'false'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.4
    enabled: false
    image: vgpu-manager
    imagePullPolicy: IfNotPresent
```
Resulting DaemonSet (one of them, as an example):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-operator-validator
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-operator-validator
      app.kubernetes.io/part-of: gpu-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-operator-validator
        app.kubernetes.io/managed-by: gpu-operator
        app.kubernetes.io/part-of: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.2
    spec:
      volumes:
        - name: run-nvidia-validations
          hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
        - name: driver-install-path
          hostPath:
            path: /run/nvidia/driver
            type: ''
        - name: host-root
          hostPath:
            path: /
            type: ''
        - name: host-dev-char
          hostPath:
            path: /dev/char
            type: ''
      initContainers:
        - name: driver-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: WITH_WAIT
              value: 'true'
            - name: COMPONENT
              value: driver
          resources: {}
          volumeMounts:
            - name: host-root
              readOnly: true
              mountPath: /host
              mountPropagation: HostToContainer
            - name: driver-install-path
              mountPath: /run/nvidia/driver
              mountPropagation: HostToContainer
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
            - name: host-dev-char
              mountPath: /host-dev-char
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
        - name: toolkit-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: WITH_WAIT
              value: 'false'
            - name: COMPONENT
              value: toolkit
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
        - name: cuda-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: WITH_WAIT
              value: 'false'
            - name: COMPONENT
              value: cuda
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: VALIDATOR_IMAGE
              value: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
            - name: VALIDATOR_IMAGE_PULL_POLICY
              value: IfNotPresent
            - name: VALIDATOR_RUNTIME_CLASS
              value: nvidia
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
        - name: plugin-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: COMPONENT
              value: plugin
            - name: WITH_WAIT
              value: 'false'
            - name: WITH_WORKLOAD
              value: 'false'
            - name: MIG_STRATEGY
              value: single
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: VALIDATOR_IMAGE
              value: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
            - name: VALIDATOR_IMAGE_PULL_POLICY
              value: IfNotPresent
            - name: VALIDATOR_RUNTIME_CLASS
              value: nvidia
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      containers:
        - name: nvidia-operator-validator
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - echo all validations are successful; sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 250Mi
            requests:
              cpu: 200m
              memory: 200Mi
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - rm -f /run/nvidia/validations/*-ready
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.deploy.operator-validator: 'true'
      serviceAccountName: nvidia-operator-validator
      serviceAccount: nvidia-operator-validator
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      runtimeClassName: nvidia
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  revisionHistoryLimit: 10
```
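Note that only the long-running nvidia-operator-validator container picks up the requests and limits configured under `validator.resources` in the ClusterPolicy; all four initContainers are rendered with an explicitly empty `resources: {}`, and there is no ClusterPolicy field that reaches them. The same can be confirmed on a live cluster with `kubectl get ds nvidia-operator-validator -n gpu-operator -o yaml`.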
Is this intended?
Even if the initContainer resources are meant to be non-configurable, wouldn't it make more sense not to set the field at all, which would fall back to the default Kubernetes behavior?
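For context, the empty stanza itself may simply be a serialization artifact rather than something the operator emits deliberately: in `k8s.io/api/core/v1`, `Container.Resources` is a non-pointer struct tagged `json:"resources,omitempty"`, and Go's `encoding/json` never treats a struct value as empty, so an unset `ResourceRequirements` always marshals as `resources: {}`. A minimal sketch illustrating this (container name and image are taken from the DaemonSet above):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Container.Resources is a struct value tagged `json:"resources,omitempty"`.
	// omitempty never omits a struct, so a container with no requests or limits
	// set still marshals with an explicit `resources: {}`.
	c := corev1.Container{
		Name:  "driver-validation",
		Image: "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2",
	}
	out, err := yaml.Marshal(c)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
	// Output includes (among other zero-value fields):
	//   resources: {}
}
```

If that is the case, `resources: {}` and omitting the field entirely are equivalent to the API server (the initContainers get no requests or limits either way), and the remaining question is whether these initContainer resources should be configurable.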
Thank you