InitContainers have non-configurable and explicitly empty resources #702

@miguelglopes

Description

1. Quick Debug Information

  • GPU Operator Version: v23.9.2

2. Issue or feature description

InitContainers have non-configurable and explicitly empty resources (resources: {}).

3. Steps to reproduce the issue

Applied ClusterPolicy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  ccManager:
    defaultMode: 'off'
    enabled: false
    env: []
    image: k8s-cc-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.1.1
  cdi:
    default: false
    enabled: false
  daemonsets:
    labels:
      app.kubernetes.io/managed-by: gpu-operator
      helm.sh/chart: gpu-operator-v23.9.2
    priorityClassName: system-node-critical
    rollingUpdate:
      maxUnavailable: '1'
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    updateStrategy: RollingUpdate
  dcgm:
    enabled: false
    hostPort: 5555
    image: dcgm
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: 3.3.0-1-ubuntu22.04
  dcgmExporter:
    enabled: true
    env:
      - name: DCGM_EXPORTER_LISTEN
        value: ':9400'
      - name: DCGM_EXPORTER_KUBERNETES
        value: 'true'
      - name: DCGM_EXPORTER_COLLECTORS
        value: /etc/dcgm-exporter/dcp-metrics-included.csv
    image: dcgm-exporter
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/k8s
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    serviceMonitor:
      additionalLabels: {}
      enabled: true
      honorLabels: true
      interval: 15s
      relabelings: []
    version: 3.3.0-3.2.0-ubuntu22.04
  devicePlugin:
    config:
      name: time-slicing-config
    enabled: true
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v0.14.0
  driver:
    certConfig:
      name: ''
    enabled: true
    image: driver
    imagePullPolicy: IfNotPresent
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: true
    manager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'true'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
        - name: DRAIN_USE_FORCE
          value: 'false'
        - name: DRAIN_POD_SELECTOR_LABEL
          value: ''
        - name: DRAIN_TIMEOUT_SECONDS
          value: 0s
        - name: DRAIN_DELETE_EMPTYDIR_DATA
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.5
    rdma:
      enabled: false
      useHostMofed: false
    repoConfig:
      configMapName: ''
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    startupProbe:
      failureThreshold: 120
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 60
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    useNvidiaDriverCRD: false
    useOpenKernelModules: false
    usePrecompiled: false
    version: 550.54.14
    virtualTopology:
      config: ''
  gfd:
    enabled: true
    env:
      - name: GFD_SLEEP_INTERVAL
        value: 60s
      - name: GFD_FAIL_ON_INIT_ERROR
        value: 'true'
    image: gpu-feature-discovery
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v0.8.2-ubi8
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
      runtimeClasses:
        - artifacts:
            pullSecret: ''
            url: >-
              nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.54.03
          name: kata-qemu-nvidia-gpu
          nodeSelector: {}
        - artifacts:
            pullSecret: ''
            url: >-
              nvcr.io/nvidia/cloud-native/kata-gpu-artifacts:ubuntu22.04-535.86.10-snp
          name: kata-qemu-nvidia-gpu-snp
          nodeSelector:
            nvidia.com/cc.capable: 'true'
    enabled: false
    image: k8s-kata-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.1.2
  mig:
    strategy: single
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: false
    env:
      - name: WITH_REBOOT
        value: 'false'
    gpuClientsConfig:
      name: ''
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.0-ubuntu20.04
  nodeStatusExporter:
    enabled: false
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v23.9.2
  operator:
    defaultRuntime: containerd
    initContainer:
      image: cuda
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia
      version: 12.3.2-base-ubi8
    runtimeClass: nvidia
  psa:
    enabled: false
  psp:
    enabled: false
  sandboxDevicePlugin:
    image: kubevirt-gpu-device-plugin
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: v1.2.4
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    image: container-toolkit
    imagePullPolicy: IfNotPresent
    installDir: /usr/local/nvidia
    repository: nvcr.io/nvidia/k8s
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v1.14.6-ubuntu20.04
  validator:
    image: gpu-operator-validator
    imagePullPolicy: IfNotPresent
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
    repository: nvcr.io/nvidia/cloud-native
    resources:
      limits:
        cpu: 500m
        memory: 250Mi
      requests:
        cpu: 200m
        memory: 200Mi
    version: v23.9.2
  vfioManager:
    driverManager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'false'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.2
    enabled: false
    image: cuda
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia
    version: 12.3.2-base-ubi8
  vgpuDeviceManager:
    config:
      default: default
      name: ''
    enabled: false
    image: vgpu-device-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    version: v0.2.4
  vgpuManager:
    driverManager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: 'false'
        - name: ENABLE_AUTO_DRAIN
          value: 'false'
      image: k8s-driver-manager
      imagePullPolicy: IfNotPresent
      repository: nvcr.io/nvidia/cloud-native
      version: v0.6.4
    enabled: false
    image: vgpu-manager
    imagePullPolicy: IfNotPresent

Resulting DaemonSet (one of them, shown as an example):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-operator-validator
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-operator-validator
      app.kubernetes.io/part-of: gpu-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-operator-validator
        app.kubernetes.io/managed-by: gpu-operator
        app.kubernetes.io/part-of: gpu-operator
        helm.sh/chart: gpu-operator-v23.9.2
    spec:
      volumes:
        - name: run-nvidia-validations
          hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
        - name: driver-install-path
          hostPath:
            path: /run/nvidia/driver
            type: ''
        - name: host-root
          hostPath:
            path: /
            type: ''
        - name: host-dev-char
          hostPath:
            path: /dev/char
            type: ''
      initContainers:
        - name: driver-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: WITH_WAIT
              value: 'true'
            - name: COMPONENT
              value: driver
          resources: {}
          volumeMounts:
            - name: host-root
              readOnly: true
              mountPath: /host
              mountPropagation: HostToContainer
            - name: driver-install-path
              mountPath: /run/nvidia/driver
              mountPropagation: HostToContainer
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
            - name: host-dev-char
              mountPath: /host-dev-char
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
        - name: toolkit-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: WITH_WAIT
              value: 'false'
            - name: COMPONENT
              value: toolkit
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
        - name: cuda-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: WITH_WAIT
              value: 'false'
            - name: COMPONENT
              value: cuda
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: VALIDATOR_IMAGE
              value: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
            - name: VALIDATOR_IMAGE_PULL_POLICY
              value: IfNotPresent
            - name: VALIDATOR_RUNTIME_CLASS
              value: nvidia
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
        - name: plugin-validation
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - nvidia-validator
          env:
            - name: COMPONENT
              value: plugin
            - name: WITH_WAIT
              value: 'false'
            - name: WITH_WORKLOAD
              value: 'false'
            - name: MIG_STRATEGY
              value: single
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: OPERATOR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: VALIDATOR_IMAGE
              value: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
            - name: VALIDATOR_IMAGE_PULL_POLICY
              value: IfNotPresent
            - name: VALIDATOR_RUNTIME_CLASS
              value: nvidia
          resources: {}
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      containers:
        - name: nvidia-operator-validator
          image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
          command:
            - sh
            - '-c'
          args:
            - echo all validations are successful; sleep infinity
          resources:
            limits:
              cpu: 500m
              memory: 250Mi
            requests:
              cpu: 200m
              memory: 200Mi
          volumeMounts:
            - name: run-nvidia-validations
              mountPath: /run/nvidia/validations
              mountPropagation: Bidirectional
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - rm -f /run/nvidia/validations/*-ready
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.deploy.operator-validator: 'true'
      serviceAccountName: nvidia-operator-validator
      serviceAccount: nvidia-operator-validator
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      runtimeClassName: nvidia
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  revisionHistoryLimit: 10

Is this intended?
Even if it's not configurable, wouldn't it make more sense to simply omit the resources field and fall back to the default Kubernetes behaviour?

Thank you
