
Excessive runtime logging could cause Kubernetes workload deployment failure #511

@weistonedawei

Description

Observed a Kubernetes workload deployment failure caused by excessive logging to the /run/containerd/io.containerd.runtime.v2.task/k8s.io//log.json file. This causes the /run tmpfs mount to reach 100% utilization, which prevents further container creation on the affected node.
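
For reference, a quick way to confirm the symptom on an affected node is to check the /run mount and look for oversized shim logs. This is only a sketch: the path follows the containerd v2 task layout mentioned above, and the 1M size threshold is arbitrary.

df -h /run
# list per-container shim logs that have grown unusually large
sudo find /run/containerd/io.containerd.runtime.v2.task/k8s.io/ \
  -name log.json -size +1M -exec ls -lh {} \;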

When a container spec uses an exec livenessProbe, the following log entries are written each time the probe runs:

{"level":"info","msg":"Running with config:\n{\n  \"DisableRequire\": false,\n  \"SwarmResource\": \"\",\n  \"AcceptEnvvarUnprivileged\": true,\n  \"AcceptDeviceListAsVolumeMounts\": false,\n  \"SupportedDriverCapabilities\": \"compat32,compute,display,graphics,ngx,utility,video\",\n  \"NVIDIAContainerCLIConfig\": {\n    \"Root\": \"/run/nvidia/driver\",\n    \"Path\": \"/usr/local/nvidia/toolkit/nvidia-container-cli\",\n    \"Environment\": [],\n    \"Debug\": \"\",\n    \"Ldcache\": \"\",\n    \"LoadKmods\": true,\n    \"NoPivot\": false,\n    \"NoCgroups\": false,\n    \"User\": \"\",\n    \"Ldconfig\": \"@/run/nvidia/driver/sbin/ldconfig.real\"\n  },\n  \"NVIDIACTKConfig\": {\n    \"Path\": \"/usr/local/nvidia/toolkit/nvidia-ctk\"\n  },\n  \"NVIDIAContainerRuntimeConfig\": {\n    \"DebugFilePath\": \"/dev/null\",\n    \"LogLevel\": \"info\",\n    \"Runtimes\": [\n      \"docker-runc\",\n      \"runc\"\n    ],\n    \"Mode\": \"cdi\",\n    \"Modes\": {\n      \"CSV\": {\n        \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n      },\n      \"CDI\": {\n        \"SpecDirs\": [\n          \"/etc/cdi\",\n          \"/var/run/cdi\"\n        ],\n        \"DefaultKind\": \"management.nvidia.com/gpu\",\n        \"AnnotationPrefixes\": [\n          \"nvidia.cdi.k8s.io/\"\n        ]\n      }\n    }\n  },\n  \"NVIDIAContainerRuntimeHookConfig\": {\n    \"Path\": \"/usr/local/nvidia/toolkit/nvidia-container-runtime-hook\",\n    \"SkipModeDetection\": true\n  }\n}","time":"2024-05-24T20:31:18+02:00"}
{"level":"info","msg":"Using low-level runtime /usr/sbin/runc","time":"2024-05-24T20:31:18+02:00"}

A sample container spec:

spec:
  containers:
  - image: kubeflow/ml-pipeline/visualization-server:2.0.0-alpha.7
    livenessProbe:
      exec:
        command:
        - wget
        - -q
        - -S
        - -O
        - '-'
        - http://localhost:8888/
    name: ml-pipeline-visualizationserver

The log entries come from runtime.go (starting at line 75) and from runtime_low_level.go.

IMHO, logging these messages at DEBUG level should be fine; it would still allow easy debugging without affecting functionality.

The current workaround is to set log-level = "error" in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml. For reference, the config file as deployed (before applying the workaround):


accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "cdi"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.cdi]
      annotation-prefixes = ["nvidia.cdi.k8s.io/"]
      default-kind = "management.nvidia.com/gpu"
      spec-dirs = ["/etc/cdi", "/var/run/cdi"]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
  path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
  skip-mode-detection = true

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"

I use the gpu-operator in the Kubernetes cluster; here is the runtime version info:

cd /usr/local/nvidia/toolkit

./nvidia-container-runtime --version
NVIDIA Container Runtime version 1.14.3
commit: 53b24618a542025b108239fe602e66e912b7d6e2
spec: 1.1.0-rc.2

runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.20.13
libseccomp: 2.5.4

Attempting to override the log level by creating /etc/nvidia-container-runtime/config.toml did not work.
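
For completeness, the attempted override presumably looked something like the following (a sketch, not the exact file used). Since the gpu-operator-installed runtime reads its config from /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml (as noted above), it appears to ignore the file under /etc:

# hypothetical contents of the attempted /etc override, which had no effect here
cat <<'EOF' | sudo tee /etc/nvidia-container-runtime/config.toml
[nvidia-container-runtime]
log-level = "error"
EOF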

Labels

lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
