Description
Observed a Kubernetes workload deployment failure caused by excessive logging to the /run/containerd/io.containerd.runtime.v2.task/k8s.io//log.json file. This drives the /run tmpfs mount to 100% utilization, which prevents further container creation on the affected node.
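A quick way to confirm the diagnosis from a shell on an affected node (paths as in the report; the find pattern is illustrative):
# Confirm that the /run tmpfs is full
df -h /run
# Find the largest shim log.json files; each container gets its own
# directory under k8s.io/<container-id>/
find /run/containerd/io.containerd.runtime.v2.task/k8s.io/ \
  -name log.json -exec du -h {} + | sort -rh | head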
When a container spec uses an exec livenessProbe, the following log entries are written each time the probe runs:
{"level":"info","msg":"Running with config:\n{\n \"DisableRequire\": false,\n \"SwarmResource\": \"\",\n \"AcceptEnvvarUnprivileged\": true,\n \"AcceptDeviceListAsVolumeMounts\": false,\n \"SupportedDriverCapabilities\": \"compat32,compute,display,graphics,ngx,utility,video\",\n \"NVIDIAContainerCLIConfig\": {\n \"Root\": \"/run/nvidia/driver\",\n \"Path\": \"/usr/local/nvidia/toolkit/nvidia-container-cli\",\n \"Environment\": [],\n \"Debug\": \"\",\n \"Ldcache\": \"\",\n \"LoadKmods\": true,\n \"NoPivot\": false,\n \"NoCgroups\": false,\n \"User\": \"\",\n \"Ldconfig\": \"@/run/nvidia/driver/sbin/ldconfig.real\"\n },\n \"NVIDIACTKConfig\": {\n \"Path\": \"/usr/local/nvidia/toolkit/nvidia-ctk\"\n },\n \"NVIDIAContainerRuntimeConfig\": {\n \"DebugFilePath\": \"/dev/null\",\n \"LogLevel\": \"info\",\n \"Runtimes\": [\n \"docker-runc\",\n \"runc\"\n ],\n \"Mode\": \"cdi\",\n \"Modes\": {\n \"CSV\": {\n \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n },\n \"CDI\": {\n \"SpecDirs\": [\n \"/etc/cdi\",\n \"/var/run/cdi\"\n ],\n \"DefaultKind\": \"management.nvidia.com/gpu\",\n \"AnnotationPrefixes\": [\n \"nvidia.cdi.k8s.io/\"\n ]\n }\n }\n },\n \"NVIDIAContainerRuntimeHookConfig\": {\n \"Path\": \"/usr/local/nvidia/toolkit/nvidia-container-runtime-hook\",\n \"SkipModeDetection\": true\n }\n}","time":"2024-05-24T20:31:18+02:00"}
{"level":"info","msg":"Using low-level runtime /usr/sbin/runc","time":"2024-05-24T20:31:18+02:00"}A sample container spec:
spec:
  containers:
  - image: kubeflow/ml-pipeline/visualization-server:2.0.0-alpha.7
    livenessProbe:
      exec:
        command:
        - wget
        - -q
        - -S
        - -O
        - '-'
        - http://localhost:8888/
    name: ml-pipeline-visualizationserver
The log entries come from runtime.go (starting at line 75) and from runtime_low_level.go.
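To put a rough number on the growth (my own back-of-envelope, not measured from the report): the two entries above are about 1.7 KB combined, and with the default probe periodSeconds of 10 they are appended on every probe run, so each probed container adds roughly 15 MB per day to a log.json that, as far as I can tell, is never rotated. A quick way to observe this on a node:
# Back-of-envelope (assumed sizes): ~1.7 KB of log output per probe run,
# default periodSeconds=10 -> 8640 runs per day per container
echo $(( 1700 * 8640 )) bytes/day   # prints 14688000, i.e. ~14.7 MB/day
# Watch one container's log.json grow as the probe fires
# (<container-id> is a placeholder)
watch -n 10 'stat -c %s /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/log.json'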
IMHO, logging these entries at the DEBUG level should be fine; it would still allow easy debugging without affecting any functionality.
The workaround currently in use is to set log-level = "error" in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml (a sketch of applying this in place follows the config dump below).
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "cdi"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["nvidia.cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
I use the gpu-operator in the Kubernetes cluster; here is the runtime version info:
cd /usr/local/nvidia/toolkit
./nvidia-container-runtime --version
NVIDIA Container Runtime version 1.14.3
commit: 53b24618a542025b108239fe602e66e912b7d6e2
spec: 1.1.0-rc.2
runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.20.13
libseccomp: 2.5.4
Attempting to create /etc/nvidia-container-runtime/config.toml to override the log level did not work.
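A possible explanation, though this is an assumption on my part: the nvidia-container-runtime installed by the toolkit container may be a wrapper script that pins its own config directory (e.g. via XDG_CONFIG_HOME), in which case /etc/nvidia-container-runtime/config.toml is never consulted. One way to check:
# If the installed runtime is a wrapper script that sets XDG_CONFIG_HOME to
# /usr/local/nvidia/toolkit/.config, the /etc override is ignored by design.
# Both the variable name and the mechanism are assumptions to verify.
file /usr/local/nvidia/toolkit/nvidia-container-runtime
grep -n 'XDG_CONFIG_HOME' /usr/local/nvidia/toolkit/nvidia-container-runtime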