Commit 4d468dc

Add gke known issue (#312)

* Add gke known issue

  Signed-off-by: Abigail McCarthy <[email protected]>

* Apply suggestions from code review

  Co-authored-by: Christopher Desiniotis <[email protected]>
  Signed-off-by: Abigail McCarthy <[email protected]>

---------

Signed-off-by: Abigail McCarthy <[email protected]>
Co-authored-by: Christopher Desiniotis <[email protected]>

1 parent 49b10b4 commit 4d468dc


2 files changed, 25 additions and 0 deletions


gpu-operator/google-gke.rst

Lines changed: 14 additions & 0 deletions
@@ -80,6 +80,20 @@ Prerequisites
 Refer to `GPU platforms <https://cloud.google.com/compute/docs/gpus>`_
 in the Google Cloud documentation.
 
+.. note::
+
+   On GKE 1.33 and later, a known issue causes NVIDIA Container Toolkit to misconfigure the containerd ``config.toml`` file, which prevents GPU Operator containers from starting correctly.
+
+   To work around this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.
+   You can set this environment variable by adding the following to the ClusterPolicy CR:
+
+   .. code-block:: yaml
+
+      toolkit:
+        env:
+        - name: RUNTIME_CONFIG_SOURCE
+          value: "file"
+
 
 *********************************
 Using the Google Driver Installer

gpu-operator/release-notes.rst

Lines changed: 11 additions & 0 deletions
@@ -188,6 +188,17 @@ Known Issues
 
 Create the ConfigMap, then update the ClusterPolicy with the name of the ConfigMap in the ``vgpuDeviceManager.config.name`` field, and restart the vgpu-device-manager pod.
 
+- On GKE 1.33 and later, a known issue causes NVIDIA Container Toolkit to misconfigure the containerd ``config.toml`` file, which prevents GPU Operator containers from starting correctly.
+  To work around this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.
+  You can set this environment variable by adding the following to the ClusterPolicy CR:
+
+  .. code-block:: yaml
+
+     toolkit:
+       env:
+       - name: RUNTIME_CONFIG_SOURCE
+         value: "file"
+
 .. _v25.3.4:
 
 25.3.4
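
For orientation, the `toolkit` fragment added in both files sits under `spec` in the ClusterPolicy resource. The following is a minimal sketch of that nesting, not a complete manifest. It assumes the `nvidia.com/v1` API version and the resource name `cluster-policy`, neither of which is stated in this commit, so check both against your cluster before applying anything.

```yaml
# Sketch only: shows where the workaround from this commit is nested.
# apiVersion and metadata.name are assumptions, not taken from the commit.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy  # assumed name; confirm with `kubectl get clusterpolicy`
spec:
  toolkit:
    env:
    # Environment variable named in the known-issue text above.
    - name: RUNTIME_CONFIG_SOURCE
      value: "file"
```

All other `spec` fields are omitted here and should be left as they are in the existing resource.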
