Commit 605ca64

add in known issues and component updates
Signed-off-by: Abigail McCarthy <[email protected]>
1 parent a64fbe5 commit 605ca64

2 files changed: +21 -5 lines changed

gpu-operator/life-cycle-policy.rst

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.

  * - | NVIDIA GPU Feature Discovery
  | for Kubernetes
- - `0.18.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
+ - `0.18.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

  * - NVIDIA MIG Manager for Kubernetes
  - `0.13.0 <https://github.com/NVIDIA/mig-parted/blob/main/CHANGELOG.md>`__

gpu-operator/release-notes.rst

Lines changed: 20 additions & 4 deletions
@@ -57,14 +57,30 @@ New Features

  - 580.105.08 (default)

+ * Add HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.
+
+ Configure the HPC job mapping by setting the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
+ Set ``dcgmExporter.hpcJobMapping.directory`` with the directory path where HPC job mapping files are created by the workload manager.
+ The default directory is ``/var/lib/dcgm-exporter/job-mapping``.
+
+ * Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
+
  Fixed Issues
  ------------

- * Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
- * Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
- * Fixed a bug where the ClusterPolicy reconciler would fail when it attempted to update node labels on a cluster.
- * Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled despite the MOFED being pre-installed on the host.
+ * Fixed the following known issue introduced in GPU Operator v25.10.0:

+ * When using cri-o as the container runtime, several GPU Operator pods can be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during GPU driver daemonset upgrade.
+ * NVIDIA Container Toolkit 1.18.0 overwrites the imports field in the top-level containerd configuration file, so any previously imported paths are lost.
+ This was fixed in NVIDIA Container Toolkit v1.18.1.
+
+ * Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not being applied by the driver daemonset.
+ For more information, refer to `PR #1939 <https://github.com/NVIDIA/gpu-operator/pull/1939>`__.
+
+ * Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
+ For more information, refer to `Issue #1622 <https://github.com/NVIDIA/gpu-operator/issues/1622>`__.
+ * Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
+ * Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to true despite the MOFED being pre-installed on the host.


  .. _v25.10.0:
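
The HPC job mapping note added in this commit names two ClusterPolicy fields, ``dcgmExporter.hpcJobMapping.enabled`` and ``dcgmExporter.hpcJobMapping.directory``. As a rough illustration only, the sketch below builds a JSON merge patch that sets both. It assumes the fields sit under the resource's ``spec`` key, which the release note does not state explicitly; the helper name ``hpc_job_mapping_patch`` is hypothetical and not part of any NVIDIA tooling.

```python
import json

# Default directory quoted in the release note.
DEFAULT_JOB_MAPPING_DIR = "/var/lib/dcgm-exporter/job-mapping"


def hpc_job_mapping_patch(enabled=True, directory=DEFAULT_JOB_MAPPING_DIR):
    """Return a merge-patch body that sets the two hpcJobMapping fields.

    Assumption: the fields are nested under ``spec`` in the ClusterPolicy
    custom resource. Verify against the CRD installed in your cluster.
    """
    return {
        "spec": {
            "dcgmExporter": {
                "hpcJobMapping": {
                    "enabled": enabled,
                    "directory": directory,
                }
            }
        }
    }


# Serialize the patch so it can be handed to a Kubernetes client.
print(json.dumps(hpc_job_mapping_patch(), indent=2))
```

The printed JSON could then be applied with something like ``kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge -p "$PATCH"``, where ``cluster-policy`` is assumed to be the name of the ClusterPolicy instance in your cluster.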
