gpu-operator/release-notes.rst (+20 −4: 20 additions and 4 deletions)
@@ -57,14 +57,30 @@ New Features
- 580.105.08 (default)
+
+* Add HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.
+
+  Configure HPC job mapping by setting the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
+  Set ``dcgmExporter.hpcJobMapping.directory`` to the directory path where the workload manager creates the HPC job mapping files.
+  The default directory is ``/var/lib/dcgm-exporter/job-mapping``.
+
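For reference, the two fields described in the note above could be set in the ClusterPolicy like this. This is a minimal sketch: only the ``dcgmExporter.hpcJobMapping`` fields come from the release note; the surrounding resource skeleton (metadata name, other spec fields) is assumed.

```yaml
# Sketch only: the hpcJobMapping fields are from the release note;
# the surrounding ClusterPolicy skeleton is assumed.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    hpcJobMapping:
      enabled: true
      # Directory where the workload manager writes the job mapping files;
      # this value is the documented default.
      directory: /var/lib/dcgm-exporter/job-mapping
```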
+* Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
+
Fixed Issues
------------
-* Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
-* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
-* Fixed a bug where the ClusterPolicy reconciler would fail when it attempted to update node labels on a cluster.
-* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled despite the MOFED being pre-installed on the host.
+* Fixed the following known issues introduced in GPU Operator v25.10.0:
+
+  * When using cri-o as the container runtime, several GPU Operator pods can become stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during a GPU driver daemonset upgrade.
+  * NVIDIA Container Toolkit 1.18.0 overwrites the ``imports`` field in the top-level containerd configuration file, so any previously imported paths are lost.
+    This was fixed in NVIDIA Container Toolkit v1.18.1.
+
+* Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not applied by the driver daemonset.
+  For more information, refer to `PR #1939 <https://github.com/NVIDIA/gpu-operator/pull/1939>`__.
+
+* Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
+  For more information, refer to `Issue #1622 <https://github.com/NVIDIA/gpu-operator/issues/1622>`__.
+
+* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
+
+* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to ``true``, even though MOFED is pre-installed on the host.