gpu-operator/release-notes.rst (+20 −4: 20 additions and 4 deletions)
@@ -57,14 +57,30 @@ New Features
- 580.105.08 (default)
+
+* Add HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.
+
+  Configure HPC job mapping by setting the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
+  Set ``dcgmExporter.hpcJobMapping.directory`` to the directory path where the workload manager creates the HPC job mapping files.
+  The default directory is ``/var/lib/dcgm-exporter/job-mapping``.
+
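For reference, the two fields described in the note above could be set in the ClusterPolicy like this. This is a minimal sketch: only the ``dcgmExporter.hpcJobMapping`` fields come from the release note; the surrounding resource skeleton (metadata name, other spec fields) is assumed.

```yaml
# Sketch only: the hpcJobMapping fields are from the release note;
# the surrounding ClusterPolicy skeleton is assumed.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    hpcJobMapping:
      enabled: true
      # Directory where the workload manager writes the job mapping files;
      # this value is the documented default.
      directory: /var/lib/dcgm-exporter/job-mapping
```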
+* Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
+
Fixed Issues
------------
-* Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
-* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
-* Fixed a bug where the ClusterPolicy reconciler would fail when it attempted to update node labels on a cluster.
-* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled despite the MOFED being pre-installed on the host.
+* Fixed the following known issues introduced in GPU Operator v25.10.0:
+
+  * When using cri-o as the container runtime, several GPU Operator pods can become stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during a GPU driver daemonset upgrade.
+  * NVIDIA Container Toolkit 1.18.0 overwrites the ``imports`` field in the top-level containerd configuration file, so any previously imported paths are lost.
+    This was fixed in NVIDIA Container Toolkit v1.18.1.
+
+* Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not applied by the driver daemonset.
+  For more information, refer to `PR #1939 <https://github.com/NVIDIA/gpu-operator/pull/1939>`__.
+
+* Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
+  For more information, refer to `Issue #1622 <https://github.com/NVIDIA/gpu-operator/issues/1622>`__.
+
+* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
+
+* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to ``true``, even though MOFED is pre-installed on the host.