diff --git a/gpu-operator/life-cycle-policy.rst b/gpu-operator/life-cycle-policy.rst
index 00ab3de0a..a65f08c93 100644
--- a/gpu-operator/life-cycle-policy.rst
+++ b/gpu-operator/life-cycle-policy.rst
@@ -87,9 +87,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
    :header-rows: 2
 
    * - :rspan:`1` Component
-     - GPU Operator Version
+     - :cspan:`2` GPU Operator Version
 
    * - v25.10.0
+     - v25.10.1
 
    * - NVIDIA GPU Driver |ki|_
      - | `580.95.05 `_ (**D**, **R**)
@@ -98,32 +99,44 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
        | `570.195.03 `_
        | `550.163.01 `_
        | `535.274.02 `_
+     - | `580.105.08 `_ (**D**, **R**)
+       | `580.95.05 `_
+       | `580.82.07 `_
+       | `575.57.08 `_
+       | `570.195.03 `_
+       | `550.163.01 `_
+       | `535.274.02 `_
 
    * - NVIDIA Driver Manager for Kubernetes
      - `v0.9.0 `__
+     - `v0.9.1 `__
 
    * - NVIDIA Container Toolkit
      - `1.18.0 `__
 
    * - NVIDIA Kubernetes Device Plugin
      - `0.18.0 `__
+     - `0.18.1 `__
 
    * - DCGM Exporter
      - `v4.4.1-4.6.0 `__
+     - `v4.4.2-4.7.0 `__
 
    * - Node Feature Discovery
      - `v0.18.2 `__
 
    * - | NVIDIA GPU Feature Discovery
        | for Kubernetes
-     - `0.18.0 `__
+     - `0.18.1 `__
 
    * - NVIDIA MIG Manager for Kubernetes
      - `0.13.0 `__
+     - `0.13.1 `__
 
    * - DCGM
      - `4.4.1 `__
+     - `4.4.2-1 `__
 
    * - Validator for NVIDIA GPU Operator
      - v25.10.0
@@ -169,4 +182,5 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
   version downloaded from the `NVIDIA Licensing Portal `_.
 - The GPU Operator is supported on all active NVIDIA data center production drivers.
   Refer to `Supported Drivers and CUDA Toolkit Versions `_
-  for more information.
\ No newline at end of file
+  for more information.
+
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index f879d2464..bd4634d96 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -33,6 +33,56 @@ Refer to the :ref:`GPU Operator Component Matrix` for a list of software compone
 
 ----
 
+
+
+.. _v25.10.1:
+
+25.10.1
+=======
+
+New Features
+------------
+
+* Updated software component versions:
+
+  - NVIDIA Container Toolkit v1.18.1
+  - NVIDIA DCGM v4.4.2-1
+  - NVIDIA DCGM Exporter v4.4.2-4.7.0
+  - NVIDIA Kubernetes Device Plugin v0.18.1
+  - NVIDIA GPU Feature Discovery v0.18.1
+  - NVIDIA MIG Manager for Kubernetes 0.13.1
+  - NVIDIA Driver Manager for Kubernetes v0.9.1
+
+* Added support for this NVIDIA Data Center GPU Driver version:
+
+  - 580.105.08 (default)
+
+* Added HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.
+
+  To configure HPC job mapping, set the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
+  Set ``dcgmExporter.hpcJobMapping.directory`` to the directory path where the workload manager creates HPC job mapping files.
+  The default directory is ``/var/lib/dcgm-exporter/job-mapping``.
+
+* Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
+
+Fixed Issues
+------------
+
+* Fixed the following known issues introduced in GPU Operator v25.10.0:
+
+  * When using cri-o as the container runtime, several GPU Operator pods can be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during GPU driver daemonset upgrade.
+  * NVIDIA Container Toolkit 1.18.0 overwrites the imports field in the top-level containerd configuration file, so any previously imported paths are lost.
+    This was fixed in NVIDIA Container Toolkit v1.18.1.
+
+* Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not applied by the driver daemonset.
+  For more information, refer to `PR #1939 `__.
+
+* Fixed a bug where driver images were incorrectly assigned in multi-nodepool clusters.
+  For more information, refer to `Issue #1622 `__.
+* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
+* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to ``true``, even though MOFED is pre-installed on the host.
+
+
 .. _v25.10.0:
 
 25.10.0
diff --git a/repo.toml b/repo.toml
index 916b60fac..3f042018b 100644
--- a/repo.toml
+++ b/repo.toml
@@ -168,7 +168,7 @@ docs_root = "${root}/gpu-operator"
 project = "gpu-operator"
 name = "NVIDIA GPU Operator"
 version = "25.10" # Update repo_docs.projects.openshift.version to match latest patch version maj.min.patch
-source_substitutions = { minor_version = "25.10", version = "v25.10.0", recommended = "580.95.05" }
+source_substitutions = { minor_version = "25.10", version = "v25.10.1", recommended = "580.105.08" }
 copyright_start = 2020
 sphinx_exclude_patterns = [
     "life-cycle-policy.rst",