Merged
20 changes: 17 additions & 3 deletions gpu-operator/life-cycle-policy.rst
@@ -87,9 +87,10 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
:header-rows: 2

* - :rspan:`1` Component
- GPU Operator Version
- :cspan:`2` GPU Operator Version

* - v25.10.0
- v25.10.1

* - NVIDIA GPU Driver |ki|_
- | `580.95.05 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-95-05/index.html>`_ (**D**, **R**)
@@ -98,32 +99,44 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
| `570.195.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-195-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.274.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-274-02/index.html>`_
- | `580.105.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-105-08/index.html>`_ (**D**, **R**)
| `580.95.05 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-95-05/index.html>`_
| `580.82.07 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-82-07/index.html>`_
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.195.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-195-03/index.html>`_
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.274.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-274-02/index.html>`_


* - NVIDIA Driver Manager for Kubernetes
- `v0.9.0 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__
- `v0.9.1 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__

* - NVIDIA Container Toolkit
- `1.18.0 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__

* - NVIDIA Kubernetes Device Plugin
- `0.18.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.18.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

* - DCGM Exporter
- `v4.4.1-4.6.0 <https://github.com/NVIDIA/dcgm-exporter/releases>`__
- `v4.4.2-4.7.0 <https://github.com/NVIDIA/dcgm-exporter/releases>`__

* - Node Feature Discovery
- `v0.18.2 <https://github.com/kubernetes-sigs/node-feature-discovery/releases/>`__

* - | NVIDIA GPU Feature Discovery
| for Kubernetes
- `0.18.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.18.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

* - NVIDIA MIG Manager for Kubernetes
- `0.13.0 <https://github.com/NVIDIA/mig-parted/blob/main/CHANGELOG.md>`__
- `0.13.1 <https://github.com/NVIDIA/mig-parted/blob/main/CHANGELOG.md>`__

* - DCGM
- `4.4.1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__
- `4.4.2-1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__

* - Validator for NVIDIA GPU Operator
- v25.10.0
@@ -169,4 +182,5 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
version downloaded from the `NVIDIA Licensing Portal <https://ui.licensing.nvidia.com>`_.
- The GPU Operator is supported on all active NVIDIA data center production drivers.
Refer to `Supported Drivers and CUDA Toolkit Versions <https://docs.nvidia.com/datacenter/tesla/drivers/index.html#supported-drivers-and-cuda-toolkit-versions>`_
for more information.
for more information.

50 changes: 50 additions & 0 deletions gpu-operator/release-notes.rst
@@ -33,6 +33,56 @@ Refer to the :ref:`GPU Operator Component Matrix` for a list of software compone

----



.. _v25.10.1:

25.10.1
=======

New Features
------------

Review comment (Contributor): We should call out NVIDIA/gpu-operator#1894 as a new feature as well.

Reply (Collaborator Author): added in

* Updated software component versions:

- NVIDIA Container Toolkit v1.18.1
- NVIDIA DCGM v4.4.2-1
- NVIDIA DCGM Exporter v4.4.2-4.7.0
- NVIDIA Kubernetes Device Plugin v0.18.1
- NVIDIA GPU Feature Discovery v0.18.1
- NVIDIA MIG Manager for Kubernetes 0.13.1
- NVIDIA Driver Manager for Kubernetes v0.9.1

* Added support for this NVIDIA Data Center GPU Driver version:

- 580.105.08 (default)

* Added HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.

Configure HPC job mapping by setting the ``dcgmExporter.hpcJobMapping.enabled`` field to ``true`` in the ClusterPolicy custom resource.
Set ``dcgmExporter.hpcJobMapping.directory`` to the directory path where the workload manager creates HPC job mapping files.
The default directory is ``/var/lib/dcgm-exporter/job-mapping``.

* Improved the cluster policy reconciler to be more resilient to race conditions during node updates.
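
The HPC job mapping settings described above can be sketched as the following ClusterPolicy fragment. This is a minimal illustration rather than a complete manifest: it assumes the fields sit under ``spec`` and that the resource uses the default name ``cluster-policy``. The directory shown is the documented default.

.. code-block:: yaml

   # Sketch only: assumes the default ClusterPolicy name and that the
   # dcgmExporter fields live under spec.
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: cluster-policy
   spec:
     dcgmExporter:
       hpcJobMapping:
         enabled: true
         # Directory where the workload manager writes job mapping files.
         directory: /var/lib/dcgm-exporter/job-mapping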

Fixed Issues
------------

* Fixed the following known issues introduced in GPU Operator v25.10.0:

  * When using cri-o as the container runtime, several GPU Operator pods can be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during GPU Operator installation or upgrade, or during GPU driver daemonset upgrade.
  * NVIDIA Container Toolkit 1.18.0 overwrites the imports field in the top-level containerd configuration file, so any previously imported paths are lost.
    This was fixed in NVIDIA Container Toolkit v1.18.1.

* Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not being applied by the driver daemonset.
For more information, refer to `PR #1939 <https://github.com/NVIDIA/gpu-operator/pull/1939>`__.

* Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters.
For more information, refer to `Issue #1622 <https://github.com/NVIDIA/gpu-operator/issues/1622>`__.
* Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
* Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and ``USE_HOST_MOFED`` is set to ``true``, even though MOFED is pre-installed on the host.


.. _v25.10.0:

25.10.0
2 changes: 1 addition & 1 deletion repo.toml
@@ -168,7 +168,7 @@ docs_root = "${root}/gpu-operator"
project = "gpu-operator"
name = "NVIDIA GPU Operator"
version = "25.10" # Update repo_docs.projects.openshift.version to match latest patch version maj.min.patch
source_substitutions = { minor_version = "25.10", version = "v25.10.0", recommended = "580.95.05" }
source_substitutions = { minor_version = "25.10", version = "v25.10.1", recommended = "580.105.08" }
copyright_start = 2020
sphinx_exclude_patterns = [
"life-cycle-policy.rst",