Update mgpu sha and gate fusion doc (#3344)

1tnguyen · web-flow · commit 39cd95128554 · 2025-08-14T16:55:50.000+10:00
* Update mgpu sha and gate fusion doc

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

* Spell check fixes

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>

---------

Signed-off-by: Thien Nguyen <thiennguyen@nvidia.com>
diff --git a/.github/workflows/config/gitlab_commits.txt b/.github/workflows/config/gitlab_commits.txt
@@ -1,2 +1,2 @@
 nvidia-mgpu-repo: cuda-quantum/cuquantum-mgpu.git
-nvidia-mgpu-commit: bfccb143f12b42be129ed2fbf16c39428eaba7b7
+nvidia-mgpu-commit: 5f1033f9efbe952633e567d676a64a237cb43ba7
diff --git a/docs/sphinx/using/backends/sims/svsims.rst b/docs/sphinx/using/backends/sims/svsims.rst
@@ -107,7 +107,7 @@ It is worth drawing attention to gate fusion, a powerful tool for improving simu
     - Description
   * - ``CUDAQ_FUSION_MAX_QUBITS``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator as specified :ref:`here <gate-fusion-table>`.
   * - ``CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS``
     - integer greater than or equal to -1
     - The max number of qubits used for diagonal gate fusion. The default value is set to `-1` and the fusion size will be automatically adjusted for the better performance. If 0, the gate fusion for diagonal gates is disabled.
@@ -249,7 +249,7 @@ the multi-node multi-GPU configuration. Any environment variables must be set pr
     - The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25. 
   * - ``CUDAQ_MGPU_FUSE``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator as specified :ref:`here <gate-fusion-table>`. 
   * - ``CUDAQ_MGPU_P2P_DEVICE_BITS``
     - positive integer
     - Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled).
@@ -270,6 +270,35 @@ the multi-node multi-GPU configuration. Any environment variables must be set pr
     The :code:`nvidia-mgpu` backend, which is equivalent to the multi-node multi-GPU double-precision option (`mgpu,fp64`) of the :code:`nvidia`
     is deprecated and will be removed in a future release.
 
+.. |:spellcheck-disable:| replace:: \
+.. |:spellcheck-enable:| replace:: \
+
+
+.. _gate-fusion-table:
+
+.. list-table:: **Default Gate Fusion Size**
+  :widths: 20 30 50
+
+  * - Compute Capability
+    - GPU 
+    - Default Gate Fusion Size
+  * - 8.0
+    - NVIDIA A100
+    - 4 (`fp32`) or 5 (`fp64`)
+  * - 9.0
+    - NVIDIA H100, H200, |:spellcheck-disable:| GH200 |:spellcheck-enable:| 
+    - 5 (`fp32`) or 6 (`fp64`)
+  * - 10.0
+    - NVIDIA GB200, B200
+    - 5 (`fp32`) or 4 (`fp64`)
+  * - 10.3
+    - NVIDIA B300
+    - 5 (`fp32`) or 1 (`fp64`)
+  * - Others
+    - 
+    - 4 (`fp32` and `fp64`)
+
+
 The above configuration options of the :code:`nvidia` backend 
 can be tuned to reduce your simulation runtimes. One of the
 performance improvements is to fuse multiple gates together during runtime. For
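
For readers who want to check the new ``gate-fusion-table`` against their own hardware, the table added in this diff can be written as a small lookup. This is only a sketch: the helper name `default_fusion_size` and the dictionary layout are hypothetical and are not part of CUDA-Q; the values are copied from the table rows above, and the real defaults are chosen inside the simulator.

```python
# Hypothetical helper mirroring the "Default Gate Fusion Size" table in this diff.
# Keys are (major, minor) compute capability; values are per-precision defaults.
DEFAULT_FUSION_SIZE = {
    (8, 0): {"fp32": 4, "fp64": 5},   # NVIDIA A100
    (9, 0): {"fp32": 5, "fp64": 6},   # NVIDIA H100, H200, GH200
    (10, 0): {"fp32": 5, "fp64": 4},  # NVIDIA GB200, B200
    (10, 3): {"fp32": 5, "fp64": 1},  # NVIDIA B300
}

# The "Others" row: 4 for both precision modes.
FALLBACK = {"fp32": 4, "fp64": 4}


def default_fusion_size(compute_capability, precision):
    """Return the documented default fusion size for a (major, minor) CC."""
    if precision not in FALLBACK:
        raise ValueError(f"unknown precision: {precision!r}")
    return DEFAULT_FUSION_SIZE.get(tuple(compute_capability), FALLBACK)[precision]
```

Setting ``CUDAQ_FUSION_MAX_QUBITS`` (or ``CUDAQ_MGPU_FUSE`` for the multi-GPU option) overrides these defaults, so a lookup like this is only useful for predicting what an unconfigured run would use.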
