
NVFP4 dequantization#505

Open
aris134 wants to merge 1 commit into dev from amartin/nvfp4-dequant

Conversation


@aris134 aris134 commented Mar 25, 2026

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15998

Enable NVFP4 dequantization on AMD GPUs (gfx950) and add a unit test.

Type of change

  • Documentation change (documentation-only change, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Enable compilation of NVFP4 dequantization kernel for AMD GPU
  • Add unit test that verifies NVFP4 dequantization works on gfx950

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@aris134 aris134 self-assigned this Mar 26, 2026
@aris134 aris134 marked this pull request as ready for review March 26, 2026 13:16
ASSERT_EQ(err, hipSuccess) << hipGetErrorString(err);

const float amax = 1.0f;
input.set_tensor_amax(amax);
Contributor

set_scale() instead?

Collaborator

Yeah, I think for dequantization, the scale is needed

Author

@aris134 aris134 Apr 2, 2026

This leads to a memory-fault runtime error, whereas the current method (set_tensor_amax) works fine. Leaving it as is for now.

Contributor

Please double-check: quantization does not need amax, so dequantization should not need it either.

Collaborator

@ipanfilo ipanfilo left a comment

This is based on PR #472. To avoid reviewing the same changes twice, let's wait for that PR to merge first.

@aris134 aris134 force-pushed the amartin/nvfp4-dequant branch from 2682291 to 1d0a70e Compare April 2, 2026 15:12
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label Apr 2, 2026
Comment on lines +160 to +178
#ifdef __HIP_PLATFORM_AMD__
static __device__ constexpr uint64_t WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
0x0101010101010101ULL, 0x0202020202020202ULL,
0x0404040404040404ULL, 0x0808080808080808ULL,
0x1010101010101010ULL, 0x2020202020202020ULL,
0x4040404040404040ULL, 0x8080808080808080ULL};
#else
static __device__ constexpr unsigned int WARP_REDUCE_AMAX_GROUP_MASKS[8] = {
0x01010101, 0x02020202, 0x04040404, 0x08080808, 0x10101010, 0x20202020, 0x40404040, 0x80808080};
#endif

// max for every group_size elements in warp
template <int group_size, int shfl_down_stride>
__device__ __forceinline__ float groupMax(float val,
#ifdef __HIP_PLATFORM_AMD__
                                          uint64_t groupMask) {
#else
                                          unsigned int groupMask) {
#endif
Contributor

I think the changes in this file are due to a merge error and should not be necessary.
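As a side note on the mask tables themselves: the 64-bit HIP masks extend the 32-bit CUDA pattern to a 64-lane wavefront, with each mask selecting every 8th lane. A small host-side sketch (helper names are mine) that checks this interpretation:

```cpp
#include <cassert>
#include <cstdint>

// Group masks for AMD wavefront-64, copied from the PR diff; on CUDA the
// same byte pattern is truncated to 32 bits for a 32-lane warp.
constexpr uint64_t kGroupMasks[8] = {
    0x0101010101010101ULL, 0x0202020202020202ULL,
    0x0404040404040404ULL, 0x0808080808080808ULL,
    0x1010101010101010ULL, 0x2020202020202020ULL,
    0x4040404040404040ULL, 0x8080808080808080ULL};

// Lane l participates in group g exactly when l % 8 == g, i.e. the eight
// groups are strided across the wavefront with stride 8.
inline bool lane_in_group(unsigned lane, unsigned group) {
  return (kGroupMasks[group] >> lane) & 1u;
}
```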

Comment on lines +34 to +36
size_t divide_round_up(size_t x, size_t y) {
return (x + y - 1) / y;
}
Contributor

Isn't this part of test_common.h?

Comment on lines +56 to +65
const uint8_t bits = static_cast<uint8_t>(dis(gen));

fp8e4m3 candidate;
std::memcpy(&candidate, &bits, sizeof(bits));

const float decoded = static_cast<float>(candidate);
if (std::isfinite(decoded)) {
scale_buffer[idx] = candidate;
break;
}
Collaborator

This section of code generating a valid fp8e4m3 is reused for the 2D scales as well; let's consolidate it to avoid maintaining duplicated copies.

for (size_t block = 0; block < mathematical_blocks_per_row; ++block) {
const size_t idx = row * physical_row_stride + block;

while (true) {
Collaborator

By the way, is there a way to generate fp8e4m3 without a while-loop retry? I understand that with repeated tries we will eventually find a finite fp8e4m3, but it wastes random draws and execution time. Fp8e4m3 is well documented; perhaps we can work out which bit patterns give finite values?
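For what it's worth, a rejection-free variant seems possible. In the OCP FP8 E4M3 ("e4m3fn") encoding there are no infinities; the only non-finite bit patterns are the two NaNs, 0x7F and 0xFF (exponent and mantissa all ones). That leaves 254 finite codes, which can be sampled directly. A sketch (helper name is mine):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draw a uniformly random finite fp8e4m3 bit pattern without retrying.
// E4M3 has 256 codes; only 0x7F and 0xFF are NaN, so 254 are finite.
inline uint8_t random_finite_e4m3_bits(std::mt19937 &gen) {
  std::uniform_int_distribution<int> dis(0, 253);  // 254 finite codes
  int v = dis(gen);
  if (v >= 0x7F) ++v;  // skip 0x7F; 0xFF falls outside the shifted range
  return static_cast<uint8_t>(v);
}
```

This draws exactly one random number per scale instead of looping on std::isfinite().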

std::memcpy(&candidate, &bits, sizeof(bits));

const float decoded = static_cast<float>(candidate);
if (std::isfinite(decoded)) {
Collaborator

Scales also need to be non-negative, right?

Comment on lines +307 to +312
generate_1d_scales(host_scales_rowwise_1d.get(),
unpadded_blocks_Y,
unpadded_blocks_X,
scales_stride,
gen,
fp8_dis);
Collaborator

alignment?

Comment on lines +315 to +320
generate_1d_scales(host_scales_colwise_1d.get(),
unpadded_blocks_Y_t,
unpadded_blocks_X_t,
scales_stride_t,
gen,
fp8_dis);
Collaborator

alignment?

Comment on lines +43 to +45
const size_t mathematical_rows,
const size_t mathematical_blocks_per_row,
const size_t physical_row_stride,
Collaborator

You can reuse the existing names (unpadded_blocks_Y, unpadded_blocks_X, and scales_stride)

}

// Decode a single FP4 (E2M1) value from packed storage.
float get_fp4_value(const fp4e2m1* data, const size_t mathematical_idx) {
Collaborator

Only the scales have the padding/alignment distinction (mathematical/unpadded index vs. padded index). I recall that for rowwise/columnwise data we don't have this padding issue? If so, we can rename mathematical_idx -> idx.
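For reference, a minimal host-side sketch of what an E2M1 decode like get_fp4_value does. The eight E2M1 magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6} with bit 3 as the sign; two values are packed per byte. Taking the low nibble for even indices is an assumption about the packing order:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// FP4 E2M1 lookup table: sign bit (bit 3) followed by 2 exponent bits and
// 1 mantissa bit, giving magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
static constexpr float kE2M1Lut[16] = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f};

// Decode the idx-th FP4 value from a packed byte array (two values per
// byte; low-nibble-first order is an assumption here).
inline float decode_fp4(const uint8_t *packed, size_t idx) {
  const uint8_t byte = packed[idx / 2];
  const uint8_t nibble = (idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
  return kE2M1Lut[nibble];
}
```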

Comment on lines +311 to +316
float *amax_gpu = nullptr;
NVTE_CHECK_CUDA(cudaMalloc(&amax_gpu, sizeof(float)));
NVTE_CHECK_CUDA(cudaMemcpy(amax_gpu, amax_cpu_data_.get(),
sizeof(float), cudaMemcpyHostToDevice));

tensor_.set_amax(amax_gpu, DType::kFloat32, tensor_.defaultShape);
Collaborator

use from_cpu()

void Tensor::from_cpu() const {

tensor_.set_amax(nullptr, DType::kFloat32, tensor_.defaultShape);
}

void set_tensor_amax(float amax) {
Collaborator

Guard our ROCm-specific changes with a macro.

constexpr size_t scale_tensor_alignment_X_colwise = 128;
#endif

static constexpr float E2M1_LUT[16] = {
Collaborator

ROCm-specific contents.
