Ct opt #519 (Open)

alextmagro wants to merge 2 commits into dev from ct_opt

Conversation

@alextmagro (Contributor)

Improvements to cast_transpose and cast for FP8 delayed scaling

Introduced ROCm-specific cast and cast+transpose functions tuned for MI350s and MI300s.

For memory-bound kernels:

  • Cast only: 2.85x speedup on average
  • Cast+transpose: 2.0x speedup on average

This PR contains benchmarking scripts, so it was branched off of #507.

@alextmagro (Contributor, PR author)

Claude results breakdown and process analysis

Hardware

  • AMD Instinct MI355X (CDNA 4, gfx950, 256 CUs, 288 GB HBM3E)
  • Peak HBM bandwidth: 7.276 TiB/s

Bandwidth Calculation

| Path | FP32 bytes/elem | BF16 bytes/elem |
|---|---|---|
| Cast-only | 5 (4r + 1w) | 3 (2r + 1w) |
| Cast+transpose | 6 (4r + 1w + 1w) | 4 (2r + 1w + 1w) |
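As a sanity check on these byte counts, the %Peak figures in the tables below follow from simple arithmetic. A minimal C++ sketch (the helper name `achieved_tib_s` is illustrative; values are taken from the FP32->FP8 cast-only table, 16384x4096 row):

```cpp
// Achieved bandwidth in TiB/s, given element count, bytes moved per
// element (from the table above), and kernel time in microseconds.
double achieved_tib_s(double elems, double bytes_per_elem, double time_us) {
    const double bytes = elems * bytes_per_elem;
    return bytes / (time_us * 1e-6) / (1024.0 * 1024.0 * 1024.0 * 1024.0);
}

// Example: FP32->FP8 cast-only, 16384x4096, optimized time 49.1 us:
//   achieved_tib_s(16384.0 * 4096.0, 5.0, 49.1)  ~= 6.22 TiB/s,
// i.e. roughly 85.4% of the 7.276 TiB/s peak, matching the table.
```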

Cast-Only

FP32->FP8 GPT-OSS

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 2880x2880 | 29.8 us | 17.4% | 11.0 us | 47.1% | 2.71x |
| 5120x2880 | 49.3 us | 18.7% | 17.2 us | 53.6% | 2.87x |
| 5760x2880 | 53.6 us | 19.4% | 18.2 us | 57.0% | 2.95x |
| 16384x2880 | 140.0 us | 21.0% | 37.1 us | 79.6% | 3.77x |
| 2880x4096 | 40.4 us | 18.3% | 15.3 us | 48.1% | 2.64x |
| 16384x4096 | 199.0 us | 21.1% | 49.1 us | 85.5% | 4.05x |
| 16384x5120 | 247.0 us | 21.3% | 75.0 us | 69.9% | 3.29x |

Arithmetic mean speedup: 3.18x | Geometric mean: 3.14x | Weighted (by elems): 3.50x
Optimized BW — Arithmetic: 4.58 TiB/s (63.0%) | Peak: 6.22 TiB/s (85.5%)

FP32->FP8 LLM (Llama/Qwen)

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 1024x3584 | 16.5 us | 13.9% | 9.0 us | 25.4% | 1.83x |
| 1024x4096 | 18.1 us | 14.5% | 9.3 us | 28.2% | 1.94x |
| 2048x4096 | 30.2 us | 17.4% | 11.2 us | 46.9% | 2.70x |
| 4096x4096 | 54.9 us | 19.1% | 18.6 us | 56.5% | 2.95x |
| 1024x8192 | 30.5 us | 17.2% | 11.3 us | 46.6% | 2.70x |
| 2048x8192 | 54.0 us | 19.4% | 18.3 us | 57.3% | 2.95x |
| 4096x8192 | 102.0 us | 20.6% | 28.4 us | 73.7% | 3.59x |
| 8192x8192 | 199.0 us | 21.1% | 49.1 us | 85.5% | 4.05x |
| 16384x8192 | 391.0 us | 21.5% | 119.0 us | 70.6% | 3.29x |
| 32768x8192 | 775.0 us | 21.6% | 235.0 us | 71.3% | 3.30x |
| 1024x14336 | 47.9 us | 19.1% | 17.3 us | 52.9% | 2.77x |
| 2048x14336 | 90.8 us | 20.2% | 26.1 us | 70.2% | 3.48x |
| 4096x14336 | 176.0 us | 20.9% | 43.8 us | 83.8% | 4.02x |
| 4096x16384 | 198.0 us | 21.1% | 49.0 us | 85.6% | 4.04x |
| 8192x16384 | 391.0 us | 21.5% | 119.0 us | 70.5% | 3.29x |
| 16384x16384 | 775.0 us | 21.6% | 236.0 us | 71.1% | 3.28x |
| 32768x16384 | 817.0 us | 41.1% | 472.0 us | 71.1% | 1.73x |
| 1024x18944 | 62.1 us | 19.5% | 20.0 us | 60.5% | 3.10x |
| 2048x28672 | 176.0 us | 20.9% | 43.8 us | 83.9% | 4.02x |
| 4096x28672 | 342.0 us | 21.4% | 105.0 us | 70.2% | 3.26x |
| 8192x28672 | 678.0 us | 21.6% | 206.0 us | 71.2% | 3.29x |
| 16384x28672 | 800.0 us | 36.7% | 411.0 us | 71.5% | 1.95x |
| 2048x29568 | 179.0 us | 21.2% | 44.9 us | 84.3% | 3.99x |
| 8192x29568 | 700.0 us | 21.6% | 213.0 us | 71.2% | 3.29x |
| 8192x53248 | 796.0 us | 34.3% | 384.0 us | 71.0% | 2.07x |

Arithmetic mean speedup: 3.07x | Geometric mean: 2.98x | Weighted (by elems): 2.53x
Optimized BW — Arithmetic: 4.80 TiB/s (66.0%) | Peak: 6.23 TiB/s (85.6%)

FP32->FP8 GPT-OSS MoE

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 64x2880 | 6.9 us | 1.7% | 7.1 us | 1.6% | 0.97x |
| 256x2880 | 7.9 us | 5.8% | 7.9 us | 5.8% | 1.00x |
| 320x2880 | 8.3 us | 6.9% | 8.2 us | 7.1% | 1.02x |
| 496x2880 | 9.7 us | 9.2% | 8.6 us | 10.4% | 1.13x |
| 1792x2880 | 20.5 us | 15.7% | 9.6 us | 33.8% | 2.15x |
| 64x5760 | 7.1 us | 3.3% | 7.1 us | 3.2% | 0.99x |
| 256x5760 | 10.0 us | 9.2% | 8.4 us | 11.0% | 1.19x |
| 320x5760 | 10.8 us | 10.6% | 8.5 us | 13.6% | 1.27x |
| 496x5760 | 13.8 us | 12.9% | 8.7 us | 20.4% | 1.58x |
| 1792x5760 | 36.4 us | 17.7% | 14.5 us | 44.5% | 2.51x |

Arithmetic mean speedup: 1.38x | Geometric mean: 1.30x | Weighted (by elems): 2.11x
Optimized BW — Arithmetic: 1.10 TiB/s (15.1%) | Peak: 3.24 TiB/s (44.5%)

BF16->FP8 GPT-OSS

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 2880x2880 | 18.7 us | 16.6% | 10.1 us | 30.8% | 1.85x |
| 5120x2880 | 28.3 us | 19.5% | 11.7 us | 47.4% | 2.42x |
| 5760x2880 | 31.2 us | 20.0% | 13.5 us | 45.9% | 2.31x |
| 16384x2880 | 74.8 us | 23.7% | 26.6 us | 66.6% | 2.81x |
| 2880x4096 | 23.4 us | 18.9% | 10.9 us | 40.4% | 2.15x |
| 16384x4096 | 104.0 us | 24.1% | 34.2 us | 73.7% | 3.04x |
| 16384x5120 | 127.0 us | 24.7% | 40.8 us | 77.2% | 3.11x |

Arithmetic mean speedup: 2.53x | Geometric mean: 2.49x | Weighted (by elems): 2.89x
Optimized BW — Arithmetic: 3.97 TiB/s (54.6%) | Peak: 5.62 TiB/s (77.2%)

BF16->FP8 LLM (Llama/Qwen)

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 1024x3584 | 11.8 us | 11.6% | 8.9 us | 15.4% | 1.32x |
| 1024x4096 | 12.6 us | 12.4% | 9.2 us | 17.0% | 1.36x |
| 2048x4096 | 18.9 us | 16.6% | 10.2 us | 30.9% | 1.85x |
| 4096x4096 | 31.5 us | 20.0% | 14.1 us | 44.7% | 2.23x |
| 1024x8192 | 18.9 us | 16.6% | 10.2 us | 30.9% | 1.85x |
| 2048x8192 | 31.1 us | 20.2% | 13.9 us | 45.3% | 2.24x |
| 4096x8192 | 55.7 us | 22.6% | 21.4 us | 58.8% | 2.60x |
| 8192x8192 | 104.0 us | 24.1% | 34.5 us | 73.0% | 3.01x |
| 16384x8192 | 202.0 us | 24.9% | 60.3 us | 83.5% | 3.35x |
| 32768x8192 | 398.0 us | 25.3% | 144.0 us | 70.1% | 2.76x |
| 1024x14336 | 27.9 us | 19.7% | 11.8 us | 46.8% | 2.36x |
| 2048x14336 | 49.7 us | 22.2% | 19.9 us | 55.4% | 2.50x |
| 4096x14336 | 92.8 us | 23.7% | 31.1 us | 70.7% | 2.98x |
| 4096x16384 | 103.0 us | 24.3% | 34.3 us | 73.4% | 3.00x |
| 8192x16384 | 203.0 us | 24.8% | 60.4 us | 83.3% | 3.36x |
| 16384x16384 | 398.0 us | 25.3% | 141.0 us | 71.3% | 2.82x |
| 32768x16384 | 787.0 us | 25.6% | 286.0 us | 70.5% | 2.75x |
| 1024x18944 | 34.9 us | 20.8% | 15.9 us | 45.8% | 2.19x |
| 2048x28672 | 92.9 us | 23.7% | 31.1 us | 70.7% | 2.99x |
| 4096x28672 | 178.0 us | 24.8% | 53.9 us | 81.7% | 3.30x |
| 8192x28672 | 349.0 us | 25.2% | 124.0 us | 71.0% | 2.81x |
| 16384x28672 | 693.0 us | 25.4% | 249.0 us | 70.8% | 2.78x |
| 2048x29568 | 93.8 us | 24.2% | 31.7 us | 71.6% | 2.96x |
| 8192x29568 | 359.0 us | 25.3% | 128.0 us | 70.7% | 2.80x |
| 8192x53248 | 643.0 us | 25.4% | 228.0 us | 71.7% | 2.82x |

Arithmetic mean speedup: 2.60x | Geometric mean: 2.53x | Weighted (by elems): 2.85x
Optimized BW — Arithmetic: 4.35 TiB/s (59.8%) | Peak: 6.08 TiB/s (83.5%)

BF16->FP8 GPT-OSS MoE

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 64x2880 | 6.9 us | 1.0% | 7.1 us | 1.0% | 0.97x |
| 256x2880 | 7.4 us | 3.8% | 7.7 us | 3.6% | 0.96x |
| 320x2880 | 7.6 us | 4.6% | 8.0 us | 4.3% | 0.95x |
| 496x2880 | 8.2 us | 6.6% | 8.4 us | 6.4% | 0.97x |
| 1792x2880 | 13.7 us | 14.1% | 9.4 us | 20.7% | 1.46x |
| 64x5760 | 6.9 us | 2.0% | 7.2 us | 1.9% | 0.97x |
| 256x5760 | 8.3 us | 6.7% | 8.3 us | 6.6% | 1.00x |
| 320x5760 | 8.8 us | 7.9% | 8.4 us | 8.2% | 1.04x |
| 496x5760 | 10.3 us | 10.5% | 8.7 us | 12.3% | 1.18x |
| 1792x5760 | 21.0 us | 18.4% | 10.5 us | 36.7% | 2.00x |

Arithmetic mean speedup: 1.15x | Geometric mean: 1.12x | Weighted (by elems): 1.68x
Optimized BW — Arithmetic: 0.74 TiB/s (10.2%) | Peak: 2.67 TiB/s (36.7%)

Cast+Transpose

FP32->FP8 GPT-OSS

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 2880x2880 | 33.4 us | 18.6% | 18.0 us | 34.6% | 1.86x |
| 5120x2880 | 39.0 us | 28.3% | 21.9 us | 50.4% | 1.78x |
| 5760x2880 | 41.9 us | 29.7% | 27.2 us | 45.8% | 1.54x |
| 16384x2880 | 61.3 us | 57.7% | 48.4 us | 73.2% | 1.27x |
| 2880x4096 | 44.3 us | 20.0% | 21.1 us | 41.9% | 2.10x |
| 16384x4096 | 120.0 us | 41.9% | 68.8 us | 73.2% | 1.74x |
| 16384x5120 | 115.0 us | 54.6% | 104.0 us | 60.5% | 1.11x |

Arithmetic mean speedup: 1.63x | Geometric mean: 1.59x | Weighted (by elems): 1.38x
Optimized BW — Arithmetic: 3.94 TiB/s (54.2%) | Peak: 5.32 TiB/s (73.2%)

FP32->FP8 LLM (Llama/Qwen)

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 1024x3584 | 18.0 us | 15.3% | 9.1 us | 30.1% | 1.97x |
| 1024x4096 | 19.6 us | 16.1% | 9.3 us | 33.7% | 2.10x |
| 2048x4096 | 29.0 us | 21.7% | 12.1 us | 51.9% | 2.40x |
| 4096x4096 | 29.4 us | 42.9% | 23.7 us | 53.2% | 1.24x |
| 1024x8192 | 29.0 us | 21.7% | 12.1 us | 52.0% | 2.40x |
| 2048x8192 | 28.3 us | 44.4% | 20.7 us | 60.8% | 1.37x |
| 4096x8192 | 46.5 us | 54.1% | 35.7 us | 70.6% | 1.30x |
| 8192x8192 | 119.0 us | 42.3% | 64.9 us | 77.6% | 1.83x |
| 16384x8192 | 230.0 us | 43.8% | 199.0 us | 50.5% | 1.16x |
| 32768x8192 | 443.0 us | 45.4% | 387.0 us | 52.0% | 1.14x |
| 1024x14336 | 39.0 us | 28.3% | 18.6 us | 59.2% | 2.10x |
| 2048x14336 | 41.4 us | 53.3% | 30.1 us | 73.1% | 1.38x |
| 4096x14336 | 83.4 us | 52.8% | 54.7 us | 80.5% | 1.52x |
| 4096x16384 | 103.0 us | 48.8% | 66.1 us | 76.1% | 1.56x |
| 8192x16384 | 195.0 us | 51.6% | 179.0 us | 56.1% | 1.09x |
| 16384x16384 | 403.0 us | 50.0% | 377.0 us | 53.4% | 1.07x |
| 32768x16384 | 789.0 us | 51.0% | 703.0 us | 57.2% | 1.12x |
| 1024x18944 | 33.2 us | 43.8% | 23.1 us | 62.9% | 1.44x |
| 2048x28672 | 84.3 us | 52.3% | 55.0 us | 80.1% | 1.53x |
| 4096x28672 | 163.0 us | 54.0% | 138.0 us | 64.0% | 1.18x |
| 8192x28672 | 299.0 us | 59.0% | 286.0 us | 61.6% | 1.05x |
| 16384x28672 | 615.0 us | 57.3% | 556.0 us | 63.4% | 1.11x |
| 2048x29568 | 87.3 us | 52.0% | 56.6 us | 80.2% | 1.54x |
| 8192x29568 | 310.0 us | 58.6% | 286.0 us | 63.5% | 1.08x |
| 8192x53248 | 547.0 us | 59.9% | 523.0 us | 62.6% | 1.05x |

Arithmetic mean speedup: 1.47x | Geometric mean: 1.42x | Weighted (by elems): 1.15x
Optimized BW — Arithmetic: 4.44 TiB/s (61.1%) | Peak: 5.86 TiB/s (80.5%)

FP32->FP8 GPT-OSS MoE

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 64x2880 | 8.4 us | 1.6% | 6.9 us | 2.0% | 1.22x |
| 256x2880 | 14.3 us | 3.9% | 7.8 us | 7.1% | 1.83x |
| 320x2880 | 16.6 us | 4.2% | 11.8 us | 5.9% | 1.41x |
| 496x2880 | 15.8 us | 6.8% | 18.5 us | 5.8% | 0.85x |
| 1792x2880 | 22.8 us | 17.0% | 11.2 us | 34.6% | 2.04x |
| 64x5760 | 10.3 us | 2.7% | 7.1 us | 3.9% | 1.44x |
| 256x5760 | 22.8 us | 4.8% | 8.5 us | 13.0% | 2.67x |
| 320x5760 | 17.0 us | 8.1% | 12.7 us | 10.9% | 1.34x |
| 496x5760 | 21.2 us | 10.1% | 21.6 us | 9.9% | 0.98x |
| 1792x5760 | 32.0 us | 24.2% | 15.3 us | 50.5% | 2.09x |

Arithmetic mean speedup: 1.59x | Geometric mean: 1.50x | Weighted (by elems): 1.94x
Optimized BW — Arithmetic: 1.04 TiB/s (14.3%) | Peak: 3.67 TiB/s (50.5%)

BF16->FP8 GPT-OSS

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 2880x2880 | 32.0 us | 13.0% | 23.3 us | 17.8% | 1.37x |
| 5120x2880 | 37.9 us | 19.4% | 18.3 us | 40.3% | 2.07x |
| 5760x2880 | 40.6 us | 20.4% | 27.8 us | 29.9% | 1.46x |
| 16384x2880 | 51.4 us | 45.9% | 38.0 us | 62.0% | 1.35x |
| 2880x4096 | 28.2 us | 20.9% | 17.1 us | 34.4% | 1.65x |
| 16384x4096 | 112.0 us | 30.0% | 50.6 us | 66.3% | 2.21x |
| 16384x5120 | 210.0 us | 20.0% | 54.1 us | 77.6% | 3.88x |

Arithmetic mean speedup: 2.00x | Geometric mean: 1.87x | Weighted (by elems): 2.27x
Optimized BW — Arithmetic: 3.41 TiB/s (46.9%) | Peak: 5.65 TiB/s (77.6%)

BF16->FP8 LLM (Llama/Qwen)

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 1024x3584 | 18.1 us | 10.1% | 9.3 us | 19.7% | 1.94x |
| 1024x4096 | 19.9 us | 10.6% | 9.4 us | 22.2% | 2.11x |
| 2048x4096 | 23.6 us | 17.8% | 11.2 us | 37.5% | 2.11x |
| 4096x4096 | 26.3 us | 31.9% | 16.7 us | 50.4% | 1.57x |
| 1024x8192 | 23.8 us | 17.7% | 11.3 us | 37.3% | 2.11x |
| 2048x8192 | 25.9 us | 32.4% | 17.1 us | 49.0% | 1.51x |
| 4096x8192 | 60.2 us | 27.9% | 30.3 us | 55.4% | 1.99x |
| 8192x8192 | 109.0 us | 30.8% | 50.5 us | 66.4% | 2.16x |
| 16384x8192 | 383.0 us | 17.5% | 86.4 us | 77.6% | 4.43x |
| 32768x8192 | 694.0 us | 19.3% | 286.0 us | 46.8% | 2.43x |
| 2048x14336 | 36.8 us | 39.9% | 23.8 us | 61.6% | 1.55x |
| 4096x14336 | 110.0 us | 26.7% | 41.2 us | 71.3% | 2.67x |
| 4096x16384 | 111.0 us | 30.2% | 47.4 us | 70.8% | 2.34x |
| 8192x16384 | 379.0 us | 17.7% | 84.4 us | 79.5% | 4.49x |
| 16384x16384 | 742.0 us | 18.1% | 296.0 us | 45.3% | 2.51x |
| 32768x16384 | 1.37 ms | 19.6% | 528.0 us | 50.9% | 2.60x |
| 1024x18944 | 31.8 us | 30.5% | 21.8 us | 44.5% | 1.46x |
| 2048x28672 | 111.0 us | 26.4% | 41.0 us | 71.6% | 2.71x |
| 4096x28672 | 296.0 us | 19.9% | 73.1 us | 80.4% | 4.05x |
| 8192x28672 | 590.0 us | 19.9% | 200.0 us | 58.7% | 2.95x |
| 16384x28672 | 1.19 ms | 19.7% | 388.0 us | 60.5% | 3.07x |
| 2048x29568 | 112.0 us | 27.0% | 41.8 us | 72.5% | 2.68x |
| 8192x29568 | 614.0 us | 19.7% | 190.0 us | 63.6% | 3.23x |
| 8192x53248 | 1.09 ms | 20.0% | 354.0 us | 61.7% | 3.07x |

Arithmetic mean speedup: 2.57x | Geometric mean: 2.45x | Weighted (by elems): 2.88x
Optimized BW — Arithmetic: 4.11 TiB/s (56.5%) | Peak: 5.85 TiB/s (80.4%)

BF16->FP8 GPT-OSS MoE

| Shape | Base Time | Base %Peak | Opt Time | Opt %Peak | Speedup |
|---|---|---|---|---|---|
| 64x2880 | 8.6 us | 1.1% | 10.8 us | 0.9% | 0.79x |
| 256x2880 | 14.4 us | 2.6% | 12.7 us | 2.9% | 1.13x |
| 320x2880 | 16.5 us | 2.8% | 20.5 us | 2.3% | 0.80x |
| 496x2880 | 20.6 us | 3.5% | 29.6 us | 2.4% | 0.70x |
| 1792x2880 | 22.1 us | 11.7% | 14.1 us | 18.3% | 1.57x |
| 64x5760 | 10.4 us | 1.8% | 7.2 us | 2.6% | 1.44x |
| 256x5760 | 15.0 us | 4.9% | 8.6 us | 8.6% | 1.74x |
| 320x5760 | 17.2 us | 5.4% | 12.7 us | 7.2% | 1.35x |
| 496x5760 | 23.1 us | 6.2% | 26.1 us | 5.5% | 0.89x |
| 1792x5760 | 30.5 us | 16.9% | 13.2 us | 39.0% | 2.31x |

Arithmetic mean speedup: 1.27x | Geometric mean: 1.19x | Weighted (by elems): 1.96x
Optimized BW — Arithmetic: 0.65 TiB/s (9.0%) | Peak: 2.84 TiB/s (39.0%)

Summary

Cast-Only (new 1D grid-stride kernel)

The baseline uses a generic VectorizedUnaryKernel with per-element scalar processing and no FP8 hardware intrinsics.

| Metric | FP32->FP8 | BF16->FP8 |
|---|---|---|
| Peak optimized BW | 6.26 TiB/s (86.0%) | 6.08 TiB/s (83.5%) |
| Geometric mean speedup (LLM shapes) | ~3.0x | ~2.7x |
| Geometric mean speedup (GPT-OSS) | ~3.1x | ~2.6x |

Cast+Transpose (optimized tiled kernel)

The baseline uses the NVIDIA RTC kernel compiled via hipRTC, which selects tile sizes dynamically but lacks AMD-specific optimizations (no FP8 intrinsics, no NT stores, no occupancy-aware LOAD cap).

| Metric | FP32->FP8 | BF16->FP8 |
|---|---|---|
| Peak optimized BW | 5.84 TiB/s (80.3%) | 5.87 TiB/s (80.6%) |
| Geometric mean speedup (LLM shapes) | ~1.4x | ~2.2x |
| Geometric mean speedup (GPT-OSS) | ~1.5x | ~2.1x |

Optimizations Applied (Kept)

Cast+Transpose Kernel (rocm_cast_transpose.cuh)

  1. OVecT packed FP8 shared memory — smem stores CVec<fp8,8> instead of float, 4x smaller footprint, avoids bank conflicts with +1 padding
  2. Register transpose during load — accumulate transposed FP8 into local_t[j2][iter].val[i2] during the load phase, avoiding a separate transpose pass
  3. Non-temporal stores for output_c (rowwise) and output_t (transposed) — __builtin_nontemporal_store confirmed as global_store_* ... nt in assembly
  4. gfx950 FP8 packed intrinsics — __builtin_amdgcn_cvt_scalef32_pk_fp8_f32 with scale=1.0 and pre-multiply (the intrinsic's scale param is E8M0 format, not an arbitrary float)
  5. word_select packing — two intrinsic calls with word_select=false/true pack 4 FP8 values into one uint32
  6. Two-launch row strategy — STORE=8 for bulk, then best-fit single launch for remainder (STORE=4 if rem%128==0, STORE=2 if rem%64==0, general kernel otherwise). Max 2 launches for any M value
  7. Column cascade — LOAD_SZ checks for single-launch alignment, then cascades to smaller LOAD sizes for remainder columns
  8. CVec standalone vector type — aligned vector struct with load(), nt_load(), store(), nt_store() methods. No dependency on TE's Vec infrastructure
  9. BF16/FP16 LOAD capped at 8 — LOAD=16 for BF16 uses 211 VGPRs (2 waves/SIMD). Capping at LOAD=8 uses 125 VGPRs (4 waves/SIMD), doubling occupancy
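The two-launch row strategy in item 6 reduces to a small selection rule for the remainder launch. A sketch under stated assumptions (`pick_remainder_store` is a hypothetical helper; `rem` is the row count left after the STORE=8 bulk launch, and the moduli are the ones quoted above, not verified against the PR's code):

```cpp
// Returns the STORE size for the (at most one) remainder launch:
// 0 means the STORE=8 bulk launch already covered every row, and
// 1 denotes the general fallback kernel. Checked largest-first, so a
// remainder divisible by 128 takes STORE=4 rather than STORE=2.
int pick_remainder_store(int rem) {
    if (rem == 0) return 0;        // bulk covered all rows
    if (rem % 128 == 0) return 4;  // STORE=4 remainder launch
    if (rem % 64 == 0) return 2;   // STORE=2 remainder launch
    return 1;                      // general kernel
}
```

Whatever `rem` turns out to be, the kernel count is bounded at two launches per M value, versus up to five in the rejected full cascade.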

Cast-Only Kernel (rocm_cast.cuh)

  1. Dedicated 1D grid-stride kernel — flat 1D indexing over M*N elements. No tiling, no cascade, single kernel launch for any shape
  2. 256 threads/block, 16 elements/thread — 42 VGPRs (from assembly), 10 waves/SIMD max, 0 scratch, 32 bytes LDS (amax only)
  3. Direct FP8 packing into OVec — intrinsic results written directly into the output CVec via reinterpret_cast<uint32_t*>. No intermediate converted[] array, which preserves the NT store hint through the compiler
  4. Non-temporal stores — CVec::nt_store() confirmed as global_store_dwordx4 ... nt in assembly (required eliminating the intermediate array to prevent the compiler from dropping the NT hint)
  5. gfx950 FP8 packed intrinsics — same as cast+transpose, 4 intrinsic pairs per 16 elements
  6. Dynamic grid sizing — cu_count blocks for FP32 and small BF16 tensors; cu_count*2 for BF16 tensors >128M elements (crossover point determined empirically)
  7. Scalar tail for non-aligned element counts (rarely exercised — model dimensions are multiples of 16+)

Optimizations Tried and Rejected

Cast+Transpose

  1. Thin-M kernel (1 thread/column, row-splitting for M<256) — only 12 blocks for N=2880 across 256 CUs (4.7% utilization). The tiled cascade was 2-3x faster for M>=64
  2. Hardware transpose via ds_read_tr8_b64 (gfx950 v2 kernel) — identical performance to v1. The output_t scattered write pattern is the bottleneck, not the transpose method
  3. WARP_SIZE=64 — creates TILE_M=512 and smem=33KB, exceeding practical LDS budget per workgroup
  4. Multi-tile K=2 (2 column tiles per block) — helps FP32 but catastrophically hurts BF16 on small shapes by halving block count
  5. Full row cascade (STORE 8->4->2->1->scalar) — produces up to 5 kernel launches for non-aligned M (e.g., M=496). Replaced with two-launch strategy: STORE=8 + best-fit remainder (STORE=4/2/general)

Cast-Only

  1. DO_TRANSPOSE template on tiled kernel — reused the cast+transpose kernel with transpose disabled. Tiled structure imposed unnecessary alignment constraints and cascade overhead. The dedicated 1D kernel was 10-20% faster on small shapes and eliminated all cascade issues
  2. Unconditional non-temporal loads — severe regressions across all shapes (up to -35% on 16384x4096). The L2 cache provides value for coalescing/prefetching even with read-once data on MI355X. NT loads deprioritize LRU eviction but also appear to disable hardware prefetch
  3. Conditional NT loads (runtime branch on tensor size >512MB) — LLVM merged both branch paths during optimization and dropped the !nontemporal metadata. Only unconditional NT or template-parameterized NT can emit the nt flag
  4. Grid size 1024 blocks — uniformly worse than 512 or 256. More blocks = more scheduling overhead + more atomicMax contention on amax
  5. Grid size 128 blocks — large shapes collapse (1004 us vs 668 us for FP32 131072x5760). Insufficient parallelism for HBM saturation
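The fix implied by item 3, making the NT decision a compile-time template parameter rather than a runtime branch, looks roughly like this sketch (each instantiation is a separate function, so the optimizer cannot merge an NT path with a cached path and drop the hint; `store_elem` is a hypothetical wrapper, and the clang builtin is guarded so a plain store is the portable fallback):

```cpp
// NT as a template parameter: the nontemporal flag survives because the
// two paths are distinct instantiations, never a merged runtime branch.
// __builtin_nontemporal_store is a clang/hipcc builtin.
template <bool NT, typename T>
inline void store_elem(T* dst, T v) {
#if defined(__clang__)
    if constexpr (NT) {
        __builtin_nontemporal_store(v, dst);
        return;
    }
#endif
    *dst = v;
}
```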

Known Limitations

  • Cast+Transpose output_t scatter — the transposed output writes to output_t[col * num_rows + row], where adjacent threads write to cache lines spaced num_rows bytes apart. This scattered pattern caps bandwidth at ~80% of peak regardless of kernel optimizations
  • BF16 2880-col cascade — 2880 is not divisible by TILE_N=128 for BF16 (LOAD=8), requiring a 2-launch column cascade (2816 + 64 cols)
  • Cast+Transpose MoE regression for non-aligned M — shapes like 320x2880 and 496x2880 trigger the general-kernel fallback, which is slower than the baseline RTC kernel's single-launch approach. Production MoE workloads use multi_cast_transpose (batched) which amortizes this
  • ECC overhead — HBM3E uses on-die ECC, consuming ~6.25% of raw bandwidth for parity metadata. The theoretical ceiling for any streaming kernel is ~93.75% of advertised peak
  • Scale multiply overhead — the FP8 packed intrinsic's scale parameter uses E8M0 (power-of-2 exponent) format, not arbitrary float. We must pre-multiply by scale and pass 1.0 to the intrinsic, adding 16-128 extra v_mul_f32 instructions per tile
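The E8M0 constraint in the last bullet can be checked numerically: a scale is representable for the intrinsic only if it is exactly a power of two (8-bit exponent, no mantissa bits). A small sketch (`representable_as_e8m0` is a hypothetical helper, not part of the PR; subnormal/range limits of the real format are ignored here):

```cpp
#include <cmath>

// True iff s is exactly a power of two, i.e. expressible with an
// exponent alone. frexp decomposes s = mant * 2^exp with mant in
// [0.5, 1), so powers of two are exactly those with mant == 0.5.
bool representable_as_e8m0(float s) {
    if (s <= 0.0f) return false;
    int exp;
    float mant = std::frexp(s, &exp);
    return mant == 0.5f;
}
```

Any other scale (e.g. an amax-derived 0.3) forces the pre-multiply path described above, with 1.0 passed as the intrinsic scale.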

@alextmagro alextmagro force-pushed the ct_opt branch 3 times, most recently from 5c5f6fe to 9435f7a Compare April 4, 2026 05:30
@alextmagro alextmagro marked this pull request as ready for review April 4, 2026 05:31
@alextmagro alextmagro added the ci-level 3 CI test level 3 label Apr 4, 2026
* License for AMD contributions = MIT. See LICENSE for more information
************************************************************************/
#pragma once
//#include "hip/hip_runtime.h" //dummy include to prevent hipification adding this header
Collaborator

Contrary to the ROCm-specific files in common/cast/mxfp8, this one uses the CUDA API and/or datatypes, so it will be hipified. Let's switch to the HIP API then.

Contributor (PR author)

I have removed the CUDA API code. One note: I needed to replace cuda::sm_count() to avoid hipification, so I have added a static const lambda to grab that value once.
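For reference, the "static const lambda" mentioned here is the standard once-only initialization idiom: a function-local static initialized by an immediately-invoked lambda runs its query exactly once (thread-safe since C++11). A sketch with a stand-in for the device query (`query_cu_count` and the counter are illustrative; the real code would presumably call something like hipGetDeviceProperties):

```cpp
namespace {
int g_calls = 0;
// Hypothetical stand-in for the device-properties query.
int query_cu_count() { ++g_calls; return 256; }
}  // namespace

// The static is initialized by the lambda on first call only; later
// calls just return the cached value without re-querying the device.
int cached_cu_count() {
    static const int count = [] { return query_cu_count(); }();
    return count;
}
```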

#include "cast_transpose.h"

#ifdef __HIP_PLATFORM_AMD__
#include "rocm_cast_transpose.cuh"
Collaborator

Should it be in the detail namespace? And let's guard cast_transpose_general_kernel and so on, since they are not used anymore.

Contributor (PR author)

It should be like upstream, so I have added the dispatch function to detail. Unused NV code is now guarded.

@alextmagro alextmagro requested a review from ipanfilo April 6, 2026 18:40