**Claude results breakdown and process analysis**

**Hardware Bandwidth Calculation**
**Cast-Only**

| Workload | Arithmetic mean | Geometric mean | Weighted (by elems) |
|---|---|---|---|
| FP32->FP8 GPT-OSS | 3.18x | 3.14x | 3.50x |
| FP32->FP8 LLM (Llama/Qwen) | 3.07x | 2.98x | 2.53x |
| FP32->FP8 GPT-OSS MoE | 1.38x | 1.30x | 2.11x |
| BF16->FP8 GPT-OSS | 2.53x | 2.49x | 2.89x |
| BF16->FP8 LLM (Llama/Qwen) | 2.60x | 2.53x | 2.85x |
| BF16->FP8 GPT-OSS MoE | 1.15x | 1.12x | 1.68x |

**Cast+Transpose**

| Workload | Arithmetic mean | Geometric mean | Weighted (by elems) |
|---|---|---|---|
| FP32->FP8 GPT-OSS | 1.63x | 1.59x | 1.38x |
| FP32->FP8 LLM (Llama/Qwen) | 1.47x | 1.42x | 1.15x |
| FP32->FP8 GPT-OSS MoE | 1.59x | 1.50x | 1.94x |
| BF16->FP8 GPT-OSS | 2.00x | 1.87x | 2.27x |
| BF16->FP8 LLM (Llama/Qwen) | 2.57x | 2.45x | 2.88x |
| BF16->FP8 GPT-OSS MoE | 1.27x | 1.19x | 1.96x |

**Summary**

**Cast-Only (new 1D grid-stride kernel).** The baseline uses a generic
**Cast+Transpose (optimized tiled kernel).** The baseline uses the NVIDIA RTC kernel compiled via hipRTC, which selects tile sizes dynamically but lacks AMD-specific optimizations (no FP8 intrinsics, no NT stores, no occupancy-aware LOAD cap).
**Optimizations Applied (Kept)**

Cast+Transpose Kernel (
`5c5f6fe` to `9435f7a` (Compare)
```cpp
 * License for AMD contributions = MIT. See LICENSE for more information
 ************************************************************************/
#pragma once
//#include "hip/hip_runtime.h" //dummy include to prevent hipification adding this header
```
Unlike the ROCm-specific files in common/cast/mxfp8, this one uses the CUDA API and/or datatypes, so it will be hipified. Let's switch to the HIP API then.
I have removed the CUDA API code. One note -- I needed to replace cuda::sm_count() to avoid hipification, so I have added a static const lambda to grab that value once.
```cpp
#include "cast_transpose.h"

#ifdef __HIP_PLATFORM_AMD__
#include "rocm_cast_transpose.cuh"
```
Should it be in the detail namespace? And let's guard cast_transpose_general_kernel and so on, since they are not used anymore.
It should match upstream, so I have added the dispatch function to detail. The unused NV code is now guarded.
Improvements to cast_transpose and cast for FP8 delayed scaling
Introduced ROCm-specific cast and cast+transpose functions tuned for MI350 and MI300 GPUs.

For memory-bound kernels:
- Cast Only: 2.85x average speedup
- Cast+Transpose: 2.0x average speedup

This PR contains benchmarking scripts, so it was branched off of #507.