[float8] Add fnuz fp8 dtypes to Float8Layout #2351
Dr. CI: artifacts and rendered test results are at hud.pytorch.org/pr/pytorch/ao/2351. Links to docs will display an error until the docs builds have completed.
As of commit 98eb0dc with merge base 16e2d0a: 1 new failure, 2 unrelated failures (one flaky on trunk, one already broken on the merge base). Rebasing onto the `viable/strict` branch avoids the unrelated failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Trying to think of the best way to test this, but I don't think it's that simple: by default we dequantize and then do a dense matmul, so testing correctness alone is not enough here. Calling [...] Any opinions on turning off the dequantize -> dense op fallback by default (or putting it behind a flag), or will that break a lot of things?
@@ -442,7 +442,7 @@ def _linear_fp_act_fp8_weight_check(
         # weight is float8 quantized affine quantized tensor
         isinstance(weight_tensor, AffineQuantizedTensor)
         and isinstance(weight_tensor._layout, Float8Layout)
-        and weight_tensor.tensor_impl.dtype in [torch.float8_e4m3fn, torch.float8_e5m2]
+        and _is_float8_type(weight_tensor.tensor_impl.dtype)
So previously it was using the fallback? We should probably have a way to check that the kernel is called, or just remove the fallback.
We can try removing the fallback in a PR, I think; it might be OK.
The fallback is still the default behavior. There is also a flag for specific kernel choice if people want to make sure they are testing a specific kernel path.
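To illustrate the testing concern raised in this thread, here is a minimal pure-Python sketch (not torchao code). The names `fast_kernel`, `fallback`, `linear`, and `calls` are hypothetical stand-ins. It shows that a correctness-only assertion passes on either path, so a test would also need to check which implementation actually ran.

```python
# Hypothetical stand-ins: plain lists instead of tensors, and a module-level
# log that records which implementation handled each call.
calls = []


def fast_kernel(x, w):
    """Stand-in for a specialized fp8 kernel."""
    calls.append("fast_kernel")
    return [xi * w for xi in x]


def fallback(x, w):
    """Stand-in for the dequantize -> dense op fallback."""
    calls.append("fallback")
    return [xi * w for xi in x]


def linear(x, w, supported):
    """Pick the fast kernel when the dispatch check passes, else fall back."""
    impl = fast_kernel if supported else fallback
    return impl(x, w)


out_fast = linear([1.0, 2.0], 3.0, supported=True)
out_slow = linear([1.0, 2.0], 3.0, supported=False)

# Both paths give the same numbers, so this check cannot tell them apart.
assert out_fast == out_slow
# Asserting on the recorded path is what actually detects a silent fallback.
assert calls == ["fast_kernel", "fallback"]
```

In a real test suite the same idea could be expressed with a mock or a counter around the kernel entry point; the recording list here is just the simplest form of it.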
This should give us AMD perf on vLLM. With Phi-4-mini-instruct on MI300x with TorchAO FP8 rowwise quant on the MLP I see the following, which is about a 5% speedup:
For comparison, here is the baseline Phi-4-mini-instruct on MI300x:
Previously, these checks were failing on the unsigned zero ROCm fp8 dtypes, causing us to call `.dequantize()` and then do a bfloat16 mm, which was slower than the bf16 baseline (~2s).
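The effect described above can be sketched in plain Python. This is an illustration under stated assumptions, not the torchao dispatch code: dtypes are represented by their names as strings (in PyTorch they would be `torch.float8_*` dtype objects), and `narrow_check`, `broad_check`, `dispatch`, and the returned path names are all hypothetical.

```python
# Stand-in dtype names; strings are used so the sketch runs without PyTorch.
OCP_FP8 = {"float8_e4m3fn", "float8_e5m2"}
FNUZ_FP8 = {"float8_e4m3fnuz", "float8_e5m2fnuz"}
ALL_FP8 = OCP_FP8 | FNUZ_FP8


def narrow_check(weight_dtype):
    """Hard-coded list that omits the fnuz variants (the old behavior)."""
    return weight_dtype in OCP_FP8


def broad_check(weight_dtype):
    """Accepts every fp8 variant, including fnuz (the intent of this PR)."""
    return weight_dtype in ALL_FP8


def dispatch(weight_dtype, check):
    """Return the name of the path a linear op would take."""
    if check(weight_dtype):
        return "fp8_kernel"
    # No specialized kernel matched: dequantize and run a dense bf16 matmul.
    return "dequantize_then_bf16_mm"


for dtype in ("float8_e4m3fn", "float8_e4m3fnuz"):
    print(dtype, dispatch(dtype, narrow_check), dispatch(dtype, broad_check))
```

With the narrow check, the fnuz dtype silently takes the fallback path even though the result is still numerically valid, which is why the regression showed up as a slowdown rather than as a test failure.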