Feature Request
The CUDA EP GroupQueryAttention kernel enforces MAX_HEAD_SIZE = 256, rejecting any model with head_dim > 256. This forces such layers to fall back to the standard Attention op, whose unfused runner produces NaN for fp16 (see #28195).
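For reference, the failure is visible at the session level by running the fp16 model on the CUDA EP and scanning the outputs for NaN. The sketch below is only illustrative: the model path and the random input feed are placeholders, not part of this report.

```python
import numpy as np
import onnxruntime as ort

# Placeholder path: any fp16 export whose full-attention layers fall back to
# the unfused Attention op (head_dim > 256) should reproduce the NaN outputs.
MODEL_PATH = "gemma_fp16.onnx"

sess = ort.InferenceSession(MODEL_PATH, providers=["CUDAExecutionProvider"])

# Dummy feed: bind dynamic dimensions to 1 and fill inputs with small random
# fp16 values, or zeros for integer inputs such as token ids.
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    if inp.type == "tensor(float16)":
        feeds[inp.name] = np.random.rand(*shape).astype(np.float16)
    else:
        feeds[inp.name] = np.zeros(shape, dtype=np.int64)

outputs = sess.run(None, feeds)
print("NaN in outputs:", any(np.isnan(np.asarray(o, dtype=np.float32)).any() for o in outputs))
```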
Motivation
Models like Gemma 4 use a hybrid attention architecture with two head dimensions:
Sliding-attention layers: head_dim=256 → GQA works perfectly ✓
Full-attention layers: head_dim=512 → GQA rejects, must use Attention → NaN on CUDA
This creates a situation where the model cannot run on CUDA EP at all for the full-attention layers, because:
Full-attention layers require head_dim=512
GQA's head_dim <= 256 limit rejects them, so they fall back to the Attention op, which produces NaN in fp16
Which op each layer actually ends up with can be checked in the exported graph, as sketched below.
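This is only an illustrative sketch (the model path is a placeholder): it counts the GroupQueryAttention and Attention nodes that the exporter/optimizer actually emitted for the hybrid model.

```python
import onnx
from collections import Counter

# Placeholder path for the exported hybrid-attention model.
model = onnx.load("gemma_fp16.onnx")

# Per the fallback described above, the head_dim=256 layers become
# GroupQueryAttention nodes while the head_dim=512 layers remain plain
# Attention nodes in the exported graph.
counts = Counter(
    node.op_type
    for node in model.graph.node
    if node.op_type in ("GroupQueryAttention", "Attention")
)
print(counts)
```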
Requested behavior
Extend the GroupQueryAttention CUDA kernel to support head_dim > 256 (at least up to 512). This would allow models like Gemma 4 to use GQA for all layers, avoiding the broken unfused Attention runner entirely.
Workaround
Currently none for CUDA EP with these layers in fp16. CPU EP works correctly.
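Pinning the session to the CPU EP is therefore the only way to get correct fp16 outputs today; a minimal sketch (the model path is a placeholder):

```python
import onnxruntime as ort

# Placeholder path. Routing the session to the CPU EP avoids the broken
# CUDA Attention fallback entirely, at the cost of much slower inference.
sess = ort.InferenceSession("gemma_fp16.onnx", providers=["CPUExecutionProvider"])
```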
Related
#28195: the standard Attention op's unfused runner produces NaN for fp16 on the CUDA EP