@ikurtchen ikurtchen commented Nov 18, 2025

This feature is controlled by the following environment variables:

  • VLLM_FUSEDSDPA_QKV_SLICE_SEQ_LEN_THLD
    • QKV slicing is enabled when the query or context length exceeds this threshold.
    • The default threshold is 8192.
    • Set to 0 to disable the feature.
  • VLLM_FUSEDSDPA_Q_SLICE_CHUNK_SIZE
    • The Q chunk size for the full-attention part.
  • VLLM_FUSEDSDPA_KV_SLICE_CHUNK_SIZE
    • The KV chunk size for the full-attention part.
  • VLLM_FUSEDSDPA_CAUSAL_QKV_SLICE_CHUNK_SIZE
    • The QKV chunk size for the causal-attention part.
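
For reviewers, here is a rough sketch of how the threshold gate described above might behave. It is not the code in this PR: the helper names (`read_int_env`, `should_slice`, `chunk_ranges`) are hypothetical, "exceeds" is assumed to mean strictly greater than, and the chunk size of 4096 in the usage lines is only an illustrative value, since the description does not state the chunk-size defaults.

```python
import os


def read_int_env(name, default, env=None):
    """Read an integer setting from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return int(env.get(name, default))


def should_slice(q_len, ctx_len, threshold):
    """Return True when QKV slicing would be enabled.

    Assumption: a threshold of 0 disables the feature, and slicing starts
    only when the query or context length is strictly above the threshold.
    """
    return threshold > 0 and (q_len > threshold or ctx_len > threshold)


def chunk_ranges(seq_len, chunk_size):
    """Split [0, seq_len) into consecutive (start, end) chunks."""
    return [(start, min(start + chunk_size, seq_len))
            for start in range(0, seq_len, chunk_size)]


# 8192 is the default threshold stated in the description.
threshold = read_int_env("VLLM_FUSEDSDPA_QKV_SLICE_SEQ_LEN_THLD", 8192)
if should_slice(q_len=10000, ctx_len=0, threshold=threshold):
    # 4096 is an illustrative chunk size, not a documented default.
    q_chunks = chunk_ranges(10000, 4096)
```

With the default threshold, a 10000-token query would be split into three Q chunks in this sketch, while a query of exactly 8192 tokens would not be sliced.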

@ikurtchen ikurtchen force-pushed the kurt/fusedsdpa_qkv_slice branch from 8c152d3 to 6c17418 Compare November 21, 2025 14:44