Solve underallocation in VSWA+/VGQA #4667

netanel-haber · 2025-05-26T13:38:11Z

Solves under-allocation of KV-cache memory for non-homogeneous models: VSWA, VGQA, or both.

About a month ago, I merged a hefty PR: #3028.
That PR created a so-called WindowBlockManager for each window size, each managing its own 'mini' KV cache. Each manager can manage multiple pools, one per unique number of KV heads. See docs.
Part of that work was dividing the free memory ("number of blocks") between the WindowBlockManagers by calculating each manager's weighted share.
In pseudo-code:
1. For each window size w: weight[w] = w × (number of layers that use w)
2. totalWeight = sum of weight[w] over all window sizes
3. For each window size w: blocks[w] = totalBlocks × weight[w] / totalWeight
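The pseudo-code above can be sketched in C++ roughly as follows. This is a minimal illustration, not the actual TensorRT-LLM code: the function name `splitBlocksByWindow` and the `layersPerWindow` map (window size to number of layers using it) are hypothetical names chosen for this example.

```cpp
#include <cstdint>
#include <map>

// Hypothetical helper (not the TensorRT-LLM API): splits totalBlocks between
// window sizes using weight[w] = w * (number of layers that use w).
std::map<int, std::int64_t> splitBlocksByWindow(
    std::map<int, int> const& layersPerWindow, std::int64_t totalBlocks)
{
    // Step 1 and 2: accumulate the total weight over all window sizes.
    std::int64_t totalWeight = 0;
    for (auto const& entry : layersPerWindow)
    {
        totalWeight += static_cast<std::int64_t>(entry.first) * entry.second;
    }

    std::map<int, std::int64_t> blocks;
    if (totalWeight == 0)
    {
        return blocks; // no layers, nothing to allocate
    }

    // Step 3: each window size receives its weighted share of the blocks.
    for (auto const& entry : layersPerWindow)
    {
        std::int64_t const weight = static_cast<std::int64_t>(entry.first) * entry.second;
        blocks[entry.first] = totalBlocks * weight / totalWeight; // integer division rounds down
    }
    return blocks;
}
```

Note that every block is treated as the same unit here, regardless of which manager owns it.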

However, the calculation above has two bugs. Both stem from the same oversight: for a non-homogeneous model (VSWA [Variable Sliding Window Attention], VGQA [Variable GQA], or both), "blocks" is too generic a unit, because blocks in different managers do not cost the same amount of memory:

1. VGQA: the weight ignores the number of KV heads in each layer. Layers with fewer KV heads produce cheaper blocks.
2. VSWA: the weight ignores that each manager handles only a subset of the layers, so blocks within a given WindowBlockManager span fewer than all layers and are therefore cheaper than full blocks. [This is not specific to VSWA: any implementation that divides layers between managers needs to account for it.]

This PR addresses both shortcomings by including these properties in the weights.
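One way to account for both properties is to split a byte budget instead of a generic block count, and to convert each share into blocks using that manager's own block cost. The sketch below assumes this approach; it is not taken from the PR's diff. `LayerSpec`, `blocksPerWindow`, `freeBytes`, `tokensPerBlock`, and `bytesPerHeadPerToken` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical description of one attention layer; not a TensorRT-LLM type.
struct LayerSpec
{
    int windowSize; // attention window, in tokens
    int numKvHeads; // KV heads in this layer (varies under VGQA)
};

// Hypothetical helper: splits a byte budget between window sizes, then converts
// each share to a block count using that window's own block cost.
std::map<int, std::int64_t> blocksPerWindow(std::vector<LayerSpec> const& layers,
    std::int64_t freeBytes, int tokensPerBlock, std::int64_t bytesPerHeadPerToken)
{
    // Bytes needed to cache one token across only the layers that use each window.
    std::map<int, std::int64_t> bytesPerToken;
    for (auto const& layer : layers)
    {
        bytesPerToken[layer.windowSize] += layer.numKvHeads * bytesPerHeadPerToken;
    }

    // weight[w] = w * bytesPerToken[w]: bytes needed to hold one full window.
    std::int64_t totalWeight = 0;
    for (auto const& entry : bytesPerToken)
    {
        totalWeight += entry.first * entry.second;
    }

    std::map<int, std::int64_t> blocks;
    if (totalWeight == 0 || tokensPerBlock <= 0)
    {
        return blocks;
    }

    for (auto const& entry : bytesPerToken)
    {
        std::int64_t const weight = entry.first * entry.second;
        std::int64_t const bytesPerBlock = entry.second * tokensPerBlock;
        // Use a floating-point share to avoid overflowing freeBytes * weight.
        double const share = static_cast<double>(weight) / static_cast<double>(totalWeight);
        auto const bytesForWindow = static_cast<std::int64_t>(share * static_cast<double>(freeBytes));
        blocks[entry.first] = bytesForWindow / bytesPerBlock;
    }
    return blocks;
}
```

Under these assumptions, a homogeneous model with a single window size gets a share of 1, so the result reduces to `freeBytes / bytesPerBlock`. A manager with fewer layers or fewer KV heads has a smaller `bytesPerBlock`, so it receives more blocks for the same number of bytes.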

Signed-off-by: Netanel Haber <[email protected]>

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h

cpp/include/tensorrt_llm/runtime/modelConfig.h

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber · 2025-05-26T21:39:58Z

/bot run

tensorrt-cicd · 2025-05-26T21:46:07Z

PR_Github #6506 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-26T21:58:41Z

PR_Github #6506 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4762 completed with status: 'FAILURE'

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber · 2025-05-27T11:19:51Z

/bot run

tensorrt-cicd · 2025-05-27T11:30:14Z

PR_Github #6629 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-27T14:51:21Z

PR_Github #6629 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4847 completed with status: 'FAILURE'

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber · 2025-05-28T08:32:29Z

/bot run

tensorrt-cicd · 2025-05-28T08:47:57Z

PR_Github #6744 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-28T12:10:35Z

PR_Github #6744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4919 completed with status: 'FAILURE'

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber · 2025-06-10T16:06:06Z

/bot run

tensorrt-cicd · 2025-06-10T16:11:56Z

PR_Github #8327 [ run ] triggered by Bot

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h

tensorrt-cicd · 2025-06-11T06:18:54Z

PR_Github #8327 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6031 completed with status: 'FAILURE'

netanel-haber · 2025-06-11T09:04:12Z

/bot run

tensorrt-cicd · 2025-06-11T09:10:21Z

PR_Github #8462 [ run ] triggered by Bot

netanel-haber · 2025-06-12T03:37:09Z

/bot skip --comment "Successful CI has run multiple times and has failed due to flaky tests"

netanel-haber · 2025-06-12T03:37:38Z

/bot skip --comment "Successful CI has run multiple times and has failed due to flaky tests"

tensorrt-cicd · 2025-06-12T03:43:19Z

PR_Github #8595 [ skip ] triggered by Bot

tensorrt-cicd · 2025-06-12T03:43:35Z

PR_Github #8596 [ skip ] triggered by Bot

tensorrt-cicd · 2025-06-12T03:43:38Z

PR_Github #8595 [ skip ] completed with state ABORTED

tensorrt-cicd · 2025-06-12T03:44:19Z

PR_Github #8462 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-12T04:12:42Z

PR_Github #8596 [ skip ] completed with state SUCCESS
Skipping testing for commit 37c8e56

allocate blocks per window size correctly

012f8ad

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber requested a review from a team as a code owner May 26, 2025 13:38

netanel-haber requested a review from pcastonguay May 26, 2025 13:38

Merge branch 'main' into user/nhaber/fix-variable-window-size-underal…

d5da328

…location

netanel-haber requested review from Funatiq and removed request for pcastonguay May 26, 2025 13:47

pcastonguay requested review from Shixiaowei02 and thorjohnsen May 26, 2025 14:32

Funatiq reviewed May 26, 2025

netanel-haber commented May 26, 2025

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (outdated, resolved)

netanel-haber commented May 26, 2025

cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (outdated, resolved)

netanel-haber added 4 commits May 26, 2025 20:02

simpler code path for common homogeneous models

f5265e6

Signed-off-by: Netanel Haber <[email protected]>

shorten: (b|B)locksPerWindowSize -> blocksPerWindow

f94e3c5

Signed-off-by: Netanel Haber <[email protected]>

fix trivial test compile errors

f3c3c63

Signed-off-by: Netanel Haber <[email protected]>

fix non-trivial compile errors

5e9c4de

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber force-pushed the user/nhaber/fix-variable-window-size-underallocation branch from 5790310 to 5e9c4de Compare May 26, 2025 21:33

fix resource manager

3f12bd5

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber changed the title ~~allocate blocks per window size correctly~~ Solve underallocation in VSWA+/VGQA May 27, 2025

fix extracostmemory

aad71d8

Signed-off-by: Netanel Haber <[email protected]>

minimize diff

0caef2d

Signed-off-by: Netanel Haber <[email protected]>

fix

2e086a0

Signed-off-by: Netanel Haber <[email protected]>

netanel-haber added 2 commits June 10, 2025 19:01

make logging quieter

13edfc1

Signed-off-by: Netanel Haber <[email protected]>

add ceremony

029d7e6

Signed-off-by: Netanel Haber <[email protected]>

Funatiq approved these changes Jun 10, 2025

symphonylyh reviewed Jun 10, 2025

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (resolved)

juney-nvidia removed request for a team June 11, 2025 08:16

Shixiaowei02 requested a review from chuangz0 June 11, 2025 08:23

Merge branch 'main' into user/nhaber/fix-variable-window-size-underal…

ae25a10

…location

thorjohnsen approved these changes Jun 11, 2025

juney-nvidia approved these changes Jun 11, 2025

Merge branch 'main' into user/nhaber/fix-variable-window-size-underal…

37c8e56

…location

Shixiaowei02 approved these changes Jun 12, 2025

netanel-haber merged commit e692779 into NVIDIA:main Jun 12, 2025
3 checks passed

qixiang-99 mentioned this pull request Jun 12, 2025

Feat/pytorch vswa kvcachemanager #5151

Merged

6 tasks

Funatiq mentioned this pull request Jun 14, 2025

[nvbug/5195657][fix] fix reset spec buffer and update mMaxAttentionWindowVec logic #4904

Closed

chuangz0 mentioned this pull request Jun 28, 2025

chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service #5234

Merged

netanel-haber deleted the user/nhaber/fix-variable-window-size-underallocation branch July 1, 2025 09:55

lkm2835 mentioned this pull request Jul 31, 2025

Using max_attention_window (VSWA) reduces concurrent batch size and causes drop in throughput (gemma3 trt backend) #6503

Open

4 tasks

8 participants
