@netanel-haber
Collaborator

@netanel-haber netanel-haber commented May 26, 2025

This PR fixes under-allocation of KV-cache memory when a non-homogeneous model is used: VSWA, VGQA, or both.

About a month ago, I merged a hefty PR: #3028.
That PR created a so-called WindowBlockManager for each window size, each managing its own 'mini' KV cache. Each manager manages potentially multiple pools, each corresponding to a unique number of KV heads. See docs.
Part of that work was divvying up the free memory ("number of blocks") between the WindowBlockManagers by calculating each manager's weighted share.
In pseudo-code:
1. For each window size w: weight[w] = w × (number of layers that use w)
2. totalWeight = sum of weight[w] over all window sizes
3. For each window size w: blocks[w] = totalBlocks × weight[w] / totalWeight
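
As a minimal sketch, the pseudo-code above can be written out in Python as follows. The function name `naive_blocks_per_window` and its parameters are hypothetical and are not taken from the codebase; the integer rounding is also an assumption.

```python
from collections import Counter


def naive_blocks_per_window(layer_windows, total_blocks):
    """Split total_blocks between window sizes by weight = w * layer count.

    layer_windows: one window size per attention layer (hypothetical input).
    """
    layers_per_window = Counter(layer_windows)
    # Step 1: weight[w] = w * (number of layers that use w)
    weights = {w: w * n for w, n in layers_per_window.items()}
    # Step 2: totalWeight = sum of all weights
    total_weight = sum(weights.values())
    # Step 3: blocks[w] = totalBlocks * weight[w] / totalWeight (rounded down)
    return {w: total_blocks * wt // total_weight for w, wt in weights.items()}


# Three layers with window 1024 and one layer with window 4096.
example = naive_blocks_per_window([1024, 1024, 1024, 4096], 7000)
```

Note that every block is treated as equally expensive here, regardless of which manager it ends up in.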

However, the calculation above has two bugs. Both stem from the fact that the term "blocks" becomes too generic for a non-homogeneous model, that is, one using VSWA [Variable Sliding Window Attention], VGQA [Variable GQA], or both:

  1. In the case of VGQA: the weight doesn't take the number of KV heads in each layer into account. Layers with fewer KV heads produce cheaper blocks.
  2. In the case of VSWA: the weight doesn't take into account that each manager manages only a subset of the layers, so blocks within a given WindowBlockManager span fewer than all layers and are therefore cheaper than full blocks. [This isn't specific to VSWA: any implementation that divides layers between managers needs to take it into account.]

This PR addresses these shortcomings by factoring both properties into the weights.
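
To illustrate the idea, here is one possible way to account for both properties. It is only a sketch under stated assumptions, not the implementation in this PR: it splits free bytes rather than a generic block count, and every name in it (`blocks_per_window`, `free_bytes`, `tokens_per_block`, `head_dim`, `dtype_bytes`) is hypothetical.

```python
from collections import defaultdict


def blocks_per_window(layers, free_bytes, tokens_per_block, head_dim, dtype_bytes):
    """Assumed scheme: split free bytes by weight, then convert to blocks.

    layers: one (window_size, num_kv_heads) pair per attention layer.
    """
    # Bytes a single token occupies across the layers of each window size:
    # K and V (factor 2) for every KV head of every layer in that window.
    bytes_per_token = defaultdict(int)
    for window, kv_heads in layers:
        bytes_per_token[window] += 2 * kv_heads * head_dim * dtype_bytes

    # Weight = bytes needed to hold a full window of tokens. This reflects
    # both the KV heads per layer and the subset of layers per manager.
    weights = {w: w * b for w, b in bytes_per_token.items()}
    total_weight = sum(weights.values())

    blocks = {}
    for w, weight in weights.items():
        share = free_bytes * weight // total_weight
        # A block in this manager only spans this manager's layers.
        bytes_per_block = tokens_per_block * bytes_per_token[w]
        blocks[w] = share // bytes_per_block
    return blocks


# Two layers with window 1024 and 8 KV heads, one layer with window 4096 and 4.
example = blocks_per_window(
    [(1024, 8), (1024, 8), (4096, 4)],
    free_bytes=2**30, tokens_per_block=32, head_dim=128, dtype_bytes=2,
)
```

In this example both managers receive the same share of memory, but the second manager's blocks are four times cheaper, so it gets four times as many of them.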

@netanel-haber netanel-haber requested a review from a team as a code owner May 26, 2025 13:38
@netanel-haber netanel-haber requested a review from pcastonguay May 26, 2025 13:38
@netanel-haber netanel-haber requested review from Funatiq and removed request for pcastonguay May 26, 2025 13:47
@netanel-haber netanel-haber force-pushed the user/nhaber/fix-variable-window-size-underallocation branch from 5790310 to 5e9c4de Compare May 26, 2025 21:33
Signed-off-by: Netanel Haber <[email protected]>
@netanel-haber
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6506 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6506 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4762 completed with status: 'FAILURE'

@netanel-haber netanel-haber changed the title from "allocate blocks per window size correctly" to "Solve underallocation in VSWA+/VGQA" May 27, 2025
Signed-off-by: Netanel Haber <[email protected]>
@netanel-haber
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6629 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6629 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4847 completed with status: 'FAILURE'

Signed-off-by: Netanel Haber <[email protected]>
@netanel-haber
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6744 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4919 completed with status: 'FAILURE'

Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: Netanel Haber <[email protected]>
Signed-off-by: Netanel Haber <[email protected]>
@netanel-haber
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8327 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8327 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6031 completed with status: 'FAILURE'

@juney-nvidia juney-nvidia removed request for a team June 11, 2025 08:16
@Shixiaowei02 Shixiaowei02 requested a review from chuangz0 June 11, 2025 08:23
@netanel-haber
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8462 [ run ] triggered by Bot

@netanel-haber
Collaborator Author

/bot skip --comment "Successful CI has run multiple times and has failed due to flaky tests"

@netanel-haber
Collaborator Author

/bot skip --comment "Successful CI has run multiple times and has failed due to flaky tests"

@tensorrt-cicd
Collaborator

PR_Github #8595 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8596 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8595 [ skip ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #8462 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #8596 [ skip ] completed with state SUCCESS
Skipping testing for commit 37c8e56

@netanel-haber netanel-haber merged commit e692779 into NVIDIA:main Jun 12, 2025
3 checks passed
@qixiang-99 qixiang-99 mentioned this pull request Jun 12, 2025
6 tasks
@netanel-haber netanel-haber deleted the user/nhaber/fix-variable-window-size-underallocation branch July 1, 2025 09:55

8 participants