RFC: Page-Granular Free Path for PagedTokenToKVPoolAllocator #13023
Draft
keyboardAnt wants to merge 16 commits into sgl-project:main from keyboardAnt:feat/page-granular-free
+913
−527
… improve page management. Introduce debug mode for better memory tracking and enhance page allocation methods in PagedTokenToKVPoolAllocator and SWATokenToKVPoolAllocator. Update ChunkCache and RadixCache to streamline key handling and memory freeing processes.
Introduce the RadixKey class to manage token IDs and provide enhanced iteration and indexing capabilities. This change aims to streamline key handling within the radix_cache, ensuring compatibility with various input types while maintaining performance. Additionally, a helper function, get_child_key, is added for simplified access to child keys.
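For readers who have not opened the diff, here is a minimal sketch of what a key wrapper with iteration and indexing support could look like. It is an illustration only, not the class from this PR: the `token_ids` attribute, the slicing behavior, and the `page_size` parameter of `get_child_key` are all assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple, Union


class RadixKey:
    """Illustrative wrapper around a token-ID sequence (not the PR's actual class)."""

    def __init__(self, token_ids: Sequence[int]):
        # Assumption: the key normalizes any input sequence to a plain list.
        self.token_ids: List[int] = list(token_ids)

    def __len__(self) -> int:
        return len(self.token_ids)

    def __iter__(self) -> Iterator[int]:
        return iter(self.token_ids)

    def __getitem__(self, idx: Union[int, slice]) -> Union[int, "RadixKey"]:
        # Slicing returns another RadixKey so prefixes stay wrapped;
        # integer indexing returns the raw token ID.
        if isinstance(idx, slice):
            return RadixKey(self.token_ids[idx])
        return self.token_ids[idx]


def get_child_key(key: RadixKey, page_size: int = 1) -> Union[int, Tuple[int, ...]]:
    """Hypothetical helper: the lookup key for a child node.

    Assumption: with page_size == 1 the child is keyed by the first token,
    otherwise by the first full page of tokens.
    """
    if page_size == 1:
        return key.token_ids[0]
    return tuple(key.token_ids[:page_size])
```

Usage would be along the lines of `get_child_key(RadixKey([5, 6, 7, 8]), page_size=2)`, which under these assumptions yields a tuple of the first two token IDs.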
…and related classes to simplify memory allocation logic. This change enhances code clarity and maintains functionality across various components.
…ionality Introduce 'enable_metrics', 'eviction_policy', and 'is_eagle' parameters to the constructors of RadixCache and SWARadixCache. This update aims to improve cache management and provide additional configuration options for users, enhancing the overall flexibility of the memory cache system.
Update the memory cache allocators to replace 'free_page_ids' with 'free_pages' for consistency across the codebase. This change enhances the clarity of the API and ensures that page management is handled uniformly. Additionally, introduce a new test suite for the 'free_pages' method to validate its functionality and integration within the allocator system.
…ages This update adds a back-compatibility alias 'free_page_ids' to the BaseTokenToKVPoolAllocator class, allowing older call sites to function without modification. Additionally, the ChunkCache and common memory management functions are updated to utilize this new alias, improving consistency in page freeing methods across the codebase.
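One way such a back-compatibility alias can be wired up is sketched below. The classes `ToyAllocator` and `LoggingAllocator` are hypothetical stand-ins, not the `BaseTokenToKVPoolAllocator` code in this PR; only the method names `free_pages` and `free_page_ids` come from the commit messages. Defining the alias as a method that delegates at call time, rather than assigning it as an attribute, keeps subclass overrides working.

```python
from typing import Iterable, List


class ToyAllocator:
    """Toy stand-in for an allocator base class (hypothetical, not sglang code)."""

    def __init__(self) -> None:
        self.free_list: List[int] = []

    def free_pages(self, page_ids: Iterable[int]) -> None:
        # Primary entry point: return whole pages to the free list.
        self.free_list.extend(page_ids)

    def free_page_ids(self, page_ids: Iterable[int]) -> None:
        # Back-compat alias: delegate at call time so that subclass
        # overrides of free_pages are still honored by old call sites.
        self.free_pages(page_ids)


class LoggingAllocator(ToyAllocator):
    """Subclass that overrides only the primary method."""

    def __init__(self) -> None:
        super().__init__()
        self.freed_batches = 0

    def free_pages(self, page_ids: Iterable[int]) -> None:
        self.freed_batches += 1
        super().free_pages(page_ids)
```

An old call site that still uses `free_page_ids` on a `LoggingAllocator` goes through the overridden `free_pages`, so the two names cannot drift apart.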
…, unrelated tests) to keep PR #1 minimal
…classes to streamline memory allocation logic. This change enhances code clarity and maintains functionality across various components.
…locators This update modifies the page ID calculations in various classes, including PagedTokenToKVPoolAllocator, RadixCache, and SWARadixCache, to ensure correct indexing by adding 1 to the division result. Additionally, it introduces idle checks in the SchedulerRuntimeCheckerMixin to skip checks when a batch is in-flight, improving performance during active processing.
…ibute shadowing This update modifies the calls to the free_pages method within the BaseTokenToKVPoolAllocator class and its subclasses to prevent instance attribute shadowing. The changes ensure that the method is called correctly, maintaining functionality while improving code clarity.
…ning cache memory checks and adding detailed debug diagnostics. The update improves accuracy in identifying memory leaks by accounting for protected sizes and introducing tolerance for transient conditions, while also providing additional logging for better troubleshooting.
…requests by adding a new parameter. This update improves flexibility in handling unfinished requests while maintaining existing functionality.
… parameters This update simplifies the `alloc_extend` and `alloc_decode` methods in the memory cache allocators by removing unnecessary CPU tensor parameters. The changes enhance code clarity and maintain functionality by ensuring that only device tensors are passed, streamlining the allocation process.
…ge_ids This update modifies the memory management methods across various classes, including RadixCache, SWARadixCache, and common.py, to use the newly introduced free_page_ids method instead of free_pages. This change enhances code clarity and maintains functionality while ensuring consistency in page freeing operations throughout the codebase.
This update introduces deduplication of page IDs to prevent double-free attempts and adds checks to ensure only currently allocated pages are freed. These changes enhance the robustness of memory management within the allocator, maintaining functionality while improving error handling in debug mode.
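The deduplication and allocated-only checks described in this commit can be sketched roughly as follows. This is a simplified pure-Python model under stated assumptions: the real allocators operate on device tensors, and the names `ToyPageAllocator`, `alloc_pages`, `allocated`, and `debug_mode` are hypothetical. The choice to raise in debug mode and silently skip otherwise is also an assumption, not confirmed by the commit message.

```python
from typing import Iterable, List, Set


class ToyPageAllocator:
    """Simplified page allocator model (hypothetical, list/set based)."""

    def __init__(self, num_pages: int, debug_mode: bool = False):
        # Assumption: page IDs are 1-based in this toy model.
        self.free_list: List[int] = list(range(1, num_pages + 1))
        self.allocated: Set[int] = set()
        self.debug_mode = debug_mode

    def alloc_pages(self, n: int) -> List[int]:
        if n > len(self.free_list):
            raise MemoryError("not enough free pages")
        pages, self.free_list = self.free_list[:n], self.free_list[n:]
        self.allocated.update(pages)
        return pages

    def free_page_ids(self, page_ids: Iterable[int]) -> None:
        # Deduplicate while preserving order, so a repeated ID in one
        # call cannot be returned to the free list twice.
        unique = list(dict.fromkeys(page_ids))
        stale = [p for p in unique if p not in self.allocated]
        if stale and self.debug_mode:
            # Assumption: debug mode surfaces the problem loudly.
            raise ValueError(f"freeing pages that are not allocated: {stale}")
        for p in unique:
            # Only pages that are currently allocated are actually freed.
            if p in self.allocated:
                self.allocated.remove(p)
                self.free_list.append(p)
```

With this shape, calling `free_page_ids` twice on the same page is a no-op the second time outside debug mode, and an error in debug mode.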
…dixCache by adding detailed logging for memory management operations. This update introduces additional debug information for staged frees and decode-boundary slack estimates in SchedulerRuntimeCheckerMixin, as well as validation flags and reporting in RadixCache, improving observability and troubleshooting for memory-related issues.
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
On L40SL:
meta-llama/Meta-Llama-3.1-8B-Instruct: +% throughput.
before (`main` on commit `012bfc4fd`):
after:
Checklist