pd with nixl backend (rebase main) #1002

base: main
Summary of Changes

Hello @kingder, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I'm introducing a new `nixl` backend to LightLLM, which significantly enhances our distributed inference capabilities. This change allows for efficient KV cache transfer between prefill and decode nodes using the `nixl` library, addressing performance bottlenecks in large-scale model serving. It integrates `nixl` into our existing P/D split architecture, providing a specialized and optimized path for distributed operations.

Highlights

- Nixl Backend Integration: New `nixl_prefill` and `nixl_decode` run modes have been introduced, enabling distributed KV cache management and transfer via the `nixl` library for enhanced performance.
- Paged KV Cache Movement: A paged KV cache mechanism has been implemented for `nixl` modes, allowing for efficient, chunked transfer of KV states between prefill and decode nodes.
- Dedicated Communication Channels: Dedicated ZMQ-based communication channels and ports have been added for `nixl` prefill and meta-exchange, ensuring robust and high-throughput inter-node communication.
- Docker Support: New Dockerfiles (`Dockerfile.nixl`, `Dockerfile.nixl.deepep`) are provided to simplify the setup of environments with `nixl`, `openucx`, `nvshmem`, and `DeepEP`.
- Enhanced Request Management: Existing request objects (`PDNIXLChunkedPrefillReq`) and command types (`NIXLRemotePrefillDoneCmd`) have been extended to manage `nixl`-specific distributed states and remote prefill completion signals.
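The paged KV cache movement described above can be sketched generically. The following toy example is illustrative only (names like `split_into_pages` and `send_page` are invented for this sketch, not the PR's actual API): a flat KV buffer is cut into fixed-size pages so each page can be shipped independently.

```python
# Toy illustration of paged/chunked KV transfer; NOT the PR's actual API.
from typing import Callable, List


def split_into_pages(kv_buffer: List[float], page_size: int) -> List[List[float]]:
    """Split a flat KV buffer into fixed-size pages (the last page may be short)."""
    return [kv_buffer[i:i + page_size] for i in range(0, len(kv_buffer), page_size)]


def transfer_paged(
    kv_buffer: List[float],
    page_size: int,
    send_page: Callable[[List[float]], None],
) -> int:
    """Send the buffer page by page; returns the number of pages sent."""
    pages = split_into_pages(kv_buffer, page_size)
    for page in pages:
        # In a real backend each page would be a remote write (e.g. RDMA),
        # letting the decode node start consuming before the whole cache lands.
        send_page(page)
    return len(pages)
```

Chunking at page granularity is what lets prefill and transfer overlap: the receiver can register fixed-size regions once and reuse them for every page.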
Code Review

This pull request introduces a new backend for prefill/decode (PD) separation using the `nixl` library, which appears to be a communication library for high-performance networking such as RDMA. The changes are extensive, adding new Dockerfiles, run modes, and a significant amount of logic for handling remote prefill and KV cache transport. While the overall direction is clear, several areas need attention, particularly Docker best practices, code maintainability, and correctness in the new asynchronous logic. I've identified some potential bugs, such as a blocking `time.sleep` in an async function and incomplete state-update logic, which are critical to address. There are also opportunities to improve the Dockerfiles for better security and reproducibility.
```python
del self.remote_prefill_requests[req_id]
del self.inflght_transfer_requests[req_id]
```

```python
time.sleep(PDNIXLBackendBase._THREAD_WAIT_INTERVAL)
```
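The review flags this blocking sleep. The usual fix, when the wait happens inside a coroutine, is `asyncio.sleep`, which yields to the event loop instead of stalling it. A minimal sketch (the polling loop and names here are illustrative, not the PR's code):

```python
# Sketch: replace a blocking time.sleep inside a coroutine with asyncio.sleep,
# so other tasks on the event loop keep running while this one waits.
import asyncio

WAIT_INTERVAL = 0.01  # illustrative stand-in for _THREAD_WAIT_INTERVAL


async def poll_until_done(is_done, timeout: float = 1.0) -> bool:
    """Poll a condition without blocking the event loop.

    Returns True if the condition became true within `timeout`.
    """
    elapsed = 0.0
    while elapsed < timeout:
        if is_done():
            return True
        await asyncio.sleep(WAIT_INTERVAL)  # yields control, unlike time.sleep
        elapsed += WAIT_INTERVAL
    return False
```

If the sleep instead lives on a dedicated worker thread (not a coroutine), a blocking sleep is fine; the bug only bites when `time.sleep` runs on the event-loop thread.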
```python
# (TODO) figure out how to update req_to_next_token_ids
# req.cur_output_len += 1

# pack = InferReqUpdatePack(req, req.cur_output_len)
# pack.handle(
#     token_id,
#     token_logprob,
#     eos_ids=self.eos_id,
#     extra_post_req_handle_func=None,
#     is_master_in_dp=self.is_master_in_dp,
#     call_post_handle_for_chunk=False
# )
return token_id

def _decode_filter_reqs(
    self, prefill_reqs: List[InferReq], decode_reqs: List[InferReq]
):
    new_prefill_reqs: List[InferReq] = []
    remote_prefill_reqs: List[InferReq] = []
    failed_prefill_reqs: List[InferReq] = []
    next_token_ids: List[int] = []
    rpd_reqs: List[InferReq] = []

    for req in prefill_reqs:
        if req.in_prefill_or_transfer:
            if req.infer_nixl_rpd:
                shm_req: PDNIXLChunkedPrefillReq = req.shm_req
                # state is updated by router
                state = shm_req.get_pd_req_state()
                if state == RemoteTransferStatusType.SUCCESS.value:  # success
                    next_token_ids.append(self._post_remote_prefill(req))
                    rpd_reqs.append(req)
                elif state == RemoteTransferStatusType.FAILED.value:
                    self._post_remote_prefill(req, False)
                    failed_prefill_reqs.append(req)
                else:
                    logger.warning(f"remote prefill request {shm_req.group_req_id} unexpected state {state}")
            else:
                remote_prefill_reqs.append(req)
        else:
            new_prefill_reqs.append(req)

    if rpd_reqs:
        # g_infer_context.req_manager.req_sampling_params_manager.update_reqs_token_counter(
        #     rpd_reqs,
        #     next_token_ids,
        # )
        decode_reqs.extend(rpd_reqs)
```
The logic for updating request state after remote prefill seems incomplete. In `_post_remote_prefill`, the logic to update `req_to_next_token_ids` is commented out with a `TODO`. In `_decode_filter_reqs`, `update_reqs_token_counter` is also commented out. This will lead to incorrect state tracking for requests, potentially causing bugs in token generation, length checks, and sampling.
```dockerfile
RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*
```
This `RUN` command has a couple of issues that go against Docker best practices:

- `chmod 777 -R /tmp`: This is insecure; it strips the sticky bit from `/tmp` (which is normally mode `1777`) and makes every existing file under it world-writable.
- Multiple `apt-get update` calls: This Dockerfile contains multiple `RUN apt-get update` commands (here and on lines 42, 44, 69). This is inefficient and can lead to caching problems.

It's recommended to address these for a more secure and efficient Docker image.
```dockerfile
RUN chmod 1777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*
```
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124

RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .
```
Cloning from the `main` branch for `LightKernel` (here) and `nixl` (L72) makes the build non-reproducible, as `main` can change at any time. This can lead to unexpected behavior or build failures in the future.
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && git checkout <SPECIFIC_COMMIT_OR_TAG> && pip install --no-deps -v .
```
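One common way to make such a pin easy to audit and bump is a build argument; a sketch (the `<SPECIFIC_COMMIT_OR_TAG>` placeholder is kept deliberately, and `LIGHTKERNEL_REF` is an illustrative name, not something this PR defines):

```dockerfile
# Sketch: pin the clone via a build ARG so the revision is explicit and can be
# overridden at build time: docker build --build-arg LIGHTKERNEL_REF=<rev> ...
ARG LIGHTKERNEL_REF=<SPECIFIC_COMMIT_OR_TAG>
RUN --mount=type=cache,target=/root/.cache/pip \
    git clone https://github.com/ModelTC/LightKernel.git && \
    cd LightKernel && \
    git checkout "${LIGHTKERNEL_REF}" && \
    pip install --no-deps -v .
```

This keeps the default build reproducible while still allowing CI to test newer revisions without editing the Dockerfile.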
```dockerfile
RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*
```
This `RUN` command has a couple of issues that go against Docker best practices:

- `chmod 777 -R /tmp`: This is insecure; it strips the sticky bit from `/tmp` (which is normally mode `1777`) and makes every existing file under it world-writable.
- Multiple `apt-get update` calls: This Dockerfile contains multiple `RUN apt-get update` commands (here and on lines 44, 82, 107). This is inefficient and can lead to caching problems.

It's recommended to address these for a more secure and efficient Docker image.
```dockerfile
RUN chmod 1777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*
```
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124

RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .
```
Cloning from the `main` branch for `LightKernel` (here) and `nixl` (L110) makes the build non-reproducible, as `main` can change at any time. This can lead to unexpected behavior or build failures in the future.
```dockerfile
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && git checkout <SPECIFIC_COMMIT_OR_TAG> && pip install --no-deps -v .
```
```python
if isinstance(self, MemoryManager) and type(self) != MemoryManager:
    raise NotImplementedError("subclass need reimpl this method")
```
The check `isinstance(self, MemoryManager) and type(self) != MemoryManager` to enforce subclass implementation is fragile and not the standard Python way. It will not work for subclasses of subclasses (e.g., `class B(A)` where `class A(MemoryManager)`). Also, there's a typo: "reimpl".

Using Python's `abc` module to define `MemoryManager` as an abstract base class and `alloc_paged_kv_move_buffer` as an abstract method would be a more robust approach.
```diff
-if isinstance(self, MemoryManager) and type(self) != MemoryManager:
-    raise NotImplementedError("subclass need reimpl this method")
+if type(self) is not MemoryManager:
+    raise NotImplementedError("subclass need to reimplement this method")
```
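The `abc`-based approach the review recommends might look like the following sketch (the class and method names follow the review's comment; the subclass and its buffer logic are purely illustrative, not the PR's implementation):

```python
# Sketch of the abc-based enforcement suggested above; the subclass body is
# illustrative only.
from abc import ABC, abstractmethod
from typing import List, Optional


class MemoryManager(ABC):
    @abstractmethod
    def alloc_paged_kv_move_buffer(self, num_pages: int) -> List[Optional[object]]:
        """Subclasses must implement paged KV move-buffer allocation."""
        raise NotImplementedError


class GpuMemoryManager(MemoryManager):  # hypothetical subclass for illustration
    def alloc_paged_kv_move_buffer(self, num_pages: int) -> List[Optional[object]]:
        # Stand-in for real page allocation (e.g. pinned or device memory).
        return [None] * num_pages
```

The key difference: forgetting the override fails fast with a `TypeError` at instantiation time, instead of a `NotImplementedError` surfacing only when the method is finally called, and it works correctly at any depth of subclassing.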
No description provided.