Skip to content

pd with nixl backend (rebase main) #1002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Conversation

kingder
Copy link
Collaborator

@kingder kingder commented Aug 7, 2025

No description provided.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @kingder, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I'm introducing a new nixl backend to LightLLM, which significantly enhances our distributed inference capabilities. This change allows for efficient KV cache transfer between prefill and decode nodes using the nixl library, addressing performance bottlenecks in large-scale model serving. It integrates nixl into our existing P/D split architecture, providing a specialized and optimized path for distributed operations.

Highlights

  • Nixl Backend Integration: New nixl_prefill and nixl_decode run modes have been introduced, enabling distributed KV cache management and transfer via the nixl library for enhanced performance.
  • Paged KV Cache Movement: A paged KV cache mechanism has been implemented for nixl modes, allowing for efficient, chunked transfer of KV states between prefill and decode nodes.
  • Dedicated Communication Channels: Dedicated ZMQ-based communication channels and ports have been added for nixl prefill and meta-exchange, ensuring robust and high-throughput inter-node communication.
  • Docker Support: New Dockerfiles (Dockerfile.nixl, Dockerfile.nixl.deepep) are provided to simplify the setup of environments with nixl, openucx, nvshmem, and DeepEP.
  • Enhanced Request Management: Existing request objects (PDNIXLChunkedPrefillReq) and command types (NIXLRemotePrefillDoneCmd) have been extended to manage nixl-specific distributed states and remote prefill completion signals.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new backend for prefill/decode (PD) separation using the nixl library, which appears to be a communication library for high-performance networking like RDMA. The changes are extensive, adding new Dockerfiles, run modes, and a significant amount of logic for handling remote prefill and KV cache transport. While the overall direction is clear, there are several areas that need attention, particularly concerning Docker best practices, code maintainability, and correctness in the new asynchronous logic. I've identified some potential bugs, such as using a blocking time.sleep in an async function and incomplete state update logic, which are critical to address. There are also opportunities to improve the Dockerfiles for better security and reproducibility.

del self.remote_prefill_requests[req_id]
del self.inflght_transfer_requests[req_id]

time.sleep(PDNIXLBackendBase._THREAD_WAIT_INTERVAL)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

time.sleep() is a blocking call and should not be used in an async function as it will block the entire event loop. This can severely degrade performance.

Suggested change
time.sleep(PDNIXLBackendBase._THREAD_WAIT_INTERVAL)
await asyncio.sleep(PDNIXLBackendBase._THREAD_WAIT_INTERVAL)

Comment on lines 428 to 475
#(TODO) figure out how to update req_to_next_token_ids
# req.cur_output_len += 1

# pack = InferReqUpdatePack(req, req.cur_output_len)
# pack.handle(
# token_id,
# token_logprob,
# eos_ids=self.eos_id,
# extra_post_req_handle_func=None,
# is_master_in_dp=self.is_master_in_dp,
# call_post_handle_for_chunk=False
# )
return token_id

def _decode_filter_reqs(
self, prefill_reqs: List[InferReq], decode_reqs: List[InferReq]
):
new_prefill_reqs: List[InferReq] = []
remote_prefill_reqs: List[InferReq] = []
failed_prefill_reqs: List[InferReq] = []
next_token_ids: List[int] = []
rpd_reqs: List[InferReq] = []

for req in prefill_reqs:
if req.in_prefill_or_transfer:
if req.infer_nixl_rpd:
shm_req: PDNIXLChunkedPrefillReq = req.shm_req
# state is updated by router
state = shm_req.get_pd_req_state()
if state == RemoteTransferStatusType.SUCCESS.value: # success
next_token_ids.append(self._post_remote_prefill(req))
rpd_reqs.append(req)
elif state == RemoteTransferStatusType.FAILED.value:
self._post_remote_prefill(req, False)
failed_prefill_reqs.append(req)
else:
logger.warning(f"remote prefill request {shm_req.group_req_id} unexpected state {state}")
else:
remote_prefill_reqs.append(req)
else:
new_prefill_reqs.append(req)

if rpd_reqs:
# g_infer_context.req_manager.req_sampling_params_manager.update_reqs_token_counter(
# rpd_reqs,
# next_token_ids,
# )
decode_reqs.extend(rpd_reqs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic for updating request state after remote prefill seems incomplete. In _post_remote_prefill, the logic to update req_to_next_token_ids is commented out with a TODO. In _decode_filter_reqs, update_reqs_token_counter is also commented out. This will lead to incorrect state tracking for requests, potentially causing bugs in token generation, length checks, and sampling.

Comment on lines +9 to +16
RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ca-certificates \
libssl-dev \
curl \
g++ \
make \
git && \
rm -rf /var/lib/apt/lists/*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This RUN command has a couple of issues that go against Docker best practices:

  1. chmod 777 -R /tmp: This is insecure as it gives world-writable permissions to the /tmp directory.
  2. Multiple apt-get update calls: This Dockerfile contains multiple RUN apt-get update commands (here and on lines 42, 44, 69). This is inefficient and can lead to caching problems.

It's recommended to address these for a more secure and efficient Docker image.

RUN chmod 1777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124

RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Cloning from the main branch for LightKernel (here) and nixl (L72) makes the build non-reproducible, as main can change at any time. This can lead to unexpected behavior or build failures in the future.

RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && git checkout <SPECIFIC_COMMIT_OR_TAG> && pip install --no-deps -v .

Comment on lines +11 to +18
RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
ca-certificates \
libssl-dev \
curl \
g++ \
make \
git && \
rm -rf /var/lib/apt/lists/*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This RUN command has a couple of issues that go against Docker best practices:

  1. chmod 777 -R /tmp: This is insecure as it gives world-writable permissions to the /tmp directory.
  2. Multiple apt-get update calls: This Dockerfile contains multiple RUN apt-get update commands (here and on lines 44, 82, 107). This is inefficient and can lead to caching problems.

It's recommended to address these for a more secure and efficient Docker image.

RUN chmod 1777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    libssl-dev \
    curl \
    g++ \
    make \
    git && \
    rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124

RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Cloning from the main branch for LightKernel (here) and nixl (L110) makes the build non-reproducible, as main can change at any time. This can lead to unexpected behavior or build failures in the future.

RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && git checkout <SPECIFIC_COMMIT_OR_TAG> && pip install --no-deps -v .

Comment on lines +100 to +101
if isinstance(self, MemoryManager) and type(self) != MemoryManager:
raise NotImplementedError("subclass need reimpl this method")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check isinstance(self, MemoryManager) and type(self) != MemoryManager to enforce subclass implementation is fragile and not the standard Python way. It will not work for subclasses of subclasses (e.g., class B(A) where class A(MemoryManager)). Also, there's a typo "reimpl".

Using Python's abc module to define MemoryManager as an abstract base class and alloc_paged_kv_move_buffer as an abstract method would be a more robust approach.

Suggested change
if isinstance(self, MemoryManager) and type(self) != MemoryManager:
raise NotImplementedError("subclass need reimpl this method")
if type(self) is not MemoryManager:
raise NotImplementedError("subclass need to reimplement this method")

Weichao Luo added 2 commits August 7, 2025 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants