pd with nixl backend (rebase main) #1002
@@ -0,0 +1,83 @@
+ARG CUDA_VERSION=12.6.1
+FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu22.04
+ARG PYTHON_VERSION=3.10
+ARG MAMBA_VERSION=24.7.1-0
+ARG TARGETPLATFORM
+ENV PATH=/opt/conda/bin:$PATH \
+    CONDA_PREFIX=/opt/conda
+
+RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+    ca-certificates \
+    libssl-dev \
+    curl \
+    g++ \
+    make \
+    git && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN case ${TARGETPLATFORM} in \
+    "linux/arm64") MAMBA_ARCH=aarch64 ;; \
+    *) MAMBA_ARCH=x86_64 ;; \
+    esac && \
+    curl -fsSL -o ~/mambaforge.sh -v "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh" && \
+    bash ~/mambaforge.sh -b -p /opt/conda && \
+    rm ~/mambaforge.sh
+
+RUN case ${TARGETPLATFORM} in \
+    "linux/arm64") exit 1 ;; \
+    *) /opt/conda/bin/conda update -y conda && \
+    /opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
+    esac && \
+    /opt/conda/bin/conda clean -ya
+
+
+WORKDIR /root
+
+COPY ./requirements.txt /lightllm/requirements.txt
+RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124
+
+RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .
Cloning from the […]
+
+RUN apt-get update && apt-get install -y libnuma-dev # for sgl_kernel
+
+RUN apt-get update && apt-get install -y cmake automake autotools-dev libtool libz-dev && \
+    DEBIAN_FRONTEND=noninteractive apt-get -y install --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev; \
+    rm -rf /usr/lib/ucx && \
+    rm -rf /opt/hpcx/ucx && \
+    cd /usr/local/src && \
+    git clone https://github.com/openucx/ucx.git && \
+    cd ucx && \
+    git checkout v1.19.x && \
+    ./autogen.sh && ./configure \
+    --enable-shared \
+    --disable-static \
+    --disable-doxygen-doc \
+    --enable-optimizations \
+    --enable-cma \
+    --enable-devel-headers \
+    --with-cuda=/usr/local/cuda \
+    --with-verbs=yes \
+    --with-dm \
+    --with-gdrcopy=/usr/local \
+    --with-efa \
+    --enable-mt && \
+    make -j && \
+    make -j install-strip && \
+    ldconfig;
+
+RUN apt-get update && apt-get install -y pkg-config tmux net-tools; \
+    cd /usr/local/src; \
+    pip install --upgrade meson pybind11 patchelf; \
+    git clone https://github.com/ai-dynamo/nixl.git -b main && \
+    cd nixl && \
+    rm -rf build && \
+    mkdir build && \
+    meson setup build/ --prefix=/usr/local/nixl --buildtype=release && \
+    cd build && \
+    ninja && \
+    ninja install && \
+    cd .. && pip install . --no-deps;
+
+COPY . /lightllm
+RUN pip install -e /lightllm --no-cache-dir
@@ -0,0 +1,121 @@
+ARG CUDA_VERSION=12.6.1
+FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu22.04
+
+ARG PYTHON_VERSION=3.10
+ARG MAMBA_VERSION=24.7.1-0
+ARG TARGETPLATFORM
+
+ENV PATH=/opt/conda/bin:$PATH \
+    CONDA_PREFIX=/opt/conda
+
+RUN chmod 777 -R /tmp && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
+    ca-certificates \
+    libssl-dev \
+    curl \
+    g++ \
+    make \
+    git && \
+    rm -rf /var/lib/apt/lists/*
Comment on lines +11 to +18
This `RUN` command has a couple of issues that go against Docker best practices: `chmod 777 -R /tmp` makes `/tmp` world-writable, and the Dockerfile runs `apt-get update` in multiple `RUN` commands. It's recommended to address these for a more secure and efficient Docker image.
+
+RUN case ${TARGETPLATFORM} in \
+    "linux/arm64") MAMBA_ARCH=aarch64 ;; \
+    *) MAMBA_ARCH=x86_64 ;; \
+    esac && \
+    curl -fsSL -o ~/mambaforge.sh -v "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh" && \
+    bash ~/mambaforge.sh -b -p /opt/conda && \
+    rm ~/mambaforge.sh
+
+RUN case ${TARGETPLATFORM} in \
+    "linux/arm64") exit 1 ;; \
+    *) /opt/conda/bin/conda update -y conda && \
+    /opt/conda/bin/conda install -y "python=${PYTHON_VERSION}" ;; \
+    esac && \
+    /opt/conda/bin/conda clean -ya
+
+
+WORKDIR /root
+
+COPY ./requirements.txt /lightllm/requirements.txt
+RUN --mount=type=cache,target=/root/.cache/pip pip install -r /lightllm/requirements.txt --ignore-installed --extra-index-url https://download.pytorch.org/whl/cu124
+
+RUN --mount=type=cache,target=/root/.cache/pip pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+RUN --mount=type=cache,target=/root/.cache/pip git clone https://github.com/ModelTC/LightKernel.git && cd LightKernel && pip install --no-deps -v .
Cloning from the […]
+
+RUN apt-get update && apt-get install -y libnuma-dev wget devscripts debhelper dh-make build-essential dkms
+RUN apt-get install -y ibverbs-providers infiniband-diags perftest rdma-core libibverbs-dev librdmacm-dev
+
+ENV CUDA_HOME=/usr/local/cuda \
+    GDRCOPY_HOME=/usr/src/gdrdrv-2.4.4/
+
+RUN mkdir -p /tmp/gdrcopy && cd /tmp \
+    && git clone https://github.com/NVIDIA/gdrcopy.git -b v2.4.4 \
+    && cd gdrcopy/packages \
+    && CUDA=/usr/local/cuda ./build-deb-packages.sh \
+    && dpkg -i gdrdrv-dkms_*.deb libgdrapi_*.deb gdrcopy-tests_*.deb gdrcopy_*.deb \
+    && cd / && rm -rf /tmp/gdrcopy
+
+# Fix DeepEP IBGDA symlink
+RUN ln -sf /usr/lib/x86_64-linux-gnu/libmlx5.so.1 /usr/lib/x86_64-linux-gnu/libmlx5.so
+
+RUN wget https://developer.download.nvidia.com/compute/redist/nvshmem/3.3.9/source/nvshmem_src_cuda12-all-all-3.3.9.tar.gz \
+    && tar -xf nvshmem_src_cuda12-all-all-3.3.9.tar.gz && mv nvshmem_src nvshmem \
+    && cd nvshmem \
+    && rm -f /root/nvshmem_src_cuda12-all-all-3.3.9.tar.gz \
+    && NVSHMEM_SHMEM_SUPPORT=0 \
+    NVSHMEM_UCX_SUPPORT=0 \
+    NVSHMEM_USE_NCCL=0 \
+    NVSHMEM_MPI_SUPPORT=0 \
+    NVSHMEM_IBGDA_SUPPORT=1 \
+    NVSHMEM_PMIX_SUPPORT=0 \
+    NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
+    NVSHMEM_USE_GDRCOPY=1 \
+    cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/root/nvshmem/install -DCMAKE_CUDA_ARCHITECTURES=90 \
+    && cmake --build build --target install -j64
+
+ARG DEEPEP_COMMIT=b6ce310bb0b75079682d09bc2ebc063a074fbd58
+RUN git clone https://github.com/deepseek-ai/DeepEP.git && cd DeepEP && git checkout ${DEEPEP_COMMIT} && cd ..
+
+WORKDIR /root/DeepEP
+ENV NVSHMEM_DIR=/root/nvshmem/install
+RUN NVSHMEM_DIR=/root/nvshmem/install python setup.py install
+
+RUN apt-get update && apt-get install -y cmake automake autotools-dev libtool libz-dev && \
+    DEBIAN_FRONTEND=noninteractive apt-get -y install --reinstall libibverbs-dev rdma-core ibverbs-utils libibumad-dev; \
+    rm -rf /usr/lib/ucx && \
+    rm -rf /opt/hpcx/ucx && \
+    cd /usr/local/src && \
+    git clone https://github.com/openucx/ucx.git && \
+    cd ucx && \
+    git checkout v1.19.x && \
+    ./autogen.sh && ./configure \
+    --enable-shared \
+    --disable-static \
+    --disable-doxygen-doc \
+    --enable-optimizations \
+    --enable-cma \
+    --enable-devel-headers \
+    --with-cuda=/usr/local/cuda \
+    --with-verbs=yes \
+    --with-dm \
+    --with-gdrcopy=/usr/local \
+    --with-efa \
+    --enable-mt && \
+    make -j && \
+    make -j install-strip && \
+    ldconfig;
+
+RUN apt-get update && apt-get install -y pkg-config tmux net-tools ; \
+    cd /usr/local/src; \
+    pip install --upgrade meson pybind11 patchelf; \
+    git clone https://github.com/ai-dynamo/nixl.git -b main && \
+    cd nixl && \
+    rm -rf build && \
+    mkdir build && \
+    meson setup build/ --prefix=/usr/local/nixl --buildtype=release && \
+    cd build && \
+    ninja && \
+    ninja install && \
+    cd .. && pip install . --no-deps;
+
+COPY . /lightllm
+RUN pip install -e /lightllm --no-cache-dir
@@ -96,6 +96,14 @@ def alloc_kv_move_buffer(self, max_req_total_len):
         self.token_dim_size = self.kv_move_buffer.shape[-2] * self.kv_move_buffer.shape[-1]
         return

+    def alloc_paged_kv_move_buffer(self, page_num, page_size):
+        if isinstance(self, MemoryManager) and type(self) != MemoryManager:
+            raise NotImplementedError("subclass need reimpl this method")
Comment on lines +100 to +101
The check […] Using Python's […]
+        self.kv_move_buffer = torch.empty(
+            (page_num, page_size, self.layer_num, 2 * self.head_num, self.head_dim), dtype=self.dtype, device="cuda"
+        )
+        return
+
     def send_to_decode_node(
         self,
         move_tasks: List[KVMoveTask],
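To make the behavior of `alloc_paged_kv_move_buffer` easier to reason about, here is a minimal torch-free sketch. It is not the PR's implementation: the stand-in class, the `dtype_bytes` parameter, and the example sizes are assumptions made for illustration. It shows two things: in an ordinary method call `isinstance(self, MemoryManager)` is already true, so the guard effectively reduces to an exact-type check, and the buffer shape from the diff translates directly into a memory footprint.

```python
import math


class MemoryManager:
    """Torch-free stand-in for the class in the diff (hypothetical, for illustration)."""

    def __init__(self, layer_num, head_num, head_dim, dtype_bytes=2):
        self.layer_num = layer_num
        self.head_num = head_num
        self.head_dim = head_dim
        self.dtype_bytes = dtype_bytes  # assumption: 2 bytes per element (fp16/bf16)

    def alloc_paged_kv_move_buffer(self, page_num, page_size):
        # In a normal method call self is always a MemoryManager instance,
        # so the guard in the diff comes down to this exact-type check.
        if type(self) is not MemoryManager:
            raise NotImplementedError("subclass need reimpl this method")
        # Same shape tuple as the torch.empty(...) call in the diff.
        shape = (page_num, page_size, self.layer_num, 2 * self.head_num, self.head_dim)
        # Return the shape and its size in bytes instead of allocating on a GPU.
        return shape, math.prod(shape) * self.dtype_bytes


class SubclassWithoutOverride(MemoryManager):
    """Hypothetical subclass that does not reimplement the method."""


manager = MemoryManager(layer_num=32, head_num=8, head_dim=128)
shape, nbytes = manager.alloc_paged_kv_move_buffer(page_num=16, page_size=64)
print(shape, nbytes)  # (16, 64, 32, 16, 128) 134217728, i.e. 128 MiB
```

Calling the method on `SubclassWithoutOverride` raises `NotImplementedError`, which is the behavior the guard in the diff is meant to enforce for subclasses that have not provided their own paged buffer layout.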
@@ -1,5 +1,5 @@
 from .sampling_params import SamplingParams
-from .req import Req, FinishStatus
+from .req import Req, FinishStatus, PDNIXLChunkedPrefillReq
 from .shm_req_manager import ShmReqManager
 from .rpc_shm import RpcShmParams, RpcShmResults, ShmSyncStatusArray
 from .start_args_type import StartArgs
@@ -1 +1 @@
-from .group_req import GroupReqIndexes, GroupReqObjs, AbortedReqCmd
+from .group_req import GroupReqIndexes, GroupReqObjs, AbortedReqCmd, NIXLRemotePrefillDoneCmd, ReqCmd
This `RUN` command has a couple of issues that go against Docker best practices:

- `chmod 777 -R /tmp`: This is insecure as it gives world-writable permissions to the `/tmp` directory.
- `apt-get update` calls: This Dockerfile contains multiple `RUN apt-get update` commands (here and on lines 42, 44, 69). This is inefficient and can lead to caching problems.

It's recommended to address these for a more secure and efficient Docker image.
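The two points in this comment can be checked mechanically. The sketch below is a rough line-based heuristic, not a Dockerfile parser: the function name `lint_dockerfile`, the regular expressions, and the sample input are all assumptions made for illustration, and it does not join backslash continuations or understand build stages.

```python
import re


def lint_dockerfile(text):
    """Return (line_number, message) findings for two simple checks."""
    findings = []
    update_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Flag world-writable permission changes such as `chmod 777 -R /tmp`.
        if re.search(r"\bchmod\s+(-R\s+)?777\b", line):
            findings.append((number, "world-writable chmod 777"))
        # Remember every line that refreshes the apt package index.
        if re.search(r"\bapt-get\s+update\b", line):
            update_lines.append(number)
    # The first `apt-get update` is expected; report only the repeats.
    for number in update_lines[1:]:
        findings.append((number, "repeated apt-get update"))
    return findings


sample = """\
RUN chmod 777 -R /tmp && apt-get update && apt-get install -y curl
RUN apt-get update && apt-get install -y libnuma-dev
"""

for number, message in lint_dockerfile(sample):
    print(f"line {number}: {message}")
```

A check like this could run in CI before the image build, but a dedicated Dockerfile linter would be a more robust choice than hand-written patterns.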