This is the repository for the paper [Efficiently Serving Large Multimodal Models Using EPD Disaggregation](https://arxiv.org/abs/2501.05460), accepted at ICML 2025.
Link to Huawei AI Gallery Notebook: [AI Gallery]
EPDServe implements the **Encode-Prefill-Decode (EPD) Disaggregation** architecture proposed in the EPD paper. It is designed to serve large multimodal models (LMMs) efficiently by **splitting the inference pipeline** into three independent stages:
- Encoding Stage: Processes multimodal inputs (images, audio, video)
- Context (Prefill) Stage: Handles prompt token prefill
- Decoding Stage: Performs autoregressive token generation
Each stage operates independently with its own compute resources, scheduler, cache manager, and GPU workers.
Key features:
- Disaggregated multimodal LMM inference
- Independent scaling of encoding, prefill, and decoding
- Intra-request parallelism for distributing heavy encoding loads
- Monitoring of delays and bottlenecks at each stage (via the load estimator)
- CUDA-level IPC utilities (csrc/) for fast, asynchronous token and model-weight transfer
- Dynamic GPU resource reallocation and role switching
- Minimized latency and maximized throughput (TTFT / E2ETP)
It is highly modular, extensible, and tightly aligned with the goals of reducing TTFT and maximizing E2ETP.
```
+---------------------+
|     API Frontend    |
+----------+----------+
           |
  [EPDOrchestrator]  [Load Estimator]  [Resource Allocator]
           |
           +---------------------+---------------------+
           |                     |                     |
[EncodingStageCluster] [ContextStageCluster] [DecodingStageCluster]
           |                     |                     |
[EncodingStageEngines] [ContextStageEngines] [DecodingStageEngines]
           |                     |                     |
       [Workers]             [Workers]             [Workers]
```
| Component | Role |
|---|---|
| `EPDOrchestrator` | Top-level pipeline controller |
| `*StageCluster` | Manages DP stage engines |
| `*StageEngine` | Runs scheduler, block manager, workers |
| `*StageScheduler` | Forms batches, assigns work |
| `Worker` | Executes model compute on GPUs, manages weights & cache |
| `BlockManager` | Allocates/frees cache blocks |
| `LoadEstimator` | Tracks system metrics |
| `ResourceAllocator` | Decides dynamic stage reassignments |
| `api_server` / `Request` | Provides RESTful APIs and tracks request info |
For more details, see README_CODE.md.
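To make the table concrete, the sketch below shows how a single request conceptually flows through the three stages. The class and method names are illustrative placeholders, not the repo's actual API; consult README_CODE.md for the real interfaces.

```python
# Illustrative pseudocode only: the flow of one request through the EPD stages.
# Class and method names are hypothetical; see README_CODE.md for the real API.

def serve_request(request, encode_cluster, context_cluster, decode_cluster):
    # 1. Encoding stage: turn images/audio/video into multimodal embeddings.
    mm_embeddings = encode_cluster.encode(request.multimodal_inputs)

    # 2. Context (prefill) stage: run prompt prefill with the embeddings,
    #    producing the KV cache and the first output token.
    kv_cache, first_token = context_cluster.prefill(request.prompt, mm_embeddings)

    # 3. Decoding stage: autoregressive generation, reusing the KV cache
    #    migrated from the prefill stage.
    return decode_cluster.generate(first_token, kv_cache, request.max_tokens)
```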
EPD disaggregation relies on custom CUDA/C++ utilities to enable high-performance inter-process GPU communication for cache and model weight management.
These components, implemented in the csrc/ folder, provide the core infrastructure for fast GPU-to-GPU transfers and zero-copy tensor sharing.
| File | Purpose |
|---|---|
| `block_migration.cpp` | Custom cache block migration for KV/MM caches (adapted for vLLM) |
| `zero_copy.py` | Zero-copy model sharing across processes via CUDA IPC |
| `zero_copy.py` | GPU-to-GPU async weight transfer using NCCL/NVLink |
| `py_nccl.cc` | Python binding for NCCL unique ID generation |
For more details, see README_CUDA.md.
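As a rough illustration of the zero-copy idea (not the repo's csrc/ implementation), stock PyTorch already uses CUDA IPC when a CUDA tensor is passed between processes; the custom utilities here provide a lower-level, asynchronous version of the same mechanism.

```python
# Rough illustration of CUDA IPC zero-copy sharing using stock PyTorch.
# Sending a CUDA tensor through torch.multiprocessing transfers an IPC handle,
# not a copy of the device memory.
import torch
import torch.multiprocessing as mp

def consumer(in_q, out_q):
    t = in_q.get()      # receives an IPC handle to the producer's GPU buffer
    t += 1              # in-place write, visible to the producer (no copy)
    out_q.put("done")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    in_q, out_q = mp.Queue(), mp.Queue()
    tensor = torch.zeros(4, device="cuda")
    p = mp.Process(target=consumer, args=(in_q, out_q))
    p.start()
    in_q.put(tensor)    # sends a CUDA IPC handle, not the data itself
    out_q.get()         # wait for the consumer to finish its update
    torch.cuda.synchronize()
    print(tensor)       # reflects the consumer's in-place update
    p.join()
```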
- Minimum 4 GPUs (8+ recommended, CUDA Compute Capability 8.0+)
- Python 3.10.14
- CUDA Toolkit 12.1 (`nvcc` version: release 12.1, V12.1.66)
- CUDA Driver ≥ 525.105.17
- System CUDA version: 12.0
- vLLM version 0.6.1.post2
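A quick sanity check of the environment (a convenience snippet, not part of the repo):

```python
# Verify the environment roughly matches the requirements listed above.
import torch, vllm

print("GPUs visible:", torch.cuda.device_count())                   # expect >= 4
print("Compute capability:", torch.cuda.get_device_capability(0))   # expect (8, 0) or higher
print("PyTorch CUDA toolkit:", torch.version.cuda)                  # expect 12.1
print("vLLM version:", vllm.__version__)                            # expect 0.6.1.post2
```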
!wget https://vbdai-notebooks.obs.cn-north-4.myhuaweicloud.com/epdserve/code.zip
!unzip -qo code.zip
cd EPD-Disaggregation
conda env create -f env.yml
# Get path to vllm folder
python -c "import vllm; print(vllm.__path__)"
# Output -> ['<path_to_vllm>']
code <path_to_vllm>/transformers_utils/image_processor.py
# In this file, change the default of the trust_remote_code argument to:
trust_remote_code: bool = True,
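If you prefer to apply the edit programmatically, something like the snippet below works, assuming the default in your installed vLLM copy is spelled exactly `trust_remote_code: bool = False,` (verify the file afterwards):

```python
# Patch the installed vLLM image_processor.py in place.
# Assumption: the default is written exactly as in the `old` string below.
import pathlib
import vllm

path = pathlib.Path(vllm.__path__[0]) / "transformers_utils" / "image_processor.py"
old = "trust_remote_code: bool = False,"
new = "trust_remote_code: bool = True,"
src = path.read_text()
path.write_text(src.replace(old, new))
print("Patched", path)
```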
To enable fast inter-stage communication via shared memory:
cd csrc
mkdir build && cd build
cmake ..
make
This generates the `libcuda_ipc_utils.so` library needed at runtime.
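An optional way to confirm the build produced a loadable library (not part of the repo's scripts):

```python
# Optional sanity check: make sure the freshly built IPC library can be loaded.
import ctypes
import os

lib = os.environ.get("CUDA_IPC_LIB_PATH", "./csrc/build/libcuda_ipc_utils.so")
ctypes.CDLL(lib)  # raises OSError if the build is broken or a dependency is missing
print("Loaded", lib)
```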
Go to the project root folder:
cd <project_root>
Run end-to-end disaggregated inference with offline requests:
export CUDA_IPC_LIB_PATH=./csrc/build/libcuda_ipc_utils.so
python offline.py
Start the server:
export CUDA_IPC_LIB_PATH=./csrc/build/libcuda_ipc_utils.so
sh launch_server.sh
Send a request (in a separate terminal):
export CUDA_IPC_LIB_PATH=./csrc/build/libcuda_ipc_utils.so
python online.py
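online.py wraps the client side. If you want to script requests yourself, the general shape is sketched below; the URL, route, and payload fields are illustrative assumptions, so check online.py and the api_server code for the actual schema.

```python
# Illustrative client sketch; the port, route, and payload fields are assumptions.
# Consult online.py and the api_server code for the real request format.
import requests

payload = {
    "prompt": "Describe this image.",
    "image_url": "https://example.com/cat.jpg",  # hypothetical field name
    "max_tokens": 128,
}
resp = requests.post("http://localhost:8000/generate", json=payload)  # hypothetical route
print(resp.json())
```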
To run the baseline in baselines/pd:
cd baselines/pd
export CUDA_IPC_LIB_PATH=./csrc/build/libcuda_ipc_utils.so
# Offline
python offline.py
# Online (same as above)
sh launch_server.sh
python online.py
To run the baseline in baselines/d:
cd baselines/d
export CUDA_IPC_LIB_PATH=./csrc/build/libcuda_ipc_utils.so
# Offline
python offline.py
# Online (same as above)
sh launch_server.sh
python online.py
cd <project_root>
# Run all baselines. Can configure request rates, model, #images, etc.
python automate_e2e.py
# Plot (TTFT and SLO) and analyze results
cd experiments/intra_req_dp_e2e
jupyter notebook analyze.ipynb  # open the notebook and run all cells
# Run without dynamic role switching
## Change line 44 to 'orchestrator_class = EPDOrchestrator'
python offline.py # e2e time ~68s
# Run with dynamic role switching
## Change line 44 to 'orchestrator_class = DynamicEPDOrchestrator'
python offline.py # e2e time ~25s
The code is implemented based on DistServe and vLLM. We thank the contributors for their great work!
If you find this code useful, please cite our paper:
@misc{singh2025efficientlyservinglargemultimodal,
title={Efficiently Serving Large Multimodal Models Using EPD Disaggregation},
author={Gursimran Singh and Xinglu Wang and Yifan Hu and Timothy Yu and Linzi Xing and Wei Jiang and Zhefeng Wang and Xiaolong Bai and Yi Li and Ying Xiong and Yong Zhang and Zhenan Fan},
year={2025},
eprint={2501.05460},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2501.05460},
}
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.