
Commit 9b06023

Merge pull request d-run#416 from windsonsea/lmcach
add a blog: 2025/lmcache.md
2 parents b530ee0 + b487504 commit 9b06023

7 files changed: +303 -0 lines changed

Binary image file added (143 KB)

docs/zh/docs/blogs/2025/lmcache.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# LMCache supports GPT-OSS (20B/120B) on Day 1

> English original: [LMCache blog website](https://blog.lmcache.ai/2025-08-05-gpt-oss-support/)

LMCache is an extension for large language model (LLM) serving engines that reduces TTFT (time to first token) and increases throughput, especially in long-context scenarios.
By storing the KV caches of reusable text in multiple locations (GPU, CPU memory, and local disk), it lets any serving engine instance reuse those caches for any repeated text, not just prefixes.
LMCache therefore saves precious GPU cycles and reduces user response latency.

Combining LMCache with vLLM, developers can cut latency and GPU usage by 3-10x in many LLM use cases, including multi-round QA and RAG.
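As a rough conceptual illustration of this reuse (a minimal sketch only, not LMCache's actual API or storage layout), the snippet below models a tiered KV-cache store keyed by hashes of fixed-size token chunks, so any previously seen chunk (not only a shared prefix) can be served from GPU, CPU, or disk:

```python
# Conceptual sketch only -- this is NOT LMCache's real API or storage layout.
import hashlib

CHUNK_SIZE = 256  # tokens per cache entry, mirroring chunk_size in the LMCache config below

def chunk_key(token_ids: list[int]) -> str:
    """Hash a chunk of token ids so identical text always maps to the same cache entry."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

class TieredKVCache:
    """Toy multi-tier store: look up a chunk in GPU first, then CPU, then disk."""

    def __init__(self) -> None:
        self.tiers = {"gpu": {}, "cpu": {}, "disk": {}}  # real tiers hold KV tensors

    def get(self, token_ids: list[int]):
        key = chunk_key(token_ids)
        for name, tier in self.tiers.items():
            if key in tier:
                return name, tier[key]  # hit: prefill for this chunk can be skipped
        return None, None               # miss: the engine must recompute the KV

    def put(self, token_ids: list[int], kv_tensors, tier: str = "cpu") -> None:
        self.tiers[tier][chunk_key(token_ids)] = kv_tensors

# A chunk that reappears anywhere in a later prompt still hits the cache:
cache = TieredKVCache()
cache.put(list(range(CHUNK_SIZE)), kv_tensors="<kv placeholder>", tier="cpu")
print(cache.get(list(range(CHUNK_SIZE))))  # ('cpu', '<kv placeholder>')
```

In the real system the tiers hold actual KV tensors managed by the serving engine; the point here is only that lookups are keyed per chunk rather than per whole prompt prefix.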
LMCache supports OpenAI's newly released GPT-OSS models (20B and 120B parameters) from day one! This post walks through setting up vLLM with LMCache for the GPT-OSS models and demonstrates significant performance improvements from CPU cache offloading.

![LMCache GPT-OSS integration](./images/lmcache01.png)

## Step 1: Install the vLLM GPT-OSS Version

### Installation

```shell
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

### Test the Installation

```shell
vllm serve openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-hybrid-kv-cache-manager
```

```shell
# This assumes vLLM was started with --port 9000; by default `vllm serve` listens on port 8000.
curl http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [
            {
                "role": "user",
                "content": "Hello how are you today"
            }
        ],
        "temperature": 0.7
    }'
```
## Step 2: Install LMCache from Source

### Why Install from Source?

vLLM needs a nightly build of PyTorch to serve the GPT-OSS models. To ensure compatibility, we strongly recommend installing LMCache against the PyTorch version already in your current virtual environment.

### Installation Steps

Install LMCache from source (this command may take a few minutes because it compiles CUDA kernels):

```shell
git clone https://github.com/LMCache/LMCache.git
cd LMCache

# In your virtual environment
ENABLE_CXX11_ABI=1 uv pip install -e . --no-build-isolation
```

### Test the Installation

```shell
python3 -c "import torch; import lmcache; import lmcache.c_ops"
```
## Step 3: Run vLLM with LMCache

### LMCache Configuration

Create a `backend_cpu.yaml` configuration file for CPU cache offloading:

```yaml
# Create an 80 GB CPU cache buffer
chunk_size: 256
local_cpu: True
max_local_cpu_size: 80
```

### Launch vLLM with LMCache

```shell
LMCACHE_CONFIG_FILE="./backend_cpu.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
    openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-log-requests \
    --disable-hybrid-kv-cache-manager \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
## Step 4: Benchmark Results

### Use Case: Long-Document Q&A

* Input: 20 different documents, each about 20K tokens long on average
* Output: 50 tokens per query

1. Phase 1: send all documents to the serving engine to warm up the KV cache
2. Phase 2: shuffle the queries and send them again, measuring TTFT and completion time (a sketch of how to measure TTFT yourself follows this list)
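As a rough way to measure TTFT yourself (a minimal sketch, not the project's benchmark harness; it assumes `pip install openai` and that vLLM's OpenAI-compatible endpoint from Step 1 is reachable at the URL below), you can time the arrival of the first streamed token:

```python
# Hedged sketch: measures TTFT against the OpenAI-compatible endpoint started earlier.
# Assumes `pip install openai`; adjust base_url to wherever vLLM is listening (default port 8000).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "openai/gpt-oss-120b") -> float:
    """Return seconds from sending the request to receiving the first content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first token arrived
    return float("nan")

print(f"TTFT: {measure_ttft('Paste one long document here, then ask a question.'):.2f} s")
```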
### Performance Results

The Phase 2 benchmark results show a significant improvement:

| Setup | Average TTFT (s) | Time to finish all queries (s) |
| -------------- | ---------------- | ------------------------------ |
| Vanilla vLLM | 1.20 | 15.70 |
| vLLM + LMCache | 0.39 | 7.73 |

### Why Is the Performance Gain So Significant?

When running GPT-OSS 120B on a single A100/H100, the GPU buffer available for the KV cache is typically less than 10 GB.
With LMCache's CPU cache offloading, vLLM can store and reuse the KV cache for many more prefixes (a back-of-envelope estimate of the cache size follows the list below), achieving:

* **67% lower TTFT**
* **51% lower total query completion time**
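To see why CPU offloading matters here, a back-of-envelope estimate is below; the per-token KV size is an assumed round number for illustration only, not a measured figure for gpt-oss-120b:

```python
# Back-of-envelope only: kv_bytes_per_token is an assumed illustrative value,
# not a measured figure for gpt-oss-120b.
num_documents = 20
tokens_per_document = 20_000
kv_bytes_per_token = 100 * 1024  # assume roughly 100 KB of KV cache per token

total_kv_gib = num_documents * tokens_per_document * kv_bytes_per_token / 1024**3
print(f"KV cache for all 20 documents: ~{total_kv_gib:.0f} GiB")  # ~38 GiB under this assumption

# Under this assumption, a <10 GB GPU buffer holds only a few documents at a time,
# while the 80 GB CPU buffer (max_local_cpu_size: 80) can keep all of them resident.
```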
127+
128+
### 运行基准测试
129+
130+
你可以使用我们的基准测试脚本复现这些结果:
131+
132+
```shell
133+
python long-doc-qa.py --num-documents 20 \
134+
--document-length 20000 --output-len 50 \
135+
--repeat-count 1 --repeat-mode random \
136+
--shuffle-seed 0
137+
```
138+
139+
## 参考资料
140+
141+
* [lmcache.ai 官网](https://lmcache.ai/)
142+
* [LMCache 仓库](https://github.com/LMCache/LMCache)
143+
* 查看[完整基准测试脚本](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py)

docs/zh/docs/blogs/index.md

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,11 @@ hide:
This channel will closely follow technology trends and collect news from the AI industry.

* [LMCache supports GPT-OSS (20B/120B) on Day 1](./2025/lmcache.md)

LMCache supports OpenAI's newly released GPT-OSS models (20B and 120B parameters) from day one! This post walks through setting up vLLM with LMCache for the GPT-OSS models and demonstrates significant performance improvements from CPU cache offloading.

* [FlowSpeech: The World's First TTS Converting Written Language into Spoken Language](./2025/flowspeech.md)

Artificial intelligence voice synthesis technology has reached a new breakthrough. An AI text-to-speech tool named FlowSpeech has been officially released,
Binary image file added (143 KB)
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# LMCache supports gpt-oss (20B/120B) on Day 1

> Source: [LMCache blog website](https://blog.lmcache.ai/2025-08-05-gpt-oss-support/)

LMCache is an LLM serving engine extension that reduces TTFT and increases throughput, especially
under long-context scenarios. By storing the KV caches of reusable texts across various locations,
including GPU, CPU DRAM, and local disk, LMCache reuses the KV caches of any repeated text
(not necessarily a prefix) in any serving engine instance. Thus, LMCache saves precious
GPU cycles and reduces user response delay.

By combining LMCache with vLLM, developers achieve 3-10x delay savings and GPU cycle
reduction in many LLM use cases, including multi-round QA and RAG.
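As a rough conceptual illustration of this reuse (a minimal sketch only, not LMCache's actual API or storage layout), the snippet below models a tiered KV-cache store keyed by hashes of fixed-size token chunks, so any previously seen chunk (not only a shared prefix) can be served from GPU, CPU, or disk:

```python
# Conceptual sketch only -- this is NOT LMCache's real API or storage layout.
import hashlib

CHUNK_SIZE = 256  # tokens per cache entry, mirroring chunk_size in the LMCache config below

def chunk_key(token_ids: list[int]) -> str:
    """Hash a chunk of token ids so identical text always maps to the same cache entry."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

class TieredKVCache:
    """Toy multi-tier store: look up a chunk in GPU first, then CPU, then disk."""

    def __init__(self) -> None:
        self.tiers = {"gpu": {}, "cpu": {}, "disk": {}}  # real tiers hold KV tensors

    def get(self, token_ids: list[int]):
        key = chunk_key(token_ids)
        for name, tier in self.tiers.items():
            if key in tier:
                return name, tier[key]  # hit: prefill for this chunk can be skipped
        return None, None               # miss: the engine must recompute the KV

    def put(self, token_ids: list[int], kv_tensors, tier: str = "cpu") -> None:
        self.tiers[tier][chunk_key(token_ids)] = kv_tensors

# A chunk that reappears anywhere in a later prompt still hits the cache:
cache = TieredKVCache()
cache.put(list(range(CHUNK_SIZE)), kv_tensors="<kv placeholder>", tier="cpu")
print(cache.get(list(range(CHUNK_SIZE))))  # ('cpu', '<kv placeholder>')
```

In the real system the tiers hold actual KV tensors managed by the serving engine; the point here is only that lookups are keyed per chunk rather than per whole prompt prefix.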
LMCache now supports OpenAI's newly released GPT-OSS models (20B and 120B parameters)
from day one! This post provides a complete guide to setting up vLLM with LMCache for
GPT-OSS models and demonstrates significant performance improvements through our CPU
offloading capabilities.

![LMCache GPT-OSS Integration](./images/lmcache01.png)

## Step 1: Installing the vLLM GPT-OSS Version

### Installation

```shell
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

### Test the Installation

```shell
vllm serve openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-hybrid-kv-cache-manager
```

```shell
# This assumes vLLM was started with --port 9000; by default `vllm serve` listens on port 8000.
curl http://localhost:9000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [
            {
                "role": "user",
                "content": "Hello how are you today"
            }
        ],
        "temperature": 0.7
    }'
```
## Step 2: Install LMCache from Source

### Why Install from Source?

vLLM requires a nightly build of PyTorch to serve the GPT-OSS models. To ensure compatibility, we highly recommend installing LMCache against the PyTorch version in your current virtual environment.

### Installation Process

Install LMCache from source (this command may take a few minutes due to CUDA kernel compilation):

```shell
git clone https://github.com/LMCache/LMCache.git
cd LMCache

# In your virtual environment
ENABLE_CXX11_ABI=1 uv pip install -e . --no-build-isolation
```

### Test the Installation

```shell
python3 -c "import torch; import lmcache; import lmcache.c_ops"
```
## Step 3: Run vLLM with LMCache

### LMCache Configuration

Create a configuration file `backend_cpu.yaml` for CPU offloading:

```yaml
# Create a CPU offloading buffer with 80G
chunk_size: 256
local_cpu: True
max_local_cpu_size: 80
```

### Launch vLLM with LMCache

```shell
LMCACHE_CONFIG_FILE="./backend_cpu.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
    openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-log-requests \
    --disable-hybrid-kv-cache-manager \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
## Step 4: Benchmark Results

### Use Case: Long Document Q&A

- Input: 20 different documents with an average length of 20K tokens each
- Output: 50 tokens per query

1. Phase 1: Send all documents to the serving engines to warm up the KV cache
2. Phase 2: Shuffle the queries and send them again, measuring TTFT and finish time (a sketch of how to measure TTFT yourself follows this list)
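As a rough way to measure TTFT yourself (a minimal sketch, not the project's benchmark harness; it assumes `pip install openai` and that vLLM's OpenAI-compatible endpoint from Step 1 is reachable at the URL below), you can time the arrival of the first streamed token:

```python
# Hedged sketch: measures TTFT against the OpenAI-compatible endpoint started earlier.
# Assumes `pip install openai`; adjust base_url to wherever vLLM is listening (default port 8000).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "openai/gpt-oss-120b") -> float:
    """Return seconds from sending the request to receiving the first content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first token arrived
    return float("nan")

print(f"TTFT: {measure_ttft('Paste one long document here, then ask a question.'):.2f} s")
```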
### Performance Results

The benchmark results for Phase 2 show impressive improvements:

| Setup | Average TTFT (secs) | Time to finish all queries (secs) |
| -------------- | ------------------- | --------------------------------- |
| Vanilla vLLM | 1.20 | 15.70 |
| vLLM + LMCache | 0.39 | 7.73 |

### Why the Performance Gain?

When using a single A100/H100 to serve GPT-OSS 120B, the available KV cache GPU buffer is typically less than 10 GB. With LMCache's CPU offloading buffer, vLLM can store and reuse KV cache for many more prefixes (a back-of-envelope estimate of the cache size follows the list below), resulting in:

- **67% reduction** in Time to First Token (TTFT)
- **51% reduction** in total query completion time
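To see why CPU offloading matters here, a back-of-envelope estimate is below; the per-token KV size is an assumed round number for illustration only, not a measured figure for gpt-oss-120b:

```python
# Back-of-envelope only: kv_bytes_per_token is an assumed illustrative value,
# not a measured figure for gpt-oss-120b.
num_documents = 20
tokens_per_document = 20_000
kv_bytes_per_token = 100 * 1024  # assume roughly 100 KB of KV cache per token

total_kv_gib = num_documents * tokens_per_document * kv_bytes_per_token / 1024**3
print(f"KV cache for all 20 documents: ~{total_kv_gib:.0f} GiB")  # ~38 GiB under this assumption

# Under this assumption, a <10 GB GPU buffer holds only a few documents at a time,
# while the 80 GB CPU buffer (max_local_cpu_size: 80) can keep all of them resident.
```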
### Running the Benchmark

You can reproduce these results using our benchmark script:

```shell
python long-doc-qa.py --num-documents 20 \
    --document-length 20000 --output-len 50 \
    --repeat-count 1 --repeat-mode random \
    --shuffle-seed 0
```

## References

- [lmcache.ai website](https://lmcache.ai/)
- [LMCache repo](https://github.com/LMCache/LMCache)
- Check the [complete benchmark script](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py).

docs/zh/docs/en/blogs/index.md

Lines changed: 7 additions & 0 deletions
@@ -7,6 +7,13 @@ hide:
This channel will closely follow technology trends and collect news from the AI industry.

- [LMCache supports gpt-oss (20B/120B) on Day 1](./2025/lmcache.md)

LMCache now supports OpenAI's newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for GPT-OSS models and demonstrates significant performance improvements through our CPU offloading capabilities.

- [FlowSpeech: The World's First TTS Converting Written Language into Spoken Language](./2025/flowspeech.md)

Artificial intelligence voice synthesis technology has reached a new breakthrough. An AI text-to-speech tool named FlowSpeech has been officially released, distinguished by its ability to convert written text into natural, fluent spoken language, providing users with a voice synthesis experience closer to real conversation.

docs/zh/navigation.yml

Lines changed: 2 additions & 0 deletions
@@ -85,6 +85,7 @@ nav:
- 费用中心: videos/bills.md
- AI 行业新闻:
- 索引: blogs/index.md
- LMCache 上线即支持 GPT-OSS: blogs/2025/lmcache.md
- FlowSpeech 书面语转口语: blogs/2025/flowspeech.md
- GPT-5 正式发布: blogs/2025/gpt5.md
- d.run 上新 DeepSeek-R1-0528: blogs/2025/0603-deepseek-0528.md
@@ -261,3 +262,4 @@ plugins:
GPT-5 正式发布: GPT-5 Officially Released
d.run 上新 DeepSeek-R1-0528: d.run Launches DeepSeek-R1-0528
FlowSpeech 书面语转口语: FlowSpeech Converts Text into Speech
LMCache 上线即支持 GPT-OSS: LMCache supports gpt-oss
