# LMCache supports gpt-oss (20B/120B) on Day 1

> Source: [LMCache blog website](https://blog.lmcache.ai/2025-08-05-gpt-oss-support/)

LMCache is an LLM serving engine extension that reduces TTFT and increases throughput, especially in long-context scenarios. By storing the KV caches of reusable text across multiple locations (GPU, CPU DRAM, local disk), LMCache reuses the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. This saves precious GPU cycles and reduces user response delay.

By combining LMCache with vLLM, developers achieve 3-10x savings in delay and GPU cycles across many LLM use cases, including multi-round QA and RAG.

LMCache now supports OpenAI’s newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for the GPT-OSS models and demonstrates the significant performance improvements enabled by our CPU offloading capabilities.

## Step 1: Install the vLLM GPT-OSS Version

### Installation

```shell
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```
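
As a quick sanity check, you can confirm which vLLM build ended up in the environment before starting the server (a minimal sketch; the exact version string is based on the wheel pinned above and may differ in your install):

```python
# Print the installed vLLM version to confirm the GPT-OSS preview build is active.
import vllm

print(vllm.__version__)  # expected to look like "0.10.1+gptoss" for the preview wheel
```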

### Test the Installation

```shell
vllm serve openai/gpt-oss-120b \
  --max-model-len 32768 \
  --disable-hybrid-kv-cache-manager
```
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Hello how are you today"
      }
    ],
    "temperature": 0.7
  }'
```
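
If you prefer Python, the same smoke test can be done through the OpenAI-compatible API (a minimal sketch; it assumes the server above is listening on vLLM's default port 8000 and that the `openai` package is installed):

```python
# Minimal smoke test against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello how are you today"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```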

## Step 2: Install LMCache from Source

### Why Install from Source?

vLLM requires a nightly build of PyTorch to serve the GPT-OSS models. To ensure compatibility, we highly recommend building LMCache against the PyTorch version already installed in your virtual environment.
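
Before building, it can help to check which PyTorch the environment actually provides (a small sketch; the version strings in the comments are illustrative, not exact):

```python
# Print the PyTorch and CUDA versions that LMCache's kernels will be compiled against.
import torch

print(torch.__version__)   # should be a nightly/dev build pulled in by the vLLM GPT-OSS wheel
print(torch.version.cuda)  # CUDA toolkit version, e.g. "12.8" for the cu128 index used above
```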

### Installation Process

Install LMCache from source (this command may take a few minutes due to CUDA kernel compilations):

```shell
git clone https://github.com/LMCache/LMCache.git
cd LMCache

# In your virtual environment
ENABLE_CXX11_ABI=1 uv pip install -e . --no-build-isolation
```

### Test the Installation

```shell
python3 -c "import torch; import lmcache; import lmcache.c_ops"
```

## Step 3: Run vLLM with LMCache

### LMCache Configuration

Create a configuration file `backend_cpu.yaml` for CPU offloading (`max_local_cpu_size` is the CPU buffer size in GB):

```yaml
# Create a CPU offloading buffer of 80 GB
chunk_size: 256
local_cpu: True
max_local_cpu_size: 80
```

### Launch vLLM with LMCache

```shell
LMCACHE_CONFIG_FILE="./backend_cpu.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
  openai/gpt-oss-120b \
  --max-model-len 32768 \
  --disable-log-requests \
  --disable-hybrid-kv-cache-manager \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
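
To see cache reuse in action, you can send the same long prompt twice and compare end-to-end latency; the second request should be served largely from cache (vLLM's prefix cache and/or LMCache's CPU buffer). This is a rough illustration, not the official benchmark; it assumes the server above is on port 8000 and uses a synthetic repeated-text document:

```python
# Rough check of KV cache reuse: the second identical request should return noticeably faster.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Synthetic "long document" prompt (illustrative only).
long_doc = "LMCache stores KV caches of reusable text. " * 500

def timed_request():
    start = time.perf_counter()
    client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": long_doc + "\n\nSummarize the text above in one sentence."}],
        max_tokens=50,
        temperature=0.0,
    )
    return time.perf_counter() - start

print(f"first request (cold cache): {timed_request():.2f}s")
print(f"second request (cache reuse): {timed_request():.2f}s")
```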

## Step 4: Benchmark Results

### Use Case: Long Document Q&A

- Input: 20 different documents with an average length of 20K tokens each
- Output: 50 tokens per query

1. Phase 1: Send all documents to the serving engine to warm up the KV cache
2. Phase 2: Shuffle the queries and send them again, measuring TTFT and total finish time (a minimal sketch of this flow is shown below)
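
The snippet below sketches that two-phase flow, measuring TTFT as the time to the first streamed chunk. It is only an outline of the methodology, not the actual `long-doc-qa.py` script; the document contents, counts, and port are placeholder assumptions:

```python
# Two-phase long-document QA sketch: warm up the KV cache, then replay shuffled queries and time TTFT.
import random
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "openai/gpt-oss-120b"

# Placeholder documents; the real benchmark uses 20 documents of ~20K tokens each.
documents = [f"Document {i}. " + ("Lorem ipsum dolor sit amet. " * 200) for i in range(5)]

def ask(doc):
    """Send one query about a document and return its TTFT in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": doc + "\n\nIn about 50 tokens, what is this document about?"}],
        max_tokens=50,
        stream=True,
    )
    ttft = None
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk marks the TTFT
    return ttft

# Phase 1: warm up the KV cache (GPU + LMCache CPU buffer).
for doc in documents:
    ask(doc)

# Phase 2: shuffle and replay, measuring TTFT and total finish time.
random.shuffle(documents)
phase2_start = time.perf_counter()
ttfts = [ask(doc) for doc in documents]
print(f"average TTFT: {sum(ttfts) / len(ttfts):.2f}s")
print(f"time to finish all queries: {time.perf_counter() - phase2_start:.2f}s")
```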

### Performance Results

The benchmark results for Phase 2 show impressive improvements:

| Setup          | Average TTFT (secs) | Time to finish all queries (secs) |
| -------------- | ------------------- | --------------------------------- |
| Vanilla vLLM   | 1.20                | 15.70                             |
| vLLM + LMCache | 0.39                | 7.73                              |

### Why the Performance Gain?

When a single A100/H100 serves GPT-OSS 120B, the available GPU KV cache buffer is typically less than 10 GB. With LMCache’s CPU offloading buffer, vLLM can store and reuse the KV cache for many more prefixes (a back-of-the-envelope estimate follows the list below), resulting in:

- **67% reduction** in Time to First Token (TTFT)
- **51% reduction** in total query completion time
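
To see why the GPU buffer alone cannot hold the whole working set, here is a generic per-token KV size estimate. The model constants below are illustrative placeholders, not the real gpt-oss-120b configuration (which also uses sliding-window attention on some layers), so treat the results as order-of-magnitude only:

```python
# Back-of-the-envelope KV cache sizing with placeholder model constants (NOT the real gpt-oss-120b config).
num_layers = 36        # hypothetical
num_kv_heads = 8       # hypothetical (GQA)
head_dim = 64          # hypothetical
bytes_per_elem = 2     # fp16/bf16 KV cache

# K and V, for every layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_tokens = 20 * 20_000  # 20 documents x ~20K tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for all documents: {kv_bytes_per_token * total_tokens / 1e9:.1f} GB")
print(f"tokens that fit in a 10 GB GPU buffer: {10e9 / kv_bytes_per_token / 1e3:.0f}K")
print(f"tokens that fit in an 80 GB CPU buffer: {80e9 / kv_bytes_per_token / 1e3:.0f}K")
```

Under these placeholder numbers, the full 400K-token working set needs tens of GB of KV cache, which overflows a sub-10 GB GPU buffer but fits comfortably in an 80 GB CPU buffer.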

### Running the Benchmark

You can reproduce these results using our benchmark script:

```shell
python long-doc-qa.py --num-documents 20 \
    --document-length 20000 --output-len 50 \
    --repeat-count 1 --repeat-mode random \
    --shuffle-seed 0
```

## References

- [lmcache.ai website](https://lmcache.ai/)
- [LMCache repo](https://github.com/LMCache/LMCache)
- [Complete benchmark script (long-doc-qa.py)](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py)