# LMCache supports gpt-oss (20B/120B) on Day 1

> Source: [LMCache blog website](https://blog.lmcache.ai/2025-08-05-gpt-oss-support/)

LMCache is an LLM serving engine extension that reduces TTFT and increases throughput, especially in long-context scenarios. By storing the KV caches of reusable text across multiple locations (GPU, CPU DRAM, local disk), LMCache reuses the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. This saves precious GPU cycles and reduces user response delay.

By combining LMCache with vLLM, developers achieve 3-10x savings in delay and GPU cycles across many LLM use cases, including multi-round QA and RAG.

LMCache now supports OpenAI’s newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for the GPT-OSS models and demonstrates the significant performance improvements enabled by our CPU offloading capabilities.

## Step 1: Install the vLLM GPT-OSS Version

### Installation

```shell
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```
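
As a quick sanity check, you can confirm which vLLM build ended up in the environment before starting the server (a minimal sketch; the exact version string is based on the wheel pinned above and may differ in your install):

```python
# Print the installed vLLM version to confirm the GPT-OSS preview build is active.
import vllm

print(vllm.__version__)  # expected to look like "0.10.1+gptoss" for the preview wheel
```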

### Test the Installation

```shell
vllm serve openai/gpt-oss-120b \
  --max-model-len 32768 \
  --disable-hybrid-kv-cache-manager
```
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Hello how are you today"
      }
    ],
    "temperature": 0.7
  }'
```
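
If you prefer Python, the same smoke test can be done through the OpenAI-compatible API (a minimal sketch; it assumes the server above is listening on vLLM's default port 8000 and that the `openai` package is installed):

```python
# Minimal smoke test against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key by default

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello how are you today"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```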

## Step 2: Install LMCache from Source

### Why Install from Source?

vLLM requires a nightly build of PyTorch to serve the GPT-OSS models. To ensure compatibility, we highly recommend building LMCache against the PyTorch version already installed in your virtual environment.
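
Before building, it can help to check which PyTorch the environment actually provides (a small sketch; the version strings in the comments are illustrative, not exact):

```python
# Print the PyTorch and CUDA versions that LMCache's kernels will be compiled against.
import torch

print(torch.__version__)   # should be a nightly/dev build pulled in by the vLLM GPT-OSS wheel
print(torch.version.cuda)  # CUDA toolkit version, e.g. "12.8" for the cu128 index used above
```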

### Installation Process

Install LMCache from source (this command may take a few minutes due to CUDA kernel compilations):

```shell
git clone https://github.com/LMCache/LMCache.git
cd LMCache

# In your virtual environment
ENABLE_CXX11_ABI=1 uv pip install -e . --no-build-isolation
```

### Test the Installation

```shell
python3 -c "import torch; import lmcache; import lmcache.c_ops"
```

## Step 3: Run vLLM with LMCache

### LMCache Configuration

Create a configuration file `backend_cpu.yaml` for CPU offloading (`max_local_cpu_size` is the CPU buffer size in GB):

```yaml
# Create a CPU offloading buffer of 80 GB
chunk_size: 256
local_cpu: True
max_local_cpu_size: 80
```

### Launch vLLM with LMCache

```shell
LMCACHE_CONFIG_FILE="./backend_cpu.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
  openai/gpt-oss-120b \
  --max-model-len 32768 \
  --disable-log-requests \
  --disable-hybrid-kv-cache-manager \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```
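
To see cache reuse in action, you can send the same long prompt twice and compare end-to-end latency; the second request should be served largely from cache (vLLM's prefix cache and/or LMCache's CPU buffer). This is a rough illustration, not the official benchmark; it assumes the server above is on port 8000 and uses a synthetic repeated-text document:

```python
# Rough check of KV cache reuse: the second identical request should return noticeably faster.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Synthetic "long document" prompt (illustrative only).
long_doc = "LMCache stores KV caches of reusable text. " * 500

def timed_request():
    start = time.perf_counter()
    client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": long_doc + "\n\nSummarize the text above in one sentence."}],
        max_tokens=50,
        temperature=0.0,
    )
    return time.perf_counter() - start

print(f"first request (cold cache): {timed_request():.2f}s")
print(f"second request (cache reuse): {timed_request():.2f}s")
```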

## Step 4: Benchmark Results

### Use Case: Long Document Q&A

- Input: 20 different documents with an average length of 20K tokens each
- Output: 50 tokens per query

1. Phase 1: Send all documents to the serving engine to warm up the KV cache
2. Phase 2: Shuffle the queries and send them again, measuring TTFT and total finish time (a minimal sketch of this flow is shown below)
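
The snippet below sketches that two-phase flow, measuring TTFT as the time to the first streamed chunk. It is only an outline of the methodology, not the actual `long-doc-qa.py` script; the document contents, counts, and port are placeholder assumptions:

```python
# Two-phase long-document QA sketch: warm up the KV cache, then replay shuffled queries and time TTFT.
import random
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "openai/gpt-oss-120b"

# Placeholder documents; the real benchmark uses 20 documents of ~20K tokens each.
documents = [f"Document {i}. " + ("Lorem ipsum dolor sit amet. " * 200) for i in range(5)]

def ask(doc):
    """Send one query about a document and return its TTFT in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": doc + "\n\nIn about 50 tokens, what is this document about?"}],
        max_tokens=50,
        stream=True,
    )
    ttft = None
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first streamed chunk marks the TTFT
    return ttft

# Phase 1: warm up the KV cache (GPU + LMCache CPU buffer).
for doc in documents:
    ask(doc)

# Phase 2: shuffle and replay, measuring TTFT and total finish time.
random.shuffle(documents)
phase2_start = time.perf_counter()
ttfts = [ask(doc) for doc in documents]
print(f"average TTFT: {sum(ttfts) / len(ttfts):.2f}s")
print(f"time to finish all queries: {time.perf_counter() - phase2_start:.2f}s")
```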

### Performance Results

The benchmark results for Phase 2 show impressive improvements:

| Setup          | Average TTFT (secs) | Time to finish all queries (secs) |
| -------------- | ------------------- | --------------------------------- |
| Vanilla vLLM   | 1.20                | 15.70                             |
| vLLM + LMCache | 0.39                | 7.73                              |

### Why the Performance Gain?

When a single A100/H100 serves GPT-OSS 120B, the available GPU KV cache buffer is typically less than 10 GB. With LMCache’s CPU offloading buffer, vLLM can store and reuse the KV cache for many more prefixes (a back-of-the-envelope estimate follows the list below), resulting in:

- **67% reduction** in Time to First Token (TTFT)
- **51% reduction** in total query completion time
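
To see why the GPU buffer alone cannot hold the whole working set, here is a generic per-token KV size estimate. The model constants below are illustrative placeholders, not the real gpt-oss-120b configuration (which also uses sliding-window attention on some layers), so treat the results as order-of-magnitude only:

```python
# Back-of-the-envelope KV cache sizing with placeholder model constants (NOT the real gpt-oss-120b config).
num_layers = 36        # hypothetical
num_kv_heads = 8       # hypothetical (GQA)
head_dim = 64          # hypothetical
bytes_per_elem = 2     # fp16/bf16 KV cache

# K and V, for every layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_tokens = 20 * 20_000  # 20 documents x ~20K tokens

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for all documents: {kv_bytes_per_token * total_tokens / 1e9:.1f} GB")
print(f"tokens that fit in a 10 GB GPU buffer: {10e9 / kv_bytes_per_token / 1e3:.0f}K")
print(f"tokens that fit in an 80 GB CPU buffer: {80e9 / kv_bytes_per_token / 1e3:.0f}K")
```

Under these placeholder numbers, the full 400K-token working set needs tens of GB of KV cache, which overflows a sub-10 GB GPU buffer but fits comfortably in an 80 GB CPU buffer.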

### Running the Benchmark

You can reproduce these results using our benchmark script:

```shell
python long-doc-qa.py --num-documents 20 \
    --document-length 20000 --output-len 50 \
    --repeat-count 1 --repeat-mode random \
    --shuffle-seed 0
```

## References

- [lmcache.ai website](https://lmcache.ai/)
- [LMCache repo](https://github.com/LMCache/LMCache)
- [Complete benchmark script (long-doc-qa.py)](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py)