/ˈCHirəp/ — (especially of a small bird) make repeated short high-pitched sounds; twitter.
Chirrup is a high-performance inference frontend for RWKV models, built on top of Albatross.
| GPU Configuration | Model | Workers | Batch Size per Worker | Total Concurrent Requests | TPS per Request |
|---|---|---|---|---|---|
| 4 × RTX 4090 24GB | 7.2B | 4 | 200 | 800 | 16 |
| 4 × Tesla V100 16GB | 7.2B | 4 | 34 | 136 | 17 |
Note: The RTX 4090 configuration is far from the GPU's processing limits, leaving significant headroom for further optimization.
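These figures imply substantial aggregate throughput. A quick sanity check, derived purely from the table above (per-request TPS is an average, so treat the result as approximate):

```python
# Aggregate throughput implied by the RTX 4090 row of the table above.
# This is simple arithmetic on the published numbers, not a new benchmark.
workers, bsz_per_worker, tps_per_request = 4, 200, 16

concurrent = workers * bsz_per_worker          # 800 concurrent requests
aggregate_tps = concurrent * tps_per_request   # ~12,800 tokens/s in total
print(concurrent, aggregate_tps)
```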
- High Performance: Leverages the blazing-fast inference engine from Albatross.
- Continuous Batching: Maximizes GPU utilization by dynamically batching incoming requests.
- State Cache: Reuses computation states for long-context inputs, significantly improving throughput as context length increases (see the sketch below).
- OpenAI-Compatible API: Drop-in replacement for existing LLM workflows — no code changes needed.
- CUDA Graph support for reduced kernel launch overhead
- Prefill-Decode separation for optimized scheduling
- Constrained decoding (e.g., JSON schema)
- Function Calling support
- Pipeline parallelism to enable inference of even larger models
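The state cache relies on RWKV's recurrent architecture: the model state after processing a prefix fully summarizes that prefix, so requests sharing a prefix only pay its prefill cost once. Below is a minimal, illustrative sketch of the idea; it is not Chirrup's actual implementation, and `PrefixStateCache`, `prefill`, and `State` are hypothetical names:

```python
from typing import Callable, Dict, Optional

State = object  # stand-in for the per-layer RWKV recurrent state


class PrefixStateCache:
    """Illustrative prefix-state cache for a recurrent (RWKV-style) model."""

    def __init__(self, prefill: Callable[[Optional[State], str], State]):
        self.prefill = prefill             # advances a state over new text
        self.cache: Dict[str, State] = {}  # prompt prefix -> state after it

    def state_after(self, prompt: str) -> State:
        """State after `prompt`, recomputing only the uncached suffix."""
        best = max((p for p in self.cache if prompt.startswith(p)),
                   key=len, default="")
        state = self.prefill(self.cache.get(best), prompt[len(best):])
        self.cache[prompt] = state
        return state
```

In a real engine the cache would be keyed on token IDs and bounded in size, but the reuse principle is the same.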
Visit the official model hub and download an RWKV-7 g1 series model that fits your needs:
👉 https://huggingface.co/BlinkDL/rwkv7-g1/tree/main
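If you prefer to script the download, the `huggingface_hub` package can fetch a checkpoint directly; the filename below is only an example placeholder, so pick an actual file from the repo listing:

```python
# Optional helper: download a checkpoint from the hub programmatically.
# The filename is an example placeholder; verify it against the repo listing.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="BlinkDL/rwkv7-g1",
    filename="rwkv7-g1-2.9b-20250519-ctx4096.pth",  # example; check the hub
)
print(path)  # local path to pass as --model_path
```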
For best performance, we strongly recommend using Python 3.14t (free-threaded) via uv.
```bash
# Clone the repository
git clone --recurse-submodules https://github.com/leonsama/chirrup.git
cd chirrup

# Create a Python 3.14t virtual environment
uv venv --python 3.14t

# Activate it
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Install Chirrup
uv pip install -e .

# Install dependencies with CUDA 12.9 support and dev tools
uv sync --extra torch-cu129 --dev
```

💡 You may use `torch-cu126` instead if your system requires it, or customize the PyTorch backend in `pyproject.toml`.
If you are on a ROCm device, use the following script to install dependencies:
```bash
git clone --recurse-submodules https://github.com/leonsama/chirrup.git
cd chirrup
uv venv --python 3.14t
source .venv/bin/activate
uv sync --extra dev
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4
```

Then start the service:

```bash
# Currently, `triton._C.libtriton` doesn't declare itself GIL-safe, but it actually
# works fine, so we manually disable the GIL with `PYTHON_GIL=0`.
PYTHON_GIL=0 uv run --frozen python -m chirrup.web_service.app --model_path /path/to/your/model
```

The service will start at http://127.0.0.1:8000, providing OpenAI-compatible API endpoints.
📖 Detailed Documentation: See the Chirrup API Documentation for complete command-line parameters and API reference.
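Because the endpoints are OpenAI-compatible, any OpenAI SDK should work against them. Below is a minimal sketch with the official `openai` Python client, assuming the conventional `/v1` route prefix; the model name is a placeholder, so use whatever your deployment reports:

```python
# A minimal sketch using the official `openai` client against Chirrup's
# OpenAI-compatible endpoints. The `/v1` prefix and the model name are
# assumptions; adjust both to match your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="rwkv7-g1",  # placeholder; use the model name your server reports
    messages=[{"role": "user", "content": "Why is 42 an interesting number?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```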
```bash
PYTHON_GIL=0 uv run --frozen test/demo_stream_output.py --model_path /path/to/your/model
```

Code Example:
```python
import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_config = ModelLoadConfig(
        model_path="/path/to/your/model",
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )
    engine_core = AsyncEngineCore()
    await engine_core.init(worker_num=1, model_config=model_config, batch_size=4)

    prompt = "User: Why is 42 an interesting number?\n\nAssistant:"
    completion = engine_core.completion(prompt)
    print(prompt, end="", flush=True)
    # Each event is a tuple; "token" events carry the decoded text at index 2.
    async for event in completion:
        if event[0] == "token":
            print(event[2], end="", flush=True)


asyncio.run(main())
```

```bash
PYTHON_GIL=0 uv run --frozen test/demo_batch_output.py --model_path /path/to/your/model --batch_size 32 --task_num 512 --worker_num 4
```

Code Example:
```python
import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_config = ModelLoadConfig(
        model_path="/path/to/your/model",
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )
    engine_core = AsyncEngineCore()
    # batch_size = max batch + 1
    await engine_core.init(worker_num=4, model_config=model_config, batch_size=33)

    prompts = [
        f"User: Why is {i} an interesting number?\n\nAssistant: <think>\n</think>"
        for i in range(512)
    ]
    results = await asyncio.gather(
        *[engine_core.completion(prompt).get_full_completion() for prompt in prompts]
    )
    return results


asyncio.run(main())
```

Contributions are welcome! Please feel free to submit a Pull Request.
- Thanks to RWKV-Vibe/rwkv_lightning for inspiration and to its author Alic for valuable guidance.
- Thanks to Jellyfish for the continuous batching implementation in Albatross.
