
Chirrup

/ˈCHirəp/ — (especially of a small bird) make repeated short high-pitched sounds; twitter.

Chirrup is a high-performance inference frontend for RWKV models, built on top of Albatross.


📊 Performance

November 12, 2025

| GPU Configuration    | Model | Workers | BSZ/Worker | Total Concurrent Requests | TPS per Request |
|----------------------|-------|---------|------------|---------------------------|-----------------|
| 4 × RTX 4090 24GB    | 7.2B  | 4       | 200        | 800                       | 16              |
| 4 × Tesla V100 16GB  | 7.2B  | 4       | 34         | 136                       | 17              |

Note: The RTX 4090 configuration is still far from the GPUs' processing limits; significant optimization headroom remains.

✨ Features

✅ Implemented

  • High Performance: Leverages the blazing-fast inference engine from Albatross.
  • Continuous Batching: Maximizes GPU utilization by dynamically batching incoming requests.
  • State Cache: Reuses computation states for long-context inputs, significantly improving throughput as context length increases.
  • OpenAI-Compatible API: Drop-in replacement for existing LLM workflows — no code changes needed.

🔜 Planned

  • CUDA Graph support for reduced kernel launch overhead
  • Prefill-Decode separation for optimized scheduling
  • Constrained decoding (e.g., JSON schema)
  • Function Calling support
  • Pipeline parallelism to enable inference of even larger models

🚀 Getting Started

1. Download a Model

Visit the official model hub and download an RWKV-7 g1 series model that fits your needs:
👉 https://huggingface.co/BlinkDL/rwkv7-g1/tree/main
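
If you prefer to script the download, here is a minimal sketch using the huggingface_hub Python library. The filename is a placeholder: substitute the checkpoint you picked from the repo listing.

from huggingface_hub import hf_hub_download

# Fetch one checkpoint from the official repo.
# "MODEL_FILE.pth" is a placeholder; replace it with the file you chose on the hub.
model_path = hf_hub_download(
    repo_id="BlinkDL/rwkv7-g1",
    filename="MODEL_FILE.pth",
    local_dir="./models",
)
print(model_path)  # local path to pass to --model_path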

2. Set Up Environment

For best performance, we strongly recommend using Python 3.14t (free-threaded) via uv.

# Clone the repository
git clone --recurse-submodules https://github.com/leonsama/chirrup.git

# Create a Python 3.14t virtual environment
uv venv --python 3.14t

# Activate it
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate     # Windows

# Install Chirrup
uv pip install -e .

# Install dependencies with CUDA 12.9 support and dev tools
uv sync --extra torch-cu129 --dev

💡 You may use torch-cu126 instead if your system requires it, or customize the PyTorch backend in pyproject.toml.
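
This project's actual pyproject.toml layout may differ, but as a rough sketch, uv's documented pattern for pinning torch to a specific CUDA wheel index looks like this (extras names taken from the commands above):

[project.optional-dependencies]
torch-cu129 = ["torch"]

[tool.uv.sources]
# Pull torch from the CUDA 12.9 index only when the torch-cu129 extra is selected.
torch = [{ index = "pytorch-cu129", extra = "torch-cu129" }]

[[tool.uv.index]]
name = "pytorch-cu129"
url = "https://download.pytorch.org/whl/cu129"
explicit = true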

For ROCm users

If you are on a ROCm device, use the following script to install dependencies:

git clone --recurse-submodules https://github.com/leonsama/chirrup.git
uv venv --python 3.14t
source .venv/bin/activate
uv sync --extra dev
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4

🌐 Start API Service

Quick Start

# Currently, `triton._C.libtriton` doesn't declare itself GIL-safe, but it actually works fine—so we
# manually disable the GIL with `PYTHON_GIL=0`.
PYTHON_GIL=0 uv run --frozen python -m chirrup.web_service.app --model_path /path/to/your/model

The service will start at http://127.0.0.1:8000, providing OpenAI-compatible API endpoints.
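
As a quick smoke test, any OpenAI-compatible client should work against the local endpoint. Below is a minimal sketch with the official openai Python package; the /v1 base path and the model name are assumptions, so consult the API documentation below for the exact values.

from openai import OpenAI

# Point the client at the local Chirrup service; the API key is unused locally.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="rwkv7-g1",  # placeholder model name; check the API docs for the expected value
    messages=[{"role": "user", "content": "Why is 42 an interesting number?"}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)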

📖 Detailed Documentation: See the Chirrup API Documentation for the complete command-line parameters and API reference.


🧪 Run Demos

Stream Output (Single Request)

Demo:

PYTHON_GIL=0 uv run --frozen test/demo_stream_output.py --model_path /path/to/your/model

Code Example:

import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_path = "/path/to/your/model"  # same path you pass on the command line

    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    await engine_core.init(worker_num=1, model_config=model_config, batch_size=4)

    prompt = "User: Why is 42 an interesting number?\n\nAssistant:"
    completion = engine_core.completion(prompt)

    print(prompt, end="", flush=True)
    # "token" events carry the decoded text at index 2 of the event tuple.
    async for event in completion:
        if event[0] == "token":
            print(event[2], end="", flush=True)


asyncio.run(main())
Batch Inference (Concurrent Requests)

Demo:

PYTHON_GIL=0 uv run --frozen test/demo_batch_output.py --model_path /path/to/your/model --batch_size 32 --task_num 512 --worker_num 4

Code Example:

import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_path = "/path/to/your/model"  # same path you pass on the command line

    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    # batch_size = max_batch + 1
    await engine_core.init(worker_num=4, model_config=model_config, batch_size=33)

    prompts = [
        f"User: Why is {i} an interesting number?\n\nAssistant: <think>\n</think>"
        for i in range(512)
    ]

    # Submit all requests at once; the engine batches them across the 4 workers.
    results = await asyncio.gather(
        *[engine_core.completion(prompt).get_full_completion() for prompt in prompts]
    )
    print(results[0])


asyncio.run(main())

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

🙏 Acknowledgments

  • Albatross, the high-performance RWKV inference engine that Chirrup builds on.

🐦 Like a chirping bird — lightweight, fast, and always responsive.
Built with ❤️ for the RWKV ecosystem
