
Chirrup

/ˈCHirəp/ — (especially of a small bird) make repeated short high-pitched sounds; twitter.

Chirrup is a high-performance inference frontend for RWKV models, built on top of Albatross.


📊 Performance

November 12, 2025

| GPU Configuration    | Model | Workers | BSZ/Worker | Total Concurrent Requests | TPS per Request |
|----------------------|-------|---------|------------|---------------------------|-----------------|
| 4 × RTX 4090 24GB    | 7.2B  | 4       | 200        | 800                       | 16              |
| 4 × Tesla V100 16GB  | 7.2B  | 4       | 34         | 136                       | 17              |

Note: The RTX 4090 configuration is still far from the GPUs' processing limits; significant optimization headroom remains.

✨ Features

✅ Implemented

  • High Performance: Leverages the blazing-fast inference engine from Albatross.
  • Continuous Batching: Maximizes GPU utilization by dynamically batching incoming requests.
  • State Cache: Reuses computation states for long-context inputs, significantly improving throughput as context length increases.
  • OpenAI-Compatible API: Drop-in replacement for existing LLM workflows — no code changes needed.

🔜 Planned

  • CUDA Graph support for reduced kernel launch overhead
  • Prefill-Decode separation for optimized scheduling
  • Constrained decoding (e.g., JSON schema)
  • Function Calling support
  • Pipeline parallelism to enable inference of even larger models

🚀 Getting Started

1. Download a Model

Visit the official model hub and download an RWKV-7 g1 series model that fits your needs:
👉 https://huggingface.co/BlinkDL/rwkv7-g1/tree/main
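
If you prefer to script the download, here is a minimal sketch using the huggingface_hub Python library. The filename is a placeholder: substitute the checkpoint you picked from the repo listing.

from huggingface_hub import hf_hub_download

# Fetch one checkpoint from the official repo.
# "MODEL_FILE.pth" is a placeholder; replace it with the file you chose on the hub.
model_path = hf_hub_download(
    repo_id="BlinkDL/rwkv7-g1",
    filename="MODEL_FILE.pth",
    local_dir="./models",
)
print(model_path)  # local path to pass to --model_path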

2. Set Up Environment

For best performance, we strongly recommend using Python 3.14t (free-threaded) via uv.

# Clone the repository
git clone --recurse-submodules https://github.com/leonsama/chirrup.git

# Create a Python 3.14t virtual environment
uv venv --python 3.14t

# Activate it
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate     # Windows

# Install Chirrup
uv pip install -e .

# Install dependencies with CUDA 12.9 support and dev tools
uv sync --extra torch-cu129 --dev

💡 You may use torch-cu126 instead if your system requires it, or customize the PyTorch backend in pyproject.toml.
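
This project's actual pyproject.toml layout may differ, but as a rough sketch, uv's documented pattern for pinning torch to a specific CUDA wheel index looks like this (extras names taken from the commands above):

[project.optional-dependencies]
torch-cu129 = ["torch"]

[tool.uv.sources]
# Pull torch from the CUDA 12.9 index only when the torch-cu129 extra is selected.
torch = [{ index = "pytorch-cu129", extra = "torch-cu129" }]

[[tool.uv.index]]
name = "pytorch-cu129"
url = "https://download.pytorch.org/whl/cu129"
explicit = true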

For ROCm users

If you are on a ROCm device, use the following script to install dependencies:

git clone --recurse-submodules https://github.com/leonsama/chirrup.git
uv venv --python 3.14t
source .venv/bin/activate
uv sync --extra dev
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4

🌐 Start API Service

Quick Start

# Currently, `triton._C.libtriton` doesn't declare itself GIL-safe, but it actually works fine—so we
# manually disable the GIL with `PYTHON_GIL=0`.
PYTHON_GIL=0 uv run --frozen python -m chirrup.web_service.app --model_path /path/to/your/model

The service will start at http://127.0.0.1:8000, providing OpenAI-compatible API endpoints.
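
As a quick smoke test, any OpenAI-compatible client should work against the local endpoint. Below is a minimal sketch with the official openai Python package; the /v1 base path and the model name are assumptions, so consult the API documentation below for the exact values.

from openai import OpenAI

# Point the client at the local Chirrup service; the API key is unused locally.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="rwkv7-g1",  # placeholder model name; check the API docs for the expected value
    messages=[{"role": "user", "content": "Why is 42 an interesting number?"}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)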

📖 Detailed Documentation: See the Chirrup API Documentation for the complete command-line parameters and API reference.


🧪 Run Demos

Stream Output (Single Request)

Demo:

PYTHON_GIL=0 uv run --frozen test/demo_stream_output.py --model_path /path/to/your/model

Code Example:

import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_path = "/path/to/your/model"  # same path you pass on the command line

    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    await engine_core.init(worker_num=1, model_config=model_config, batch_size=4)

    prompt = "User: Why is 42 an interesting number?\n\nAssistant:"
    completion = engine_core.completion(prompt)

    print(prompt, end="", flush=True)
    # "token" events carry the decoded text at index 2 of the event tuple.
    async for event in completion:
        if event[0] == "token":
            print(event[2], end="", flush=True)


asyncio.run(main())
Batch Inference (Concurrent Requests)

Demo:

PYTHON_GIL=0 uv run --frozen test/demo_batch_output.py --model_path /path/to/your/model --batch_size 32 --task_num 512 --worker_num 4

Code Example:

import asyncio

from chirrup.engine_core import AsyncEngineCore
from chirrup.core_structure import ModelLoadConfig


async def main():
    model_path = "/path/to/your/model"  # same path you pass on the command line

    model_config = ModelLoadConfig(
        model_path=model_path,
        vocab_path="../Albatross/reference/rwkv_vocab_v20230424.txt",
        vocab_size=65536,
        head_size=64,
    )

    engine_core = AsyncEngineCore()
    # batch_size = max_batch + 1
    await engine_core.init(worker_num=4, model_config=model_config, batch_size=33)

    prompts = [
        f"User: Why is {i} an interesting number?\n\nAssistant: <think>\n</think>"
        for i in range(512)
    ]

    # Submit all requests at once; the engine batches them across the 4 workers.
    results = await asyncio.gather(
        *[engine_core.completion(prompt).get_full_completion() for prompt in prompts]
    )
    print(results[0])


asyncio.run(main())

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

🙏 Acknowledgments

  • Albatross, the high-performance RWKV inference engine that Chirrup builds on.

🐦 Like a chirping bird — lightweight, fast, and always responsive.
Built with ❤️ for the RWKV ecosystem
