Distributed LLM inference runtime for sovereign private AI.
Run an 8-billion-parameter language model in 1.75 GB of model weights on a $300 consumer GPU. Zero cloud dependency. Federated across any fleet of heterogeneous GPUs. Built from scratch in Go + CUDA.
Quick start · Headline results · Models · API · Distributed mode · Architecture · Benchmarks
LLM inference is becoming the dominant cost of AI deployment. Today's production runtimes (vLLM, SGLang, TGI) assume a single administrative domain: one owner, one cloud, one data-center rack. That model is fast but forces every inference workload through a small number of cloud providers.
NeuroGrid is an inference runtime that treats any GPU fleet — cloud, consumer, on-prem, or federated across organizations — as a unified inference pool. Three concrete properties:
- Memory-bound decoding exploited at the extreme. A 1.58-bit ternary weight format (TQ2_0) shrinks an 8B model to 1.75 GB. The bandwidth ceiling on a 616 GB/s RTX 2080 Ti becomes an 8B-parameter ceiling, not a 350M-parameter one.
- P2P layer sharding without a control plane. libp2p handles peer discovery; a VRAM-aware scheduler assigns layers proportional to each peer's free memory. Heterogeneous clusters (GH200 + RTX 4090 + RTX 2080) work without manual partitioning.
- Native GGUF + CUDA kernels from first principles. No llama.cpp dependency, no Python quantization servers. A single static binary.
The repository is simultaneously (a) a working inference engine, (b) a reference implementation of 1.58-bit ternary GEMM on CUDA (sm_75, sm_89, sm_90), and (c) the execution artifact behind a research program on sovereign LLM infrastructure.
## Headline results

| Model | Size | Hardware | Throughput | Quality |
|---|---|---|---|---|
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 2080 Ti (2018, $300 used) | 17 tok/s | IFEval 81.8 · MuSR 56.2 |
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 4090 | 47 tok/s | same weights |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | ~60 tok/s | 10/10 golden prompts match llama.cpp |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | — | PPL 22.66 on WikiText-103 (F16 ref: 18.61) |
For comparison, Qwen3-8B in FP16 is 16.38 GB and does not fit on a 2080 Ti. Ternary-Bonsai-8B is 9.4× smaller, loses only 4.8 average points on the Open LLM Leaderboard v2, and beats Qwen3-8B FP16 outright on IFEval (81.8 vs 81.5) and MuSR (56.2 vs 55.0).
Published benchmark scores are from PrismML; the throughput and PPL numbers above were measured on this engine. See docs/HARDWARE_TARGETS_BONSAI.md for the roofline analysis and RELEASE.md for per-version detail.
## Quick start

```shell
# 1. Clone and build (CUDA 12+, Go 1.21+)
git clone https://github.com/leeaandrob/neurogrid && cd neurogrid
make

# 2. Download a ternary model from HuggingFace
huggingface-cli download prism-ml/Ternary-Bonsai-8B-GGUF \
  Ternary-Bonsai-8B-Q2_0.gguf \
  --local-dir ./models/ternary-bonsai-8b-gguf

# 3. Run (OpenAI-compatible API on port 8766)
LD_LIBRARY_PATH=./build:/usr/local/cuda/lib64 \
./build/neurogrid \
  -model ./models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -model-name bonsai-8b \
  -http-port 8766 \
  -max-seq-len 4096

# 4. Query
curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"bonsai-8b","messages":[{"role":"user","content":"What is 15*37?"}],"max_tokens":400,"temperature":0}'
```

Auto-detected model types: bonsai-1.7b, bonsai-8b, qwen3-8b, lfm2-1.2b-thinking, llama-7b, mistral-7b, gemma-4-31b, tinyllama. `-model-name` overrides auto-detection.
## Models

The engine supports the following classes of weights:
GGUF (single-file, ternary + standard precisions):
| Model | Weight format | File size | Hardware floor |
|---|---|---|---|
| Ternary-Bonsai-1.7B | TQ2_0 (1.58-bit) | 0.43 GB | GTX 1060 (6 GB) |
| Ternary-Bonsai-8B | TQ2_0 (1.58-bit) | 1.75 GB | GTX 1080 Ti (11 GB) |
| Qwen3-8B | Q2_0 / Q4_K_M | 1.8–5.2 GB | depends on quant |
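The exact TQ2_0 bit layout lives in the GGUF parser; as an illustration of how 1.58-bit ternary weights pack into 2-bit codes (roughly 2.06 bits/weight once the per-block scale is counted), here is a simplified Go sketch. It is illustrative only: the real format interleaves codes for SIMD-friendly access, which this ignores.

```go
package main

import "fmt"

// dequantBlock expands one simplified ternary block: 256 weights in
// {-1, 0, +1}, each stored as the 2-bit code (w + 1), plus one scale
// per block. Not the actual TQ2_0 lane ordering.
func dequantBlock(packed [64]byte, scale float32) [256]float32 {
	var out [256]float32
	for i := 0; i < 256; i++ {
		code := (packed[i/4] >> uint((i%4)*2)) & 0x3 // codes 0, 1, 2
		out[i] = scale * (float32(code) - 1)         // -> -scale, 0, +scale
	}
	return out
}

func main() {
	var packed [64]byte
	packed[0] = 0b00_10_01_00 // codes 0,1,2,0 -> weights -1, 0, +1, -1
	w := dequantBlock(packed, 0.02)
	fmt.Println(w[0], w[1], w[2], w[3])
}
```

Packing trits at 2 bits wastes the fourth code point, but keeps dequantization branch-free, which is what the CUDA GEMV path needs.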
HuggingFace native (config + safetensors + tokenizer):
| Model | Weight format | VRAM | Status |
|---|---|---|---|
| LFM2.5-1.2B-Thinking | BF16 native | ~4 GB | 279 tok/s on RTX 4090 |
| Qwen2.5-7B-Instruct | BF16 / AWQ INT4 | 15 GB / 5 GB | Validated |
| Gemma 4 31B-it | BF16 | ~65 GB | Validated on GH200 |
| Llama 2 / 3 / 3.1 | BF16 / FP16 | varies | Tested |
| Mistral 7B / Nemo | BF16 / INT8 | 14–25 GB | Tested |
Generic HuggingFace download:

```shell
make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=mistralai/Mistral-Nemo-Instruct-2407

# Gated models
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct
```

## API

OpenAI-compatible endpoints: `/v1/chat/completions` (streaming SSE and non-streaming), `/v1/completions` with full logprobs support (for evaluation harnesses), `/v1/models`, `/health`, and `/metrics` (Prometheus).
```shell
curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bonsai-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0,
    "stream": false
  }'
```

Thinking models (LFM2, Qwen3 with thinking enabled, Bonsai-8B) expose `reasoning_content` on the response message alongside `content`.
## Distributed mode

Pipeline parallelism over libp2p, VRAM-aware layer assignment, and a per-layer KV cache on each worker.
```shell
# Coordinator (GH200 — start first)
./build/neurogrid \
  -model models/mistral-nemo-instruct-2407 \
  -http-port 8090 -p2p-port 9000 \
  -min-peers 2 \
  -skip-weight-transfer -disable-mdns

# Worker (RTX 4090)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9001 \
  -wait-for-assignment

# Worker (RTX 2080 Ti)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9002 \
  -wait-for-assignment
```

Each worker reports its free VRAM to the coordinator on connect; the scheduler assigns contiguous layer ranges proportional to available memory. Heterogeneous clusters work without manual partitioning.
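The actual scheduler lives in `pkg/scheduler`; the proportional-assignment idea can be sketched as follows. This is a hypothetical simplification: the real code must also budget for per-layer KV cache and activation memory, not just weight bytes.

```go
package main

import "fmt"

// assignLayers gives each peer a contiguous range of layers sized in
// proportion to its reported free VRAM, pushing rounding remainder to
// the last peer so every layer is assigned exactly once.
// Ranges are half-open [start, end).
func assignLayers(totalLayers int, freeVRAM []float64) [][2]int {
	var total float64
	for _, v := range freeVRAM {
		total += v
	}
	ranges := make([][2]int, len(freeVRAM))
	start := 0
	for i, v := range freeVRAM {
		n := int(float64(totalLayers) * v / total)
		if i == len(freeVRAM)-1 {
			n = totalLayers - start // absorb rounding remainder
		}
		ranges[i] = [2]int{start, start + n}
		start += n
	}
	return ranges
}

func main() {
	// GH200 (96 GB free), RTX 4090 (24 GB), RTX 2080 Ti (11 GB)
	fmt.Println(assignLayers(40, []float64{96, 24, 11}))
}
```

Contiguous ranges matter for pipeline parallelism: each worker then exchanges activations with exactly two neighbors instead of scattering layers across the fleet.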
See docs/architecture/p2p-networking.md for the protocol, docs/architecture/transport-layer.md for activation exchange, and docs/architecture/multi-device-context.md for multi-GPU memory management.
## Architecture

```
neurogrid/
├── cmd/neurogrid/       Coordinator binary (HTTP + P2P)
├── cmd/worker/          Worker binary (layer execution)
├── cmd/bonsai-e2e/      Golden-set correctness harness
├── cmd/bonsai-ppl/      Strided perplexity evaluation
├── gpu/cuda/            CUDA kernels (attention, RoPE, ternary GEMM,
│                        paged attention, flash attn v2, ...)
├── gpu/engine/          Full forward-pass kernels (layer, decode_all)
├── gpu/bindings/        Go ↔ CUDA FFI (cgo)
├── pkg/inference/       Engine, batch scheduler, paged KV cache,
│                        sampler, generation loop
├── pkg/model/           GGUF parser, tokenizers (BPE, SentencePiece),
│                        chat templates, weight loaders
├── pkg/scheduler/       VRAM-aware layer assignment, pipeline parallelism
├── api/                 OpenAI-compatible HTTP layer (SSE streaming,
│                        reasoning_content, logprobs)
├── p2p/                 libp2p discovery + tensor transfer protocol
├── benchmarks/bonsai/   Roofline-validated throughput benchmark
├── scripts/bonsai/      Golden set + PPL scripts
└── docs/                Architecture docs, ADRs, kernel design notes
```
## Benchmarks

Three reproducible harnesses ship with the repo:

- Throughput — `benchmarks/bonsai/` (Go). Compares the llama.cpp-prism baseline against NeuroGrid and validates against the theoretical memory-bandwidth roofline.
- Correctness — `cmd/bonsai-e2e`. Runs 10 golden prompts at temperature 0 and compares token-level output against the llama.cpp FP16 GPU reference. Exits 0 only when all prompts match.
- Perplexity — `cmd/bonsai-ppl`. Strided WikiText-103 evaluation (stride=256, ctx=512), matching llama.cpp's default methodology.
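The strided evaluation can be pictured as follows. This sketch is not the `bonsai-ppl` code itself; under the assumption that stride divides the window layout as described, it computes which token positions each window scores, so that every token is counted once with at least ctx - stride tokens of preceding context:

```go
package main

import "fmt"

// scoredSpans returns, per evaluation window, the half-open [begin, end)
// range of positions whose negative log-likelihood is accumulated.
// Windows of ctx tokens advance by stride; positions re-seen by a later
// window serve as context only, so each token is scored exactly once.
// Perplexity is then exp(total NLL / total scored tokens).
func scoredSpans(nTokens, ctx, stride int) [][2]int {
	var spans [][2]int
	for begin := 0; begin < nTokens; begin += stride {
		end := begin + ctx
		if end > nTokens {
			end = nTokens
		}
		scoreFrom := begin
		if begin > 0 {
			scoreFrom = begin + ctx - stride // context-only prefix
		}
		if scoreFrom < end {
			spans = append(spans, [2]int{scoreFrom, end})
		}
		if end == nTokens {
			break
		}
	}
	return spans
}

func main() {
	fmt.Println(scoredSpans(1024, 512, 256))
}
```

Smaller strides give each scored token more context (lower, more flattering PPL) at the cost of more forward passes, which is why reporting stride and ctx alongside the number matters.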
```shell
# Throughput — choose a hardware profile
go run ./benchmarks/bonsai \
  --runtime=neurogrid \
  --hardware=rtx-4090 \
  --model=models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf

# Correctness (RTX 2080 Ti or 4090)
make build
./build/bonsai-e2e -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf

# PPL
./build/bonsai-ppl -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf \
  -stride 256 -ctx 512 -max-windows 20
```

| Requirement | Version |
|---|---|
| Go | 1.21+ |
| CUDA Toolkit | 12.x (13.x on host matches device automatically) |
| GPU | NVIDIA compute capability 7.5+ (RTX 20/30/40/50, GH200) |
| OS | Linux (tested on Fedora 43, Ubuntu 22.04/24.04) |
This codebase is the execution artifact for a research program on distributed LLM inference for sovereign private AI. The work is oriented toward three deliverables:
- A reference implementation of 1.58-bit ternary inference on CUDA with sm_75 / sm_89 / sm_90 paths (this repository).
- A benchmarking methodology that reports quality + throughput + cost jointly, rather than in isolation.
- A forthcoming paper positioning distributed ternary inference against vLLM / SGLang baselines on heterogeneous hardware.
If you are a researcher, reviewer, or potential collaborator, the entry points to the technical depth are:
- `docs/refs/CUDA_TERNARY_KERNEL_DESIGN.md` — GEMV + MMA design for TQ2_0
- `docs/refs/GGUF_FORMAT_DESIGN.md` — native GGUF parsing
- `docs/refs/YARN_SUPPORT_STATUS.md` — RoPE scaling implementation
- `docs/architecture/decisions/` — ADRs for major design choices
A preprint is in preparation. In the interim:

```bibtex
@software{neurogrid2026,
  author = {Barbosa, Leandro},
  title  = {NeuroGrid: Distributed LLM Inference Runtime for Sovereign Private AI},
  year   = {2026},
  url    = {https://github.com/leeaandrob/neurogrid}
}
```

Source Available License with Academic & Educational Use Grant:
- Free for students, researchers, and academic use.
- Free for personal learning and non-commercial projects.
- Commercial use requires a license — contact leandrobar93@gmail.com.