Distributed LLM inference runtime for sovereign private AI.
Run an 8-billion-parameter language model in 1.75 GB of model weights on a $300 consumer GPU. Zero cloud dependency. Federated across any fleet of heterogeneous GPUs. Built from scratch in Go + CUDA.
Quick start · Headline results · Models · API · Distributed mode · Architecture · Benchmarks
LLM inference is becoming the dominant cost of AI deployment. Today's production runtimes (vLLM, SGLang, TGI) assume a single administrative domain: one owner, one cloud, one data-center rack. That model is fast but forces every inference workload through a small number of cloud providers.
NeuroGrid is an inference runtime that treats any GPU fleet — cloud, consumer, on-prem, or federated across organizations — as a unified inference pool. Three concrete properties:
- Memory-bound decoding exploited at the extreme. A 1.58-bit ternary weight format (TQ2_0) shrinks an 8B model to 1.75 GB. The bandwidth ceiling on a 616 GB/s RTX 2080 Ti becomes an 8B-parameter ceiling, not a 350M-parameter one.
- P2P layer sharding without a control plane. libp2p handles peer discovery; a VRAM-aware scheduler assigns layers proportional to each peer's free memory. Heterogeneous clusters (GH200 + RTX 4090 + RTX 2080) work without manual partitioning.
- Native GGUF + CUDA kernels from first principles. No llama.cpp dependency, no Python quantization servers. A single static binary.
The repository is simultaneously (a) a working inference engine, (b) a reference implementation of 1.58-bit ternary GEMM on CUDA (sm_75, sm_89, sm_90), and (c) the execution artifact behind a research program on sovereign LLM infrastructure.
## Headline results

| Model | Size | Hardware | Throughput | Quality |
|---|---|---|---|---|
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 2080 Ti (2018, $300 used) | 17 tok/s | IFEval 81.8 · MuSR 56.2 |
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 4090 | 47 tok/s | same weights |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | ~60 tok/s | 10/10 golden prompts match llama.cpp |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | — | PPL 22.66 on WikiText-103 (F16 ref: 18.61) |
For comparison, Qwen3-8B in FP16 is 16.38 GB and does not fit on a 2080 Ti. Ternary-Bonsai-8B is 9.4× smaller, loses only 4.8 average points on the Open LLM Leaderboard v2, and beats Qwen3-8B FP16 outright on IFEval (81.8 vs 81.5) and MuSR (56.2 vs 55.0).
Published benchmark scores are from PrismML; the throughput and PPL numbers above were measured on this engine. See docs/HARDWARE_TARGETS_BONSAI.md for the roofline analysis and RELEASE.md for per-version detail.
## Quick start

```shell
# 1. Clone and build (CUDA 12+, Go 1.21+)
git clone https://github.com/leeaandrob/neurogrid && cd neurogrid
make

# 2. Download a ternary model from HuggingFace
huggingface-cli download prism-ml/Ternary-Bonsai-8B-GGUF \
  Ternary-Bonsai-8B-Q2_0.gguf \
  --local-dir ./models/ternary-bonsai-8b-gguf

# 3. Run (OpenAI-compatible API on port 8766)
LD_LIBRARY_PATH=./build:/usr/local/cuda/lib64 \
./build/neurogrid \
  -model ./models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -model-name bonsai-8b \
  -http-port 8766 \
  -max-seq-len 4096

# 4. Query
curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"bonsai-8b","messages":[{"role":"user","content":"What is 15*37?"}],"max_tokens":400,"temperature":0}'
```

Auto-detected model types: bonsai-1.7b, bonsai-8b, qwen3-8b, lfm2-1.2b-thinking, llama-7b, mistral-7b, gemma-4-31b, tinyllama. `-model-name` overrides auto-detection.
## Models

The engine supports the following classes of weights:
GGUF (single-file, ternary + standard precisions):
| Model | Weight format | File size | Hardware floor |
|---|---|---|---|
| Ternary-Bonsai-1.7B | TQ2_0 (1.58-bit) | 0.43 GB | GTX 1060 (6 GB) |
| Ternary-Bonsai-8B | TQ2_0 (1.58-bit) | 1.75 GB | GTX 1080 Ti (11 GB) |
| Qwen3-8B | Q2_0 / Q4_K_M | 1.8–5.2 GB | depends on quant |
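The exact TQ2_0 bit layout lives in the GGUF parser; as an illustration of how 1.58-bit ternary weights pack into 2-bit codes (roughly 2.06 bits/weight once the per-block scale is counted), here is a simplified Go sketch. It is illustrative only: the real format interleaves codes for SIMD-friendly access, which this ignores.

```go
package main

import "fmt"

// dequantBlock expands one simplified ternary block: 256 weights in
// {-1, 0, +1}, each stored as the 2-bit code (w + 1), plus one scale
// per block. Not the actual TQ2_0 lane ordering.
func dequantBlock(packed [64]byte, scale float32) [256]float32 {
	var out [256]float32
	for i := 0; i < 256; i++ {
		code := (packed[i/4] >> uint((i%4)*2)) & 0x3 // codes 0, 1, 2
		out[i] = scale * (float32(code) - 1)         // -> -scale, 0, +scale
	}
	return out
}

func main() {
	var packed [64]byte
	packed[0] = 0b00_10_01_00 // codes 0,1,2,0 -> weights -1, 0, +1, -1
	w := dequantBlock(packed, 0.02)
	fmt.Println(w[0], w[1], w[2], w[3])
}
```

Packing trits at 2 bits wastes the fourth code point, but keeps dequantization branch-free, which is what the CUDA GEMV path needs.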
HuggingFace native (config + safetensors + tokenizer):
| Model | Weight format | VRAM | Status |
|---|---|---|---|
| LFM2.5-1.2B-Thinking | BF16 native | ~4 GB | 279 tok/s on RTX 4090 |
| Qwen2.5-7B-Instruct | BF16 / AWQ INT4 | 15 GB / 5 GB | Validated |
| Gemma 4 31B-it | BF16 | ~65 GB | Validated on GH200 |
| Llama 2 / 3 / 3.1 | BF16 / FP16 | varies | Tested |
| Mistral 7B / Nemo | BF16 / INT8 | 14–25 GB | Tested |
Generic HuggingFace download:

```shell
make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=mistralai/Mistral-Nemo-Instruct-2407

# Gated models
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct
```

## API

OpenAI-compatible endpoints: `/v1/chat/completions` (streaming SSE and non-streaming), `/v1/completions` with full logprobs support (for evaluation harnesses), `/v1/models`, `/health`, and `/metrics` (Prometheus).
```shell
curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bonsai-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0,
    "stream": false
  }'
```

Thinking models (LFM2, Qwen3 with thinking enabled, Bonsai-8B) expose `reasoning_content` on the response message alongside `content`.
## Distributed mode

Pipeline parallelism over libp2p, VRAM-aware layer assignment, and a per-layer KV cache on each worker.
```shell
# Coordinator (GH200 — start first)
./build/neurogrid \
  -model models/mistral-nemo-instruct-2407 \
  -http-port 8090 -p2p-port 9000 \
  -min-peers 2 \
  -skip-weight-transfer -disable-mdns

# Worker (RTX 4090)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9001 \
  -wait-for-assignment

# Worker (RTX 2080 Ti)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9002 \
  -wait-for-assignment
```

Each worker reports its free VRAM to the coordinator on connect; the scheduler assigns contiguous layer ranges proportional to available memory. Heterogeneous clusters work without manual partitioning.
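The actual scheduler lives in `pkg/scheduler`; the proportional-assignment idea can be sketched as follows. This is a hypothetical simplification: the real code must also budget for per-layer KV cache and activation memory, not just weight bytes.

```go
package main

import "fmt"

// assignLayers gives each peer a contiguous range of layers sized in
// proportion to its reported free VRAM, pushing rounding remainder to
// the last peer so every layer is assigned exactly once.
// Ranges are half-open [start, end).
func assignLayers(totalLayers int, freeVRAM []float64) [][2]int {
	var total float64
	for _, v := range freeVRAM {
		total += v
	}
	ranges := make([][2]int, len(freeVRAM))
	start := 0
	for i, v := range freeVRAM {
		n := int(float64(totalLayers) * v / total)
		if i == len(freeVRAM)-1 {
			n = totalLayers - start // absorb rounding remainder
		}
		ranges[i] = [2]int{start, start + n}
		start += n
	}
	return ranges
}

func main() {
	// GH200 (96 GB free), RTX 4090 (24 GB), RTX 2080 Ti (11 GB)
	fmt.Println(assignLayers(40, []float64{96, 24, 11}))
}
```

Contiguous ranges matter for pipeline parallelism: each worker then exchanges activations with exactly two neighbors instead of scattering layers across the fleet.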
See docs/architecture/p2p-networking.md for the protocol, docs/architecture/transport-layer.md for activation exchange, and docs/architecture/multi-device-context.md for multi-GPU memory management.
## Architecture

```
neurogrid/
├── cmd/neurogrid/       Coordinator binary (HTTP + P2P)
├── cmd/worker/          Worker binary (layer execution)
├── cmd/bonsai-e2e/      Golden-set correctness harness
├── cmd/bonsai-ppl/      Strided perplexity evaluation
├── gpu/cuda/            CUDA kernels (attention, RoPE, ternary GEMM,
│                        paged attention, flash attn v2, ...)
├── gpu/engine/          Full forward-pass kernels (layer, decode_all)
├── gpu/bindings/        Go ↔ CUDA FFI (cgo)
├── pkg/inference/       Engine, batch scheduler, paged KV cache,
│                        sampler, generation loop
├── pkg/model/           GGUF parser, tokenizers (BPE, SentencePiece),
│                        chat templates, weight loaders
├── pkg/scheduler/       VRAM-aware layer assignment, pipeline parallelism
├── api/                 OpenAI-compatible HTTP layer (SSE streaming,
│                        reasoning_content, logprobs)
├── p2p/                 libp2p discovery + tensor transfer protocol
├── benchmarks/bonsai/   Roofline-validated throughput benchmark
├── scripts/bonsai/      Golden set + PPL scripts
└── docs/                Architecture docs, ADRs, kernel design notes
```
## Benchmarks

Three reproducible harnesses ship with the repo:

- Throughput — `benchmarks/bonsai/` (Go). Compares the llama.cpp-prism baseline against NeuroGrid and validates against the theoretical memory-bandwidth roofline.
- Correctness — `cmd/bonsai-e2e`. Runs 10 golden prompts at temperature 0 and compares token-level output against the llama.cpp FP16 GPU reference. Exits 0 only when all prompts match.
- Perplexity — `cmd/bonsai-ppl`. Strided WikiText-103 evaluation (stride=256, ctx=512), matching llama.cpp's default methodology.
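The strided evaluation can be pictured as follows. This sketch is not the `bonsai-ppl` code itself; under the assumption that stride divides the window layout as described, it computes which token positions each window scores, so that every token is counted once with at least ctx - stride tokens of preceding context:

```go
package main

import "fmt"

// scoredSpans returns, per evaluation window, the half-open [begin, end)
// range of positions whose negative log-likelihood is accumulated.
// Windows of ctx tokens advance by stride; positions re-seen by a later
// window serve as context only, so each token is scored exactly once.
// Perplexity is then exp(total NLL / total scored tokens).
func scoredSpans(nTokens, ctx, stride int) [][2]int {
	var spans [][2]int
	for begin := 0; begin < nTokens; begin += stride {
		end := begin + ctx
		if end > nTokens {
			end = nTokens
		}
		scoreFrom := begin
		if begin > 0 {
			scoreFrom = begin + ctx - stride // context-only prefix
		}
		if scoreFrom < end {
			spans = append(spans, [2]int{scoreFrom, end})
		}
		if end == nTokens {
			break
		}
	}
	return spans
}

func main() {
	fmt.Println(scoredSpans(1024, 512, 256))
}
```

Smaller strides give each scored token more context (lower, more flattering PPL) at the cost of more forward passes, which is why reporting stride and ctx alongside the number matters.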
```shell
# Throughput — choose a hardware profile
go run ./benchmarks/bonsai \
  --runtime=neurogrid \
  --hardware=rtx-4090 \
  --model=models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf

# Correctness (RTX 2080 Ti or 4090)
make build
./build/bonsai-e2e -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf

# PPL
./build/bonsai-ppl -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf \
  -stride 256 -ctx 512 -max-windows 20
```

| Requirement | Version |
|---|---|
| Go | 1.21+ |
| CUDA Toolkit | 12.x (13.x on host matches device automatically) |
| GPU | NVIDIA compute capability 7.5+ (RTX 20/30/40/50, GH200) |
| OS | Linux (tested on Fedora 43, Ubuntu 22.04/24.04) |
This codebase is the execution artifact for a research program on distributed LLM inference for sovereign private AI. The work is oriented toward three deliverables:
- A reference implementation of 1.58-bit ternary inference on CUDA with sm_75 / sm_89 / sm_90 paths (this repository).
- A benchmarking methodology that reports quality + throughput + cost jointly, rather than in isolation.
- A forthcoming paper positioning distributed ternary inference against vLLM / SGLang baselines on heterogeneous hardware.
If you are a researcher, reviewer, or potential collaborator, the entry points to the technical depth are:
- `docs/refs/CUDA_TERNARY_KERNEL_DESIGN.md` — GEMV + MMA design for TQ2_0
- `docs/refs/GGUF_FORMAT_DESIGN.md` — native GGUF parsing
- `docs/refs/YARN_SUPPORT_STATUS.md` — RoPE scaling implementation
- `docs/architecture/decisions/` — ADRs for major design choices
A preprint is in preparation. In the interim:

```bibtex
@software{neurogrid2026,
  author = {Barbosa, Leandro},
  title  = {NeuroGrid: Distributed LLM Inference Runtime for Sovereign Private AI},
  year   = {2026},
  url    = {https://github.com/leeaandrob/neurogrid}
}
```

Source Available License with Academic & Educational Use Grant:
- Free for students, researchers, and academic use.
- Free for personal learning and non-commercial projects.
- Commercial use requires a license — contact leandrobar93@gmail.com.