NeuroGrid

Distributed LLM inference runtime for sovereign private AI.

Run an 8-billion-parameter language model in 1.75 GB of model weights on a $300 consumer GPU. Zero cloud dependency. Federated across any fleet of heterogeneous GPUs. Built from scratch in Go + CUDA.

Quick start · Headline results · Models · API · Distributed mode · Architecture · Benchmarks


Thesis

LLM inference is becoming the dominant cost of AI deployment. Today's production runtimes (vLLM, SGLang, TGI) assume a single administrative domain: one owner, one cloud, one data-center rack. That model is fast but forces every inference workload through a small number of cloud providers.

NeuroGrid is an inference runtime that treats any GPU fleet — cloud, consumer, on-prem, or federated across organizations — as a unified inference pool. Three concrete properties:

  1. Memory-bound decoding exploited at the extreme. A 1.58-bit ternary weight format (TQ2_0) shrinks an 8B model to 1.75 GB. The bandwidth ceiling on a 616 GB/s RTX 2080 Ti becomes an 8B-parameter ceiling, not a 350M-parameter one.
  2. P2P layer sharding without a control plane. libp2p handles peer discovery; a VRAM-aware scheduler assigns layers proportional to each peer's free memory. Heterogeneous clusters (GH200 + RTX 4090 + RTX 2080) work without manual partitioning.
  3. Native GGUF + CUDA kernels from first principles. No llama.cpp dependency, no Python quantization servers. A single static binary.

The repository is simultaneously (a) a working inference engine, (b) a reference implementation of 1.58-bit ternary GEMM on CUDA (sm_75, sm_89, sm_90), and (c) the execution artifact behind a research program on sovereign LLM infrastructure.

Headline results

| Model | Size | Hardware | Throughput | Quality |
|---|---|---|---|---|
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 2080 Ti (2018, $300 used) | 17 tok/s | IFEval 81.8 · MuSR 56.2 |
| Ternary-Bonsai-8B (Q2_0) | 1.75 GB | RTX 4090 | 47 tok/s | same weights |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | ~60 tok/s | 10/10 golden prompts match llama.cpp |
| Ternary-Bonsai-1.7B (Q2_0) | 0.43 GB | RTX 2080 Ti | | PPL 22.66 on WikiText-103 (F16 ref: 18.61) |

For comparison, Qwen3-8B in FP16 is 16.38 GB and does not fit on a 2080 Ti. Ternary-Bonsai-8B is 9.4× smaller, loses only 4.8 average points on the Open LLM Leaderboard v2, and beats Qwen3-8B FP16 outright on IFEval (81.8 vs 81.5) and MuSR (56.2 vs 55.0).

Published benchmark scores are from PrismML; the throughput and PPL numbers above were measured on this engine. See docs/HARDWARE_TARGETS_BONSAI.md for roofline analysis and RELEASE.md for per-version detail.

Quick start

# 1. Clone and build (CUDA 12+, Go 1.21+)
git clone https://github.com/leeaandrob/neurogrid && cd neurogrid
make

# 2. Download a ternary model from HuggingFace
huggingface-cli download prism-ml/Ternary-Bonsai-8B-GGUF \
  Ternary-Bonsai-8B-Q2_0.gguf \
  --local-dir ./models/ternary-bonsai-8b-gguf

# 3. Run (OpenAI-compatible API on port 8766)
LD_LIBRARY_PATH=./build:/usr/local/cuda/lib64 \
  ./build/neurogrid \
  -model ./models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf \
  -model-name bonsai-8b \
  -http-port 8766 \
  -max-seq-len 4096

# 4. Query
curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"bonsai-8b","messages":[{"role":"user","content":"What is 15*37?"}],"max_tokens":400,"temperature":0}'

Auto-detected model types: bonsai-1.7b, bonsai-8b, qwen3-8b, lfm2-1.2b-thinking, llama-7b, mistral-7b, gemma-4-31b, tinyllama. -model-name overrides auto-detection.

Models

The engine supports three broad classes of weights:

GGUF (single-file, ternary + standard precisions):

| Model | Weight format | File size | Hardware floor |
|---|---|---|---|
| Ternary-Bonsai-1.7B | TQ2_0 (1.58-bit) | 0.43 GB | GTX 1060 (6 GB) |
| Ternary-Bonsai-8B | TQ2_0 (1.58-bit) | 1.75 GB | GTX 1080 Ti (11 GB) |
| Qwen3-8B | Q2_0 / Q4_K_M | 1.8–5.2 GB | depends on quant |
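
As a cross-check on the table, the effective bits per weight can be derived from the published file sizes. This is back-of-envelope arithmetic (assuming decimal GB), not a statement about the on-disk layout; the overhead above the nominal 1.58 bits comes from per-block scales and non-ternary tensors such as embeddings and norms:

```go
package main

import "fmt"

func main() {
	// Effective bits per weight = file size in bits / parameter count.
	models := []struct {
		name    string
		sizeGB  float64 // file size, decimal GB
		paramsB float64 // parameters, billions
	}{
		{"Ternary-Bonsai-1.7B", 0.43, 1.7},
		{"Ternary-Bonsai-8B", 1.75, 8.0},
	}
	for _, m := range models {
		bits := m.sizeGB * 8 / m.paramsB // GB * 8 Gbit/GB / Gparams
		fmt.Printf("%-20s %.2f bits/weight\n", m.name, bits)
	}
}
```

This lands at roughly 2.02 and 1.75 bits/weight respectively; the larger model amortizes its fixed-size non-ternary tensors better.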

HuggingFace native (config + safetensors + tokenizer):

| Model | Weight format | VRAM | Status |
|---|---|---|---|
| LFM2.5-1.2B-Thinking | BF16 native | ~4 GB | 279 tok/s on RTX 4090 |
| Qwen2.5-7B-Instruct | BF16 / AWQ INT4 | 15 GB / 5 GB | Validated |
| Gemma 4 31B-it | BF16 | ~65 GB | Validated on GH200 |
| Llama 2 / 3 / 3.1 | BF16 / FP16 | varies | Tested |
| Mistral 7B / Nemo | BF16 / INT8 | 14–25 GB | Tested |

Generic HuggingFace download:

make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=mistralai/Mistral-Nemo-Instruct-2407

# Gated models
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct

API

OpenAI-compatible endpoints:

  • /v1/chat/completions (streaming SSE and non-streaming)
  • /v1/completions (full logprobs support, for evaluation harnesses)
  • /v1/models
  • /health
  • /metrics (Prometheus)

curl http://localhost:8766/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "bonsai-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Capital of France?"}
    ],
    "max_tokens": 50,
    "temperature": 0,
    "stream": false
  }'

Thinking models (LFM2, Qwen3 with thinking enabled, Bonsai-8B) expose reasoning_content on the response message alongside content.

Distributed mode

Pipeline parallelism over libp2p, VRAM-aware layer assignment, per-layer KV cache on each worker.

# Coordinator (GH200 — start first)
./build/neurogrid \
  -model models/mistral-nemo-instruct-2407 \
  -http-port 8090 -p2p-port 9000 \
  -min-peers 2 \
  -skip-weight-transfer -disable-mdns

# Worker (RTX 4090)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9001 \
  -wait-for-assignment

# Worker (RTX 2080 Ti)
./build/worker \
  -bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
  -model models/mistral-nemo-instruct-2407 \
  -port 9002 \
  -wait-for-assignment

Each worker reports its free VRAM to the coordinator on connect; the scheduler assigns contiguous layer ranges proportional to available memory. Heterogeneous clusters work without manual partitioning.
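
The proportional policy is easy to picture with concrete numbers. The sketch below is a simplified stand-in for the scheduler in pkg/scheduler (assignLayers and its rounding are illustrative, not the engine's actual code):

```go
package main

import "fmt"

// assignLayers splits nLayers into contiguous [begin, end) ranges
// proportional to each peer's free VRAM, in peer order.
func assignLayers(nLayers int, freeVRAM []float64) [][2]int {
	total := 0.0
	for _, v := range freeVRAM {
		total += v
	}
	ranges := make([][2]int, len(freeVRAM))
	start, acc := 0, 0.0
	for i, v := range freeVRAM {
		acc += v
		end := int(float64(nLayers)*acc/total + 0.5) // round cumulative share
		if i == len(freeVRAM)-1 {
			end = nLayers // last peer absorbs rounding remainder
		}
		ranges[i] = [2]int{start, end}
		start = end
	}
	return ranges
}

func main() {
	// GH200 (96 GB), RTX 4090 (24 GB), RTX 2080 Ti (11 GB) sharing 40 layers.
	fmt.Println(assignLayers(40, []float64{96, 24, 11}))
	// prints [[0 29] [29 37] [37 40]]
}
```

Contiguous ranges matter for pipeline parallelism: each activation handoff crosses exactly one peer boundary per stage.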

See docs/architecture/p2p-networking.md for the protocol, docs/architecture/transport-layer.md for activation exchange, and docs/architecture/multi-device-context.md for multi-GPU memory management.

Architecture

neurogrid/
├── cmd/neurogrid/           Coordinator binary (HTTP + P2P)
├── cmd/worker/              Worker binary (layer execution)
├── cmd/bonsai-e2e/          Golden-set correctness harness
├── cmd/bonsai-ppl/          Strided perplexity evaluation
├── gpu/cuda/                CUDA kernels (attention, RoPE, ternary GEMM,
│                            paged attention, flash attn v2, ...)
├── gpu/engine/              Full forward-pass kernels (layer, decode_all)
├── gpu/bindings/            Go ↔ CUDA FFI (cgo)
├── pkg/inference/           Engine, batch scheduler, paged KV cache,
│                            sampler, generation loop
├── pkg/model/               GGUF parser, tokenizers (BPE, SentencePiece),
│                            chat templates, weight loaders
├── pkg/scheduler/           VRAM-aware layer assignment, pipeline
│                            parallelism
├── api/                     OpenAI-compatible HTTP layer (SSE streaming,
│                            reasoning_content, logprobs)
├── p2p/                     libp2p discovery + tensor transfer protocol
├── benchmarks/bonsai/       Roofline-validated throughput benchmark
├── scripts/bonsai/          Golden set + PPL scripts
└── docs/                    Architecture docs, ADRs, kernel design notes

Benchmarks

Three reproducible harnesses ship with the repo:

  1. Throughput: benchmarks/bonsai/ (Go). Compares the llama.cpp-prism baseline against NeuroGrid and validates results against the theoretical memory-bandwidth roofline.
  2. Correctness: cmd/bonsai-e2e. Runs 10 golden prompts at temp=0 and compares token-level output against the llama.cpp FP16 GPU reference. Exits 0 only when all prompts match.
  3. Perplexity: cmd/bonsai-ppl. Strided WikiText-103 evaluation (stride=256, ctx=512), matching llama.cpp's default methodology.

# Throughput — choose a hardware profile
go run ./benchmarks/bonsai \
    --runtime=neurogrid \
    --hardware=rtx-4090 \
    --model=models/ternary-bonsai-8b-gguf/Ternary-Bonsai-8B-Q2_0.gguf

# Correctness (RTX 2080 Ti or 4090)
make build
./build/bonsai-e2e -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf

# PPL
./build/bonsai-ppl -model models/ternary-bonsai-1.7b-gguf/Ternary-Bonsai-1.7B-Q2_0.gguf \
  -stride 256 -ctx 512 -max-windows 20

Requirements

| Requirement | Version |
|---|---|
| Go | 1.21+ |
| CUDA Toolkit | 12.x (13.x on host matches device automatically) |
| GPU | NVIDIA compute capability 7.5+ (RTX 20/30/40/50, GH200) |
| OS | Linux (tested on Fedora 43, Ubuntu 22.04/24.04) |

Research positioning

This codebase is the execution artifact for a research program on distributed LLM inference for sovereign private AI. The work is oriented toward three deliverables:

  • A reference implementation of 1.58-bit ternary inference on CUDA with sm_75 / sm_89 / sm_90 paths (this repository).
  • A benchmarking methodology that reports quality + throughput + cost jointly, rather than in isolation.
  • A forthcoming paper positioning distributed ternary inference against vLLM / SGLang baselines on heterogeneous hardware.

If you are a researcher, reviewer, or potential collaborator, the entry points to the technical depth are:

  • docs/refs/CUDA_TERNARY_KERNEL_DESIGN.md — GEMV + MMA design for TQ2_0
  • docs/refs/GGUF_FORMAT_DESIGN.md — native GGUF parsing
  • docs/refs/YARN_SUPPORT_STATUS.md — RoPE scaling implementation
  • docs/architecture/decisions/ — ADRs for major design choices

Citation

A preprint is in preparation. In the interim:

@software{neurogrid2026,
  author = {Barbosa, Leandro},
  title  = {NeuroGrid: Distributed LLM Inference Runtime for Sovereign Private AI},
  year   = {2026},
  url    = {https://github.com/leeaandrob/neurogrid}
}

License

Source Available License with Academic & Educational Use Grant

  • Free for students, researchers, and academic use.
  • Free for personal learning and non-commercial projects.
  • Commercial use requires a license — contact leandrobar93@gmail.com.

Acknowledgments

  • PrismML — Ternary-Bonsai weights
  • llama.cpp — GGUF format + reference kernels
  • cuBLAS — GPU BLAS backend
  • libp2p — P2P networking substrate
