Butterfly Network Attention (BNA) is a training-free sparse-attention runtime for long-context inference, aimed at engineers who want measurable speed or memory wins without retraining the model.
- A PyTorch package, `bna`, for sparse-attention research and integration work
- CUDA and MLX benchmark scripts for Qwen, GLM, GPT-2, and related paths
- Measured benchmark artifacts under `benchmarks/`, `results/`, and `notes/`
- Older docs and scripts that still use the legacy names `Wayfinder` and `HCSA`
Public naming note: Butterfly / BNA is the current public project name. Wayfinder / HCSA are legacy names still present in deeper docs, scripts, benchmark artifact paths, and archived research material.
| Tier | What to trust | Evidence |
|---|---|---|
| Validated | GLM-4.7-Flash-4bit on MLX at the public stable profile | docs/FIRST_RELEASE.md |
| Experimental | Qwen 3.5 CUDA block-sparse path and long-context scaling work | scripts/bench_qwen35_cuda_wayfinder.py, benchmarks/cuda/qwen35_wayfinder/ |
| Experimental | Qwen 3.5 MLX / Apple Silicon path | scripts/bench_qwen_consumer_mlx.py, docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md, results/benchmarks/ |
| Research / archive | Older Wayfinder/HCSA docs, prompts, and exploratory runs | docs/, notes/, archive/ |
If you are new to the project, start from the validated GLM path first. The Qwen work is promising, but it should still be read as active engineering rather than a locked public release.
Dense causal attention does O(T^2) work per layer. Butterfly replaces that with a bounded sparse pattern over fixed-size token blocks.
At a high level, each block attends to:
- its local neighborhood
- a small number of deterministic long-range partners
- optional global or anchor-style connections, depending on the backend
The exact sparse pattern differs across code paths. Older Wayfinder/HCSA integrations describe this as window + cycle + landmarks; the current Butterfly README uses the simpler butterfly-partner framing. In both cases the goal is the same: keep attention neighborhoods explicit, bounded, and cheap enough to help at long context.
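As a rough sketch of the staged partner idea, here is a minimal per-layer partner rule: a local causal window plus a single long-range partner whose stride doubles each layer. The xor partner rule, `window=1` default, and stage wrapping are illustrative assumptions, not the shipped schedule of any backend.

```python
import math

def staged_block_partners(i: int, layer: int, num_blocks: int, window: int = 1) -> set[int]:
    """Blocks that query block `i` attends to at `layer`: a local causal
    window plus one xor long-range partner whose stride doubles per layer
    (wrapping), with partners past block `i` dropped to stay causal."""
    num_stages = max(1, math.ceil(math.log2(num_blocks)))
    partners = set(range(max(0, i - window), i + 1))  # local neighborhood
    p = i ^ (1 << (layer % num_stages))               # deterministic long-range partner
    if p <= i:                                        # causal at block granularity
        partners.add(p)
    return partners
```

Because each layer adds at most one long-range partner on top of the window, per-layer degree stays bounded no matter how deep the network is.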
For contributor-facing implementation details, see docs/ARCHITECTURE.md.
Before asking whether Butterfly preserves model quality, the minimal topology question is:
- does a bounded-degree staged Butterfly schedule actually move information across the causal prefix fast enough to matter?
The repo now includes two CPU-only structural experiments that run directly on the real staged block layout used by the CUDA block-sparse path:
- primary support proof: scripts/experiment_butterfly_validity.py
- staged-vs-controls support proof (with secondary weighted diagnostics): scripts/experiment_butterfly_staging_validity.py
- artifacts:
Durable claims checked by this proof surface:
- per-layer degree stays bounded
- staged Butterfly reaches full causal-prefix support in logarithmic depth (`L = ceil(log2 N)`)
- staged Butterfly outperforms local-only and frozen-long-range controls on support expansion
Canonical public topology primitives for this proof surface live in bna.topology.butterfly.
Secondary (non-durable) diagnostics:
- weighted surrogate spread/conditioning readouts are reported for context only
- those diagnostics are not treated as general mixing guarantees
Current result summary for 8..128 blocks and partner rules xor, bit_reversal, and benes:
| Rule | Blocks | Last Block Full Reach | All-Block Prefix Reach | Max Degree | Butterfly Coverage At log2 N | Local-Only Coverage At log2 N |
|---|---|---|---|---|---|---|
| xor | 128 | 7 | 13 | 4 | 1.000 | 0.125 |
| bit_reversal | 128 | 6 | 13 | 4 | 1.000 | 0.125 |
| benes | 128 | 7 | 12 | 4 | 1.000 | 0.125 |
Interpretation:
- the core communication claim holds at the topology level
- the result is about information-flow capacity, not yet about perplexity or downstream quality
- this is the right “minimum viable proof” that Butterfly is not just a sparse mask, but a sparse mask with logarithmic-depth communication
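The log-depth support claim can be illustrated with a toy CPU simulation of staged support propagation. This is a reconstruction under stated assumptions (xor partner rule, `window=1`, stage stride doubling and wrapping), not the repo's actual kernel or schedule:

```python
import math

def last_block_full_reach_depth(num_blocks: int, window: int = 1) -> int:
    """Layers until the last block's accumulated support covers its whole
    causal prefix, applying one xor stage per layer (stride doubling,
    wrapping) on top of a local causal window."""
    # support[i] = set of blocks whose information has reached block i
    support = [set(range(max(0, i - window), i + 1)) for i in range(num_blocks)]
    num_stages = max(1, math.ceil(math.log2(num_blocks)))
    for depth in range(1, 4 * num_stages + 1):
        stage = 1 << ((depth - 1) % num_stages)
        support = [
            support[i]
            | (support[i ^ stage] if (i ^ stage) <= i else set())
            | set().union(*(support[j] for j in range(max(0, i - window), i)))
            for i in range(num_blocks)
        ]
        if support[-1] == set(range(num_blocks)):
            return depth
    raise RuntimeError("support never reached the full causal prefix")
```

Under these assumptions the last block reaches its full causal prefix in exactly `log2 N` layers for power-of-two block counts, matching the shape of the table above.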
Reproduce:
```
python scripts/experiment_butterfly_validity.py
python scripts/experiment_butterfly_staging_validity.py
pytest -q tests/pytorch/test_wayfinder_topology.py \
    tests/pytorch/test_wayfinder_staging_validity.py \
    tests/pytorch/test_wayfinder_operator_mixing.py
```

The clearest in-repo release evidence today is the GLM-4.7-Flash-4bit stable profile documented in docs/FIRST_RELEASE.md.
At seq_len=8192 and decode_len=32 on the validated MLX path:
| Mode | E2E | Prefill | Decode tok/s | Peak memory |
|---|---|---|---|---|
| Dense | 17.15s | 16.36s | 40.58 | 20.66 GB |
| Butterfly | 10.56s | 9.75s | 39.85 | 20.07 GB |
| Delta vs dense | -38.44% | -40.38% | -1.79% | -2.85% |
That is the safest benchmark slice to cite publicly from this tree today.
The repo also contains experimental CUDA benchmark results for a Triton block-sparse path on Qwen 3.5 9B, where 8 of 32 layers are replaced and the remaining DeltaNet layers stay untouched.
| Context | Dense tok/s | Butterfly tok/s | Top-1 agreement |
|---|---|---|---|
| 4,096 | — | — | 99.88% |
| 8,192 | 1,651 | 1,698 | — |
| 16,384 | — | — | 94.44% |
| 32,768 | 1,585 | 1,688 | — |
| 65,536 | 1,475 | 1,724 | — |
| 98,304 | 1,413 | 1,660 | — |
| 131,072 | 1,365 | 1,667 | — |
| 262,144 | 1,257 | 1,712 | — |
These numbers suggest flatter throughput than dense attention at long context, but this path should still be treated as experimental until the quality and support boundaries are documented as tightly as the GLM release path.
| Context | Dense tok/s | Butterfly tok/s |
|---|---|---|
| 8,192 | 931 | 954 |
| 32,768 | 1,280 | 1,301 |
| 65,536 | 1,241 | 1,326 |
| 131,072 | 1,131 | 1,331 |
| 163,840 | — | 1,306 |
| 196,608 | — | 1,364 |
| 229,376 | — | 1,233 |
MLX permute-window path with K6 fused Metal kernel, window=64. 8 of 32 attention layers are replaced. Model: mlx-community/Qwen3.5-9B-MLX-4bit.
| Context | Dense TTFT | Butterfly TTFT | Dense tok/s | Butterfly tok/s | Peak memory |
|---|---|---|---|---|---|
| 2,048 | 71 ms | 49 ms | 62.2 | 62.0 | 7.1 GB |
| 8,192 | 116 ms | 86 ms | 57.2 | 58.8 | 9.9 GB |
| 32,768 | 100 ms | 99 ms | 49.6 | 47.1 | 13.7 GB |
| 65,536 | 160 ms | 202 ms | 41.5 | 39.8 | 18.9 GB |
| 98,304 | 2.0 s | 1.2 s | 17.2 | 22.4 | 24.0 GB |
| 131,072 | 6.9 s | 7.5 s | 7.3 | 6.8 | 29.1 GB |
| 163,840 | 26.8 s | 21.5 s | 2.2 | 2.7 | 34.2 GB |
This MLX path uses chunked-gather plus native SDPA for prefill and a fused Metal kernel for decode. It shows wins at short context and again near the memory wall, but it is still an experimental path rather than a validated public release.
Top-1 agreement in the Qwen 9B experiments is 99.88% at 4K and 94.44% at 16K. Perplexity and downstream evaluation are still in progress, so avoid treating these tables as universal quality-parity claims.
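Top-1 agreement here means the fraction of positions where the dense and sparse runs pick the same greedy next token. A minimal sketch of the metric (plain-Python, not the repo's evaluation code):

```python
def top1_agreement(logits_a: list[list[float]], logits_b: list[list[float]]) -> float:
    """Fraction of positions where the greedy (argmax) token from two runs
    agrees, e.g. dense vs. Butterfly logits over the same prompt."""
    assert len(logits_a) == len(logits_b) and logits_a

    def argmax(row: list[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    hits = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return hits / len(logits_a)
```

Note that top-1 agreement only checks greedy decoding parity; it says nothing about the full output distribution, which is why perplexity and downstream evals are still needed.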
```
git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[dev,kernels]"
```

Validated public path:

```
./scripts/run_public_stable_profile_glm.sh
```

Experimental Qwen CUDA benchmark:
```
python scripts/bench_qwen35_cuda_wayfinder.py \
    --model-path <path-to-Qwen3.5-9B> \
    --path block_sparse \
    --engine triton \
    --block-size 128 \
    --seq-lens 4096 8192 16384 32768
```

MLX setup:

```
git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[mlx]"
pip install mlx-lm zmlx
```

Environment check:

```
python scripts/env_check_mlx.py
```

Experimental Qwen MLX benchmark:
```
python scripts/bench_qwen_consumer_mlx.py \
    --model-path mlx-community/Qwen3.5-9B-MLX-4bit \
    --mode wayfinder \
    --seq-lens 2048 8192 32768 \
    --decode-len 256 \
    --repeats 3 \
    --out-dir results/benchmarks/my_run
```

The `--mode dense` flag runs the stock attention baseline for comparison. Add `--skip-quality` to benchmark only throughput.
Optional MLX-native KV-cache trial for decode-path evaluation:
```
python scripts/bench_qwen_consumer_mlx.py \
    --model-path /Volumes/VIXinSSD/models/Qwen3.5-4B-MLX-4bit \
    --mode butterfly \
    --butterfly-decode-backend stock \
    --seq-lens 2048 8192 \
    --decode-len 8 \
    --repeats 1 \
    --chunk-size 384 \
    --query-chunk-size 384 \
    --kv-bits 4 \
    --kv-group-size 64 \
    --quantized-kv-start 0 \
    --skip-multi-turn \
    --skip-quality \
    --hf-offline \
    --out-dir results/benchmarks/qwen35_4b_mlx/kv4_trial
```

Notes:
- In `--mode butterfly`, keep `--chunk-size <= --query-chunk-size`. The benchmark now rejects invalid settings because later prefill chunks would otherwise fall back to stock attention.
- The MLX KV quantization prototype reuses MLX-LM cache quantization on the full-attention layers only. Butterfly prefill remains dense; the working KV cache is quantized after prefill and before stock decode.
- Current Qwen 3.5 4B MLX interpretation and reporting package: docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md
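The chunking constraint in the notes amounts to a simple configuration guard. This is an illustrative sketch of such a check, not the benchmark's actual code:

```python
def validate_prefill_chunking(chunk_size: int, query_chunk_size: int) -> None:
    """Reject configs where later prefill chunks would silently fall back
    to stock attention (illustrative mirror of the benchmark's guard)."""
    if chunk_size > query_chunk_size:
        raise ValueError(
            f"--chunk-size ({chunk_size}) must be <= "
            f"--query-chunk-size ({query_chunk_size}) in --mode butterfly"
        )
```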
```
pytest
ruff check bna tests
```

| Path | What it is |
|---|---|
| `bna/` | Core package and backend integrations |
| `scripts/` | Benchmarks, diagnostics, serving helpers, and figure generation |
| `docs/` | Contributor-facing architecture, release evidence, and research notes |
| `benchmarks/`, `results/` | Raw benchmark outputs and summaries |
| `notes/` | Lab notebook, experiment log, handoff prompts, and planning material |
| `archive/` | Older exploratory code and preserved artifacts |
- docs/FIRST_RELEASE.md: validated benchmark slice and reproduction commands
- docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md: Butterfly-first Qwen 3.5 4B MLX benchmark report and long-context interpretation
- docs/ARCHITECTURE.md: contributor-facing implementation map
- docs/APPLE_SILICON_SETUP.md: Apple Silicon bootstrap, llama.cpp Metal baseline, model catalog
- CONTRIBUTING.md: expectations for docs, claims, and performance changes
MIT
