Butterfly Network Attention (BNA) is a training-free sparse-attention runtime for long-context inference, aimed at engineers who want measurable speed or memory wins without retraining the model.
- A PyTorch package, `bna`, for sparse-attention research and integration work
- CUDA and MLX benchmark scripts for Qwen, GLM, GPT-2, and related paths
- Measured benchmark artifacts under `benchmarks/`, `results/`, and `notes/`
- Older docs and scripts that still use the legacy names `Wayfinder` and `HCSA`
Public naming note: Butterfly / BNA is the current public project name. Wayfinder / HCSA are legacy names still present in deeper docs, scripts, benchmark artifact paths, and archived research material.
| Tier | What to trust | Evidence |
|---|---|---|
| Validated | GLM-4.7-Flash-4bit on MLX at the public stable profile | docs/FIRST_RELEASE.md |
| Experimental | Qwen 3.5 CUDA block-sparse path and long-context scaling work | scripts/bench_qwen35_cuda_wayfinder.py, benchmarks/cuda/qwen35_wayfinder/ |
| Experimental | Qwen 3.5 MLX / Apple Silicon path | scripts/bench_qwen_consumer_mlx.py, docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md, results/benchmarks/ |
| Research / archive | Older Wayfinder/HCSA docs, prompts, and exploratory runs | docs/, notes/, archive/ |
If you are new to the project, start from the validated GLM path first. The Qwen work is promising, but it should still be read as active engineering rather than a locked public release.
Dense causal attention does O(T^2) work per layer. Butterfly replaces that with a bounded sparse pattern over fixed-size token blocks.
At a high level, each block attends to:
- its local neighborhood
- a small number of deterministic long-range partners
- optional global or anchor-style connections, depending on the backend
The exact sparse pattern differs across code paths. Older Wayfinder/HCSA integrations describe this as window + cycle + landmarks; the current Butterfly README uses the simpler butterfly-partner framing. In both cases the goal is the same: keep attention neighborhoods explicit, bounded, and cheap enough to help at long context.
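As a rough sketch of the staged partner idea, here is a minimal per-layer partner rule: a local causal window plus a single long-range partner whose stride doubles each layer. The xor partner rule, `window=1` default, and stage wrapping are illustrative assumptions, not the shipped schedule of any backend.

```python
import math

def staged_block_partners(i: int, layer: int, num_blocks: int, window: int = 1) -> set[int]:
    """Blocks that query block `i` attends to at `layer`: a local causal
    window plus one xor long-range partner whose stride doubles per layer
    (wrapping), with partners past block `i` dropped to stay causal."""
    num_stages = max(1, math.ceil(math.log2(num_blocks)))
    partners = set(range(max(0, i - window), i + 1))  # local neighborhood
    p = i ^ (1 << (layer % num_stages))               # deterministic long-range partner
    if p <= i:                                        # causal at block granularity
        partners.add(p)
    return partners
```

Because each layer adds at most one long-range partner on top of the window, per-layer degree stays bounded no matter how deep the network is.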
For contributor-facing implementation details, see docs/ARCHITECTURE.md.
Before asking whether Butterfly preserves model quality, the minimal topology question is:
- does a bounded-degree staged Butterfly schedule actually move information across the causal prefix fast enough to matter?
The repo now includes two CPU-only structural experiments that run directly on the real staged block layout used by the CUDA block-sparse path:
- primary support proof: scripts/experiment_butterfly_validity.py
- staged-vs-controls support proof (with secondary weighted diagnostics): scripts/experiment_butterfly_staging_validity.py
- artifacts:
Durable claims checked by this proof surface:
- per-layer degree stays bounded
- staged Butterfly reaches full causal-prefix support in logarithmic depth (`L = ceil(log2 N)`)
- staged Butterfly outperforms local-only and frozen-long-range controls on support expansion
Canonical public topology primitives for this proof surface live in bna.topology.butterfly.
Secondary (non-durable) diagnostics:
- weighted surrogate spread/conditioning readouts are reported for context only
- those diagnostics are not treated as general mixing guarantees
Current result summary for 8..128 blocks and partner rules xor, bit_reversal, and benes:
| Rule | Blocks | Last Block Full Reach | All-Block Prefix Reach | Max Degree | Butterfly Coverage At log2 N | Local-Only Coverage At log2 N |
|---|---|---|---|---|---|---|
| xor | 128 | 7 | 13 | 4 | 1.000 | 0.125 |
| bit_reversal | 128 | 6 | 13 | 4 | 1.000 | 0.125 |
| benes | 128 | 7 | 12 | 4 | 1.000 | 0.125 |
Interpretation:
- the core communication claim holds at the topology level
- the result is about information-flow capacity, not yet about perplexity or downstream quality
- this is the right “minimum viable proof” that Butterfly is not just a sparse mask, but a sparse mask with logarithmic-depth communication
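The log-depth support claim can be illustrated with a toy CPU simulation of staged support propagation. This is a reconstruction under stated assumptions (xor partner rule, `window=1`, stage stride doubling and wrapping), not the repo's actual kernel or schedule:

```python
import math

def last_block_full_reach_depth(num_blocks: int, window: int = 1) -> int:
    """Layers until the last block's accumulated support covers its whole
    causal prefix, applying one xor stage per layer (stride doubling,
    wrapping) on top of a local causal window."""
    # support[i] = set of blocks whose information has reached block i
    support = [set(range(max(0, i - window), i + 1)) for i in range(num_blocks)]
    num_stages = max(1, math.ceil(math.log2(num_blocks)))
    for depth in range(1, 4 * num_stages + 1):
        stage = 1 << ((depth - 1) % num_stages)
        support = [
            support[i]
            | (support[i ^ stage] if (i ^ stage) <= i else set())
            | set().union(*(support[j] for j in range(max(0, i - window), i)))
            for i in range(num_blocks)
        ]
        if support[-1] == set(range(num_blocks)):
            return depth
    raise RuntimeError("support never reached the full causal prefix")
```

Under these assumptions the last block reaches its full causal prefix in exactly `log2 N` layers for power-of-two block counts, matching the shape of the table above.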
Reproduce:
```
python scripts/experiment_butterfly_validity.py
python scripts/experiment_butterfly_staging_validity.py
pytest -q tests/pytorch/test_wayfinder_topology.py \
    tests/pytorch/test_wayfinder_staging_validity.py \
    tests/pytorch/test_wayfinder_operator_mixing.py
```

The clearest in-repo release evidence today is the GLM-4.7-Flash-4bit stable profile documented in docs/FIRST_RELEASE.md.
At seq_len=8192 and decode_len=32 on the validated MLX path:
| Mode | E2E | Prefill | Decode tok/s | Peak memory |
|---|---|---|---|---|
| Dense | 17.15s | 16.36s | 40.58 | 20.66 GB |
| Butterfly | 10.56s | 9.75s | 39.85 | 20.07 GB |
| Delta vs dense | -38.44% | -40.38% | -1.79% | -2.85% |
That is the safest benchmark slice to cite publicly from this tree today.
The repo also contains experimental CUDA benchmark results for a Triton block-sparse path on Qwen 3.5 9B, where 8 of 32 layers are replaced and the remaining DeltaNet layers stay untouched.
| Context | Dense tok/s | Butterfly tok/s | Top-1 agreement |
|---|---|---|---|
| 4,096 | — | — | 99.88% |
| 8,192 | 1,651 | 1,698 | — |
| 16,384 | — | — | 94.44% |
| 32,768 | 1,585 | 1,688 | — |
| 65,536 | 1,475 | 1,724 | — |
| 98,304 | 1,413 | 1,660 | — |
| 131,072 | 1,365 | 1,667 | — |
| 262,144 | 1,257 | 1,712 | — |
These numbers suggest flatter throughput than dense attention at long context, but this path should still be treated as experimental until the quality and support boundaries are documented as tightly as the GLM release path.
| Context | Dense tok/s | Butterfly tok/s |
|---|---|---|
| 8,192 | 931 | 954 |
| 32,768 | 1,280 | 1,301 |
| 65,536 | 1,241 | 1,326 |
| 131,072 | 1,131 | 1,331 |
| 163,840 | — | 1,306 |
| 196,608 | — | 1,364 |
| 229,376 | — | 1,233 |
MLX permute-window path with K6 fused Metal kernel, window=64. 8 of 32 attention layers are replaced. Model: mlx-community/Qwen3.5-9B-MLX-4bit.
| Context | Dense TTFT | Butterfly TTFT | Dense tok/s | Butterfly tok/s | Peak memory |
|---|---|---|---|---|---|
| 2,048 | 71 ms | 49 ms | 62.2 | 62.0 | 7.1 GB |
| 8,192 | 116 ms | 86 ms | 57.2 | 58.8 | 9.9 GB |
| 32,768 | 100 ms | 99 ms | 49.6 | 47.1 | 13.7 GB |
| 65,536 | 160 ms | 202 ms | 41.5 | 39.8 | 18.9 GB |
| 98,304 | 2.0 s | 1.2 s | 17.2 | 22.4 | 24.0 GB |
| 131,072 | 6.9 s | 7.5 s | 7.3 | 6.8 | 29.1 GB |
| 163,840 | 26.8 s | 21.5 s | 2.2 | 2.7 | 34.2 GB |
This MLX path uses chunked-gather plus native SDPA for prefill and a fused Metal kernel for decode. It shows wins at short context and again near the memory wall, but it is still an experimental path rather than a validated public release.
Top-1 agreement in the Qwen 9B experiments is 99.88% at 4K and 94.44% at 16K. Perplexity and downstream evaluation are still in progress, so avoid treating these tables as universal quality-parity claims.
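Top-1 agreement here means the fraction of positions where the dense and sparse runs pick the same greedy next token. A minimal sketch of the metric (plain-Python, not the repo's evaluation code):

```python
def top1_agreement(logits_a: list[list[float]], logits_b: list[list[float]]) -> float:
    """Fraction of positions where the greedy (argmax) token from two runs
    agrees, e.g. dense vs. Butterfly logits over the same prompt."""
    assert len(logits_a) == len(logits_b) and logits_a

    def argmax(row: list[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    hits = sum(argmax(a) == argmax(b) for a, b in zip(logits_a, logits_b))
    return hits / len(logits_a)
```

Note that top-1 agreement only checks greedy decoding parity; it says nothing about the full output distribution, which is why perplexity and downstream evals are still needed.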
```
git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[dev,kernels]"
```

Validated public path:

```
./scripts/run_public_stable_profile_glm.sh
```

Experimental Qwen CUDA benchmark:
```
python scripts/bench_qwen35_cuda_wayfinder.py \
    --model-path <path-to-Qwen3.5-9B> \
    --path block_sparse \
    --engine triton \
    --block-size 128 \
    --seq-lens 4096 8192 16384 32768
```

MLX setup:

```
git clone https://github.com/Hmbown/Butterfly.git
cd Butterfly
pip install -e ".[mlx]"
pip install mlx-lm zmlx
```

Environment check:

```
python scripts/env_check_mlx.py
```

Experimental Qwen MLX benchmark:
```
python scripts/bench_qwen_consumer_mlx.py \
    --model-path mlx-community/Qwen3.5-9B-MLX-4bit \
    --mode wayfinder \
    --seq-lens 2048 8192 32768 \
    --decode-len 256 \
    --repeats 3 \
    --out-dir results/benchmarks/my_run
```

The `--mode dense` flag runs the stock attention baseline for comparison. Add `--skip-quality` to benchmark only throughput.
Optional MLX-native KV-cache trial for decode-path evaluation:
```
python scripts/bench_qwen_consumer_mlx.py \
    --model-path /Volumes/VIXinSSD/models/Qwen3.5-4B-MLX-4bit \
    --mode butterfly \
    --butterfly-decode-backend stock \
    --seq-lens 2048 8192 \
    --decode-len 8 \
    --repeats 1 \
    --chunk-size 384 \
    --query-chunk-size 384 \
    --kv-bits 4 \
    --kv-group-size 64 \
    --quantized-kv-start 0 \
    --skip-multi-turn \
    --skip-quality \
    --hf-offline \
    --out-dir results/benchmarks/qwen35_4b_mlx/kv4_trial
```

Notes:
- In `--mode butterfly`, keep `--chunk-size <= --query-chunk-size`. The benchmark now rejects invalid settings because later prefill chunks would otherwise fall back to stock attention.
- The MLX KV quantization prototype reuses MLX-LM cache quantization on the full-attention layers only. Butterfly prefill remains dense; the working KV cache is quantized after prefill and before stock decode.
- Current Qwen 3.5 4B MLX interpretation and reporting package: docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md
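The chunking constraint in the notes amounts to a simple configuration guard. This is an illustrative sketch of such a check, not the benchmark's actual code:

```python
def validate_prefill_chunking(chunk_size: int, query_chunk_size: int) -> None:
    """Reject configs where later prefill chunks would silently fall back
    to stock attention (illustrative mirror of the benchmark's guard)."""
    if chunk_size > query_chunk_size:
        raise ValueError(
            f"--chunk-size ({chunk_size}) must be <= "
            f"--query-chunk-size ({query_chunk_size}) in --mode butterfly"
        )
```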
```
pytest
ruff check bna tests
```

| Path | What it is |
|---|---|
| `bna/` | Core package and backend integrations |
| `scripts/` | Benchmarks, diagnostics, serving helpers, and figure generation |
| `docs/` | Contributor-facing architecture, release evidence, and research notes |
| `benchmarks/`, `results/` | Raw benchmark outputs and summaries |
| `notes/` | Lab notebook, experiment log, handoff prompts, and planning material |
| `archive/` | Older exploratory code and preserved artifacts |
- docs/FIRST_RELEASE.md: validated benchmark slice and reproduction commands
- docs/QWEN35_4B_MLX_BENCHMARK_REPORT.md: Butterfly-first Qwen 3.5 4B MLX benchmark report and long-context interpretation
- docs/ARCHITECTURE.md: contributor-facing implementation map
- docs/APPLE_SILICON_SETUP.md: Apple Silicon bootstrap, llama.cpp Metal baseline, model catalog
- CONTRIBUTING.md: expectations for docs, claims, and performance changes
MIT
