Summary
The README claims Qwen3.6-27B Q4_K_M at 128K context on a single RTX 3090 (24 GB). I cannot reproduce this through server.py. The benchmark (pflash/tests/bench_niah_cpp.py) works because it uses a park/unpark dance and sets --max-ctx 16384, sized for the compressed prompt rather than the full source.
server.py loads both target + draft at startup, keeps them resident, and allocates KV cache eagerly based on --max-ctx. This leaves no headroom on a 24 GB card regardless of flag combinations.
Hardware
- GPU: NVIDIA GeForce RTX 4090, 24564 MiB VRAM, SM 8.9
- CUDA: 13.2, nvcc 13.2.78
- Driver: 595.71.05
- Target: Qwen3.6-27B-Q4_K_M.gguf (17 GB on disk)
- Draft: z-lab/Qwen3.6-27B-DFlash model.safetensors
- Drafter: Qwen3-0.6B-BF16.gguf
The gap: bench vs server
The benchmark (pflash/tests/bench_niah_cpp.py) does this per-request dance:
```python
# 0. Park target+draft so only the drafter is resident (daemon idles at ~3 GB)
dflash.park_target()
dflash.park_draft()
# 1. Compress with drafter
compressed_ids = dflash.compress(ids, args.keep_ratio, args.drafter_gguf)
# 2. Free drafter, restore target+draft for generation
dflash.free_drafter()   # release drafter weights + KV + BSA scratch
dflash.unpark_target()  # ~16 GB
dflash.unpark_draft()   # +draft weights for spec decode
# 3. Generate on compressed prompt (fits in small max_ctx)
out_ids = dflash.generate(compressed_ids, args.n_gen)
# 4. Re-park draft for next iteration
dflash.park_draft()
```
Key: --max-ctx 16384 is sized for the compressed prompt, not the source. After compressing 128K → 2.5K tokens (keep_ratio=0.02), generation fits with minimal KV cache.
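For intuition, a back-of-envelope KV-cache estimate shows why this sizing matters. The layer/head dimensions below are illustrative assumptions, not the actual Qwen3.6-27B config, but the scaling is the point: a KV cache for the full context cannot sit next to ~20 GB of weights on a 24 GB card, while one for the compressed prompt is cheap.
```python
# Rough KV-cache sizing. N_LAYERS / N_KV_HEADS / HEAD_DIM are illustrative
# assumptions, NOT the real Qwen3.6-27B config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
KV_BYTES = 2  # fp16 K and V entries

def kv_gib(n_ctx: int) -> float:
    # 2x for K and V, per layer, per KV head, per head dim, per token
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES * n_ctx / 2**30

print(f"{kv_gib(131072):.1f} GiB @ 131072 ctx")           # ~24.0 GiB: hopeless
print(f"{kv_gib(16384):.1f} GiB @ 16384 ctx")             # ~3.0 GiB: fits
print(f"compressed prompt: {int(131072 * 0.02)} tokens")  # 2621, the ~2.5K above
```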
server.py does none of this:
- Loads target + draft at startup, keeps them resident forever
- Allocates max_ctx KV cache eagerly on daemon spawn
- Has no park/unpark between prefill and generate
--budget 0 does not skip draft loading (clamped to 64 in C++ config)
Configurations tested via server.py
All failed with OOM. Each run reached [daemon] ready, then crashed on the first request (~20-23K token prompt).
Boot log (common)
```
[target] target loaded: 851 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
[draft] loaded
[daemon] ready
```
Target + draft = ~20 GB VRAM. Only ~4 GB headroom.
1. PFlash on, max-ctx 200K
```bash
python scripts/server.py --target models/Qwen3.6-27B-Q4_K_M.gguf \
    --draft models/draft --bin build/test_dflash \
    --max-ctx 200000 --budget 16 --fa-window 0 \
    --prefill-compression auto --prefill-drafter models/Qwen3-0.6B-BF16.gguf
```
Result: OOM during target prefill intermediates (before PFlash even fires)
```
[prompt] 16904 tokens
[prefill] token-seg ubatch=384
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1215.29 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1274320896
prefill build @src/qwen3_0p6b_loader.cpp
```
2. PFlash on, max-ctx 131K, larger FA window
```bash
--max-ctx 131072 --budget 22 --fa-window 32000 --prefill-compression auto ...
```
Result: Prefill succeeded but rollback cache OOM
```
[prefill] 20042 tokens in 15.23 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1854.38 MiB on device 0: cudaMalloc failed: out of memory
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache
```
3. PFlash off, budget=0 (no spec-decode), max-ctx 65K
```bash
--max-ctx 65536 --budget 0 --fa-window 16384 --prefill-compression off
```
Result: Prefill succeeded but rollback cache OOM (~5 GB)
```
[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 21051 tokens in 16.30 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache
```
4. PFlash off, budget=0, max-ctx 24K (smallest tested)
```bash
--max-ctx 24000 --budget 0 --fa-window 16384 --prefill-compression off
```
Result: Same ~5 GB rollback OOM — identical allocation size regardless of max-ctx
```
[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 23071 tokens in 17.96 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache
```
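The failed allocation in runs 3 and 4 is identical down to the byte despite max-ctx 65536 vs 24000, so the rollback cache cannot be sized from max-ctx. A one-liner confirms the byte count matches the MiB figure in both logs:
```python
# 5197922304 bytes appears verbatim in both run 3 and run 4,
# independent of --max-ctx (65536 vs 24000) and prompt length.
print(5197922304 / 2**20)  # 4957.125 -> the "4957.12 MiB" in both logs
```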
Bugs observed
- --budget 0 does not disable draft loading. The C++ config clamps to budget=64, so the draft model (~5 GB) is still loaded → dead VRAM when spec-decode is effectively off.
- Rollback cache is allocated unconditionally. ~5 GB regardless of max-ctx, budget, or prompt length. Happens even with effective budget=0 (no draft tokens).
- server.py does not implement the park/unpark dance that makes 24 GB cards work. The benchmark proves it is possible, but only via the two-phase compress → generate protocol from bench_niah_cpp.py.
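For the first two bugs, the fix could be a guard at startup. A minimal sketch, assuming hypothetical names (load_draft and alloc_rollback_cache are illustrative, not the repo's actual API):
```python
# Hypothetical startup guard: honor --budget 0 instead of clamping to 64.
# load_draft / alloc_rollback_cache are illustrative names only.
if args.budget > 0:
    load_draft(args.draft)             # ~5 GB of draft weights
    alloc_rollback_cache(args.budget)  # only needed for spec-decode rollback
else:
    print("[cfg] budget<=0: skipping draft load and rollback cache (~10 GB saved)")
```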
What works
The bench harness at pflash/tests/bench_niah_cpp.py completes successfully:
- Spawns the daemon with --max-ctx 16384 (sized for the compressed prompt)
- Parks target+draft → compresses with drafter → frees drafter
- Unparks target+draft → generates on compressed prompt → parks draft
- 128K source → 2.5K compressed → fits in the small KV cache
Request
Any of the following would resolve this:
- Make server.py implement the park/unpark dance for PFlash requests, matching the bench protocol (a sketch follows below)
- Add a --no-draft flag to skip draft loading when budget=0
- Skip rollback cache allocation when budget ≤ 0
- Document that server.py does not support 24 GB cards and provide an alternative entry point
The benchmark proves the algorithm works on 24 GB — but only with careful VRAM sequencing that server.py does not implement.
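For concreteness, here is a rough sketch of the first option: a two-phase request handler for server.py reusing the dflash calls the bench already exercises. handle_request, tokenize, and detokenize are hypothetical names; everything else mirrors bench_niah_cpp.py:
```python
# Sketch only: a park/unpark-aware request handler for server.py.
# handle_request / tokenize / detokenize are hypothetical; the dflash
# calls are the ones bench_niah_cpp.py already uses per request.
def handle_request(dflash, args, prompt: str, n_gen: int) -> str:
    ids = tokenize(prompt)
    # Phase 1: park target+draft so only the drafter is resident (~3 GB)
    dflash.park_target()
    dflash.park_draft()
    compressed_ids = dflash.compress(ids, args.keep_ratio, args.drafter_gguf)
    dflash.free_drafter()    # release drafter weights + KV + BSA scratch
    # Phase 2: restore target (+ draft for spec decode); KV cache is sized
    # by the small --max-ctx, since only the compressed prompt remains
    dflash.unpark_target()   # ~16 GB
    dflash.unpark_draft()
    out_ids = dflash.generate(compressed_ids, n_gen)
    dflash.park_draft()      # keep headroom for the next request
    return detokenize(out_ids)
```
The invariant is that target weights and drafter scratch are never resident at the same time, which is what keeps peak VRAM under 24 GB.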