Cannot run Qwen3.6-27B on RTX 4090 24 GB: server.py loads both models eagerly, bench protocol works #77

@n00b001

Description

Summary

The README claims Qwen3.6-27B Q4_K_M at 128K context on a single RTX 3090 (24 GB). I cannot reproduce this through server.py. The benchmark (pflash/tests/bench_niah_cpp.py) works only because it uses a park/unpark dance and sets --max-ctx 16384, sized for the compressed prompt rather than the full source.

server.py loads both target + draft at startup, keeps them resident, and allocates KV cache eagerly based on --max-ctx. This leaves no headroom on a 24 GB card regardless of flag combinations.

Hardware

  • GPU: NVIDIA GeForce RTX 4090, 24564 MiB VRAM, SM 8.9
  • CUDA: 13.2, nvcc 13.2.78
  • Driver: 595.71.05
  • Target: Qwen3.6-27B-Q4_K_M.gguf (17 GB on disk)
  • Draft: z-lab/Qwen3.6-27B-DFlash model.safetensors
  • Drafter: Qwen3-0.6B-BF16.gguf

The gap: bench vs server

The benchmark (pflash/tests/bench_niah_cpp.py) does this per-request dance:

# 1. Compress with drafter (daemon idles at ~3 GB after parking target+draft)
compressed_ids = dflash.compress(ids, args.keep_ratio, args.drafter_gguf)

# 2. Free drafter, restore target+draft for generation
dflash.free_drafter()       # release drafter weights + KV + BSA scratch
dflash.unpark_target()      # ~16 GB
dflash.unpark_draft()       # +draft weights for spec decode

# 3. Generate on compressed prompt (fits in small max_ctx)
out_ids = dflash.generate(target_ids, args.n_gen)

# 4. Re-park draft for next iteration
dflash.park_draft()

Key: --max-ctx 16384 is sized for the compressed prompt, not the source. After compressing 128K → 2.5K tokens (keep_ratio=0.02), generation fits with minimal KV cache.
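
For concreteness, the token arithmetic behind the bench's --max-ctx choice looks roughly like this (keep_ratio comes from the run above; the generation budget of 512 is an assumed figure, and the real KV footprint also depends on the model's layer/head layout):

# Back-of-the-envelope sizing for the bench configuration (illustrative only).
source_tokens = 131072                        # 128K source prompt
keep_ratio = 0.02                             # bench setting used above
compressed = int(source_tokens * keep_ratio)  # ~2621 tokens after compression
n_gen = 512                                   # assumed generation budget
max_ctx = 16384                               # what the bench passes to the daemon

# The compressed prompt plus generation fits comfortably inside max_ctx,
# so the eagerly allocated KV cache stays small.
assert compressed + n_gen <= max_ctx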

server.py does none of this:

  • Loads target + draft at startup, keeps them resident forever
  • Allocates max_ctx KV cache eagerly on daemon spawn
  • Has no park/unpark between prefill and generate
  • --budget 0 does not skip draft loading (clamped to 64 in C++ config)

Configurations tested via server.py

All failed with OOM. Each run reached [daemon] ready, then crashed on the first request (a ~20-23K-token prompt).

Boot log (common)

[target] target loaded: 851 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
[draft]  loaded
[daemon] ready

Target + draft = ~20 GB VRAM. Only ~4 GB headroom.

1. PFlash on, max-ctx 200K

python scripts/server.py --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft --bin build/test_dflash \
  --max-ctx 200000 --budget 16 --fa-window 0 \
  --prefill-compression auto --prefill-drafter models/Qwen3-0.6B-BF16.gguf

Result: OOM during target prefill intermediates (before PFlash even fires)

[prompt] 16904 tokens
[prefill] token-seg ubatch=384
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1215.29 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1274320896
prefill build @src/qwen3_0p6b_loader.cpp

2. PFlash on, max-ctx 131K, larger FA window

--max-ctx 131072 --budget 22 --fa-window 32000 --prefill-compression auto ...

Result: Prefill succeeded but rollback cache OOM

[prefill] 20042 tokens in 15.23 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1854.38 MiB on device 0: cudaMalloc failed: out of memory
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

3. PFlash off, budget=0 (no spec-decode), max-ctx 65K

--max-ctx 65536 --budget 0 --fa-window 16384 --prefill-compression off

Result: Prefill succeeded but rollback cache OOM (~5 GB)

[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 21051 tokens in 16.30 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

4. PFlash off, budget=0, max-ctx 24K (smallest tested)

--max-ctx 24000 --budget 0 --fa-window 16384 --prefill-compression off

Result: Same ~5 GB rollback OOM — identical allocation size regardless of max-ctx

[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 23071 tokens in 17.96 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

Bugs observed

  1. --budget 0 does not disable draft loading. The C++ config clamps it to budget=64, so the draft model (~5 GB) is still loaded: dead VRAM even though spec-decode is effectively off. A possible Python-side guard is sketched after this list.

  2. Rollback cache allocated unconditionally. The allocation is ~5 GB and identical across max-ctx 65536 and 24000, different prompt lengths, and an effective budget of 0 (no draft tokens in play).

  3. server.py does not implement the park/unpark dance that makes 24 GB cards work. The benchmark proves it is possible — but requires the two-phase compress → generate protocol from bench_niah_cpp.py.
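
For bug 1, a minimal sketch of what a Python-side guard in server.py could look like. The --no-draft flag is hypothetical, and the C++ daemon would also need to stop clamping budget to 64 and tolerate a missing draft; this only shows the intent:

import argparse

# Hypothetical flags for server.py (sketch only; not in the current script).
parser = argparse.ArgumentParser()
parser.add_argument("--draft", default=None)
parser.add_argument("--budget", type=int, default=16)
parser.add_argument("--no-draft", action="store_true",
                    help="skip draft loading entirely; implies budget=0")
args = parser.parse_args()

if args.no_draft or args.budget <= 0:
    args.budget = 0    # keep the Python-side value at 0 instead of letting the daemon clamp it
    args.draft = None  # never pass a draft path, so ~5 GB of draft weights are never loaded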

What works

The bench harness at pflash/tests/bench_niah_cpp.py completes successfully:

  • Spawns daemon with --max-ctx 16384 (sized for compressed prompt)
  • Parks target+draft → compresses with drafter → frees drafter
  • Unparks target+draft → generates on compressed prompt → parks draft
  • 128K source → 2.5K compressed → fits in small KV cache

Request

One (or more) of the following:

  1. Make server.py implement the park/unpark dance for PFlash requests, matching the bench protocol (a rough sketch is included at the end of this issue)
  2. Add a --no-draft flag to skip draft loading when budget=0
  3. Skip rollback cache allocation when budget ≤ 0
  4. Document that server.py does not support 24 GB cards and provide an alternative entry point

The benchmark proves the algorithm works on 24 GB — but only with careful VRAM sequencing that server.py does not implement.
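
For reference, request 1 could look roughly like the sketch below, reusing the dflash calls from the bench. handle_request is illustrative, dflash.park_target is assumed to exist as the counterpart of unpark_target, and the bench actually assembles its generation input (target_ids) from the compressed prompt, which is glossed over here:

def handle_request(ids, n_gen, keep_ratio, drafter_gguf):
    # Phase 1: park the big models and compress the long prompt with the drafter.
    dflash.park_target()            # assumed counterpart of unpark_target()
    dflash.park_draft()
    compressed_ids = dflash.compress(ids, keep_ratio, drafter_gguf)
    dflash.free_drafter()           # release drafter weights + KV + BSA scratch

    # Phase 2: restore target + draft and generate on the short compressed prompt.
    dflash.unpark_target()
    dflash.unpark_draft()
    out_ids = dflash.generate(compressed_ids, n_gen)  # bench builds target_ids from the compressed prompt

    # Phase 3: park the draft again so the next request can start with a compression pass.
    dflash.park_draft()
    return out_ids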
