Cannot run Qwen3.6-27B on RTX 4090 24 GB: server.py loads both models eagerly, bench protocol works #77

@n00b001

Description

Summary

The README claims Qwen3.6-27B Q4_K_M at 128K context on a single RTX 3090 (24 GB). I cannot reproduce this through server.py. The benchmark (pflash/tests/bench_niah_cpp.py) works only because it uses a park/unpark dance and sets --max-ctx 16384, sized for the compressed prompt rather than the full source.

server.py loads both target + draft at startup, keeps them resident, and allocates KV cache eagerly based on --max-ctx. This leaves no headroom on a 24 GB card regardless of flag combinations.

Hardware

  • GPU: NVIDIA GeForce RTX 4090, 24564 MiB VRAM, SM 8.9
  • CUDA: 13.2, nvcc 13.2.78
  • Driver: 595.71.05
  • Target: Qwen3.6-27B-Q4_K_M.gguf (17 GB on disk)
  • Draft: z-lab/Qwen3.6-27B-DFlash model.safetensors
  • Drafter: Qwen3-0.6B-BF16.gguf

The gap: bench vs server

The benchmark (pflash/tests/bench_niah_cpp.py) does this per-request dance:

# 1. Compress with drafter (daemon idles at ~3 GB after parking target+draft)
compressed_ids = dflash.compress(ids, args.keep_ratio, args.drafter_gguf)

# 2. Free drafter, restore target+draft for generation
dflash.free_drafter()       # release drafter weights + KV + BSA scratch
dflash.unpark_target()      # ~16 GB
dflash.unpark_draft()       # +draft weights for spec decode

# 3. Generate on compressed prompt (fits in small max_ctx)
out_ids = dflash.generate(target_ids, args.n_gen)

# 4. Re-park draft for next iteration
dflash.park_draft()

Key: --max-ctx 16384 is sized for the compressed prompt, not the source. After compressing 128K → 2.5K tokens (keep_ratio=0.02), generation fits with minimal KV cache.
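
For concreteness, the token arithmetic behind the bench's --max-ctx choice looks roughly like this (keep_ratio comes from the run above; the generation budget of 512 is an assumed figure, and the real KV footprint also depends on the model's layer/head layout):

# Back-of-the-envelope sizing for the bench configuration (illustrative only).
source_tokens = 131072                        # 128K source prompt
keep_ratio = 0.02                             # bench setting used above
compressed = int(source_tokens * keep_ratio)  # ~2621 tokens after compression
n_gen = 512                                   # assumed generation budget
max_ctx = 16384                               # what the bench passes to the daemon

# The compressed prompt plus generation fits comfortably inside max_ctx,
# so the eagerly allocated KV cache stays small.
assert compressed + n_gen <= max_ctx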

server.py does none of this:

  • Loads target + draft at startup, keeps them resident forever
  • Allocates max_ctx KV cache eagerly on daemon spawn
  • Has no park/unpark between prefill and generate
  • --budget 0 does not skip draft loading (clamped to 64 in C++ config)

Configurations tested via server.py

All failed with OOM. Each run reached [daemon] ready, then crashed on the first request (a ~20-23K-token prompt).

Boot log (common)

[target] target loaded: 851 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
[draft]  loaded
[daemon] ready

Target + draft = ~20 GB VRAM. Only ~4 GB headroom.

1. PFlash on, max-ctx 200K

python scripts/server.py --target models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft models/draft --bin build/test_dflash \
  --max-ctx 200000 --budget 16 --fa-window 0 \
  --prefill-compression auto --prefill-drafter models/Qwen3-0.6B-BF16.gguf

Result: OOM during target prefill intermediates (before PFlash even fires)

[prompt] 16904 tokens
[prefill] token-seg ubatch=384
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1215.29 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 1274320896
prefill build @src/qwen3_0p6b_loader.cpp

2. PFlash on, max-ctx 131K, larger FA window

--max-ctx 131072 --budget 22 --fa-window 32000 --prefill-compression auto ...

Result: Prefill succeeded but rollback cache OOM

[prefill] 20042 tokens in 15.23 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1854.38 MiB on device 0: cudaMalloc failed: out of memory
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

3. PFlash off, budget=0 (no spec-decode), max-ctx 65K

--max-ctx 65536 --budget 0 --fa-window 16384 --prefill-compression off

Result: Prefill succeeded but rollback cache OOM (~5 GB)

[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 21051 tokens in 16.30 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

4. PFlash off, budget=0, max-ctx 24K (smallest tested)

--max-ctx 24000 --budget 0 --fa-window 16384 --prefill-compression off

Result: Same ~5 GB rollback OOM — identical allocation size regardless of max-ctx

[cfg] ... budget=64 temp=1.00 ... fa_window=16384
[prefill] 23071 tokens in 17.96 s, last_tok=0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4957.12 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 5197922304
cache migration: ggml_backend_alloc_ctx_tensors failed for rollback cache

Bugs observed

  1. --budget 0 does not disable draft loading. The C++ config clamps it to budget=64, so the draft model (~5 GB) is still loaded: dead VRAM even though spec-decode is effectively off. A possible Python-side guard is sketched after this list.

  2. Rollback cache allocated unconditionally. The allocation is ~5 GB and identical across max-ctx 65536 and 24000, different prompt lengths, and an effective budget of 0 (no draft tokens in play).

  3. server.py does not implement the park/unpark dance that makes 24 GB cards work. The benchmark proves it is possible — but requires the two-phase compress → generate protocol from bench_niah_cpp.py.
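
For bug 1, a minimal sketch of what a Python-side guard in server.py could look like. The --no-draft flag is hypothetical, and the C++ daemon would also need to stop clamping budget to 64 and tolerate a missing draft; this only shows the intent:

import argparse

# Hypothetical flags for server.py (sketch only; not in the current script).
parser = argparse.ArgumentParser()
parser.add_argument("--draft", default=None)
parser.add_argument("--budget", type=int, default=16)
parser.add_argument("--no-draft", action="store_true",
                    help="skip draft loading entirely; implies budget=0")
args = parser.parse_args()

if args.no_draft or args.budget <= 0:
    args.budget = 0    # keep the Python-side value at 0 instead of letting the daemon clamp it
    args.draft = None  # never pass a draft path, so ~5 GB of draft weights are never loaded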

What works

The bench harness at pflash/tests/bench_niah_cpp.py completes successfully:

  • Spawns daemon with --max-ctx 16384 (sized for compressed prompt)
  • Parks target+draft → compresses with drafter → frees drafter
  • Unparks target+draft → generates on compressed prompt → parks draft
  • 128K source → 2.5K compressed → fits in small KV cache

Request

One (or more) of the following:

  1. Make server.py implement the park/unpark dance for PFlash requests, matching the bench protocol (a rough sketch is included at the end of this issue)
  2. Add a --no-draft flag to skip draft loading when budget=0
  3. Skip rollback cache allocation when budget ≤ 0
  4. Document that server.py does not support 24 GB cards and provide an alternative entry point

The benchmark proves the algorithm works on 24 GB — but only with careful VRAM sequencing that server.py does not implement.
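
For reference, request 1 could look roughly like the sketch below, reusing the dflash calls from the bench. handle_request is illustrative, dflash.park_target is assumed to exist as the counterpart of unpark_target, and the bench actually assembles its generation input (target_ids) from the compressed prompt, which is glossed over here:

def handle_request(ids, n_gen, keep_ratio, drafter_gguf):
    # Phase 1: park the big models and compress the long prompt with the drafter.
    dflash.park_target()            # assumed counterpart of unpark_target()
    dflash.park_draft()
    compressed_ids = dflash.compress(ids, keep_ratio, drafter_gguf)
    dflash.free_drafter()           # release drafter weights + KV + BSA scratch

    # Phase 2: restore target + draft and generate on the short compressed prompt.
    dflash.unpark_target()
    dflash.unpark_draft()
    out_ids = dflash.generate(compressed_ids, n_gen)  # bench builds target_ids from the compressed prompt

    # Phase 3: park the draft again so the next request can start with a compression pass.
    dflash.park_draft()
    return out_ids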
