
perf(pflash): add SM75 target-resident TTFT path#72

Open
weicj wants to merge 1 commit into Luce-Org:main from weicj:pflash-sm75-ttft-residency

Conversation

@weicj
Contributor

@weicj weicj commented May 1, 2026

perf(pflash): add SM75 target-resident TTFT path

Summary

This PR adds an opt-in SM75 / RTX 2080 Ti path for PFlash TTFT-oriented use cases.

  • Adds FP16 drafter compute support for SM75, including BF16->F16 GGUF load conversion when native BF16 tensor cores are unavailable.
  • Tunes the WMMA FlashPrefill fallback for SM75 with padded shared-memory layout and DFLASH_FP_K_TILE=32.
  • Adds PFlash chunk selection with rare-query lexical rescue, plus test_pflash_chunk_select.
  • Adds opt-in target-resident PFlash daemon flow:
    • DFLASH_PFLASH_KEEP_TARGET=1
    • DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1
    • default-off fallback when the DFlash draft remains parked
    • skip migrate_prefill_cache in that fallback because rollback tensors are not used
  • Documents the SM75 benchmark as an experimental TTFT path, not as full PFlash + DFlash speculative decode.

Benchmark

Hardware and environment:

  • GPU: RTX 2080 Ti 22 GB / SM75
  • Driver: 595.58.03
  • CUDA toolkit: 12.0
  • CMake: 3.28.3
  • Power limit: 280 W
  • Persistence mode: enabled
  • Target: Qwen3.6-27B Q4_K_M via test_dflash
  • PFlash drafter: Qwen3-0.6B FP16 GGUF
  • Prompt: same 16K NIAH qtail prompt
  • Methodology: warm daemon request timing after model load; tokenizers treated as preloaded; same hardware, same power limit, same prompt
Case                                  Request time  Speedup  Notes
no PFlash                             50.35 s       1.00x    original 16K prompt
current PFlash hook                   26.11 s       1.93x    parks and reloads target + draft
DFLASH_PFLASH_KEEP_TARGET=1           14.19 s       3.55x    target stays resident, draft reloads
KEEP_TARGET=1 + SKIP_DRAFT_RELOAD=1   4.13 s        12.21x   TTFT-only fallback; no speculative decode

Correctness / Validation

  • cmake --build build-sm75-f16 --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j$(nproc) passes.
  • test_pflash_chunk_select passes.
  • test_flashprefill_kernels passes on SM75:
    • mean vector max diff 0.00000
    • sparse forward max diff 0.00043
    • S=8192 e2e FlashPrefill 13.8 ms / iter
  • _prefill_hook.py passes python -m py_compile.
  • Clean PR-worktree SM75 build passes:
    cmake --build dflash/build-sm75-pr --target test_dflash test_pflash_chunk_select test_flashprefill_kernels -j24.
  • 16K NIAH quality smoke retained the key and answer on the original prompt and a 5-position synthetic NIAH sweep.

Caveats

  • This is a PFlash TTFT path. It intentionally does not claim new DFlash/DDTree decode results on RTX 2080 Ti.
  • With DFLASH_PFLASH_SKIP_DRAFT_RELOAD=1, the draft remains parked after compressed prefill. This is a default-off fallback for TTFT / very short output, not a decode-speed path.
  • A keep-target + DFlash draft reload countercheck at max_ctx=17000 hit a 106 MiB CUDA allocation OOM in the draft graph on RTX 2080 Ti.
  • The quality validation here is retrieval-style NIAH smoke, not broad Math/GSM/code/chat validation.
  • The new residency flags remain default-off.

@howard0su
Contributor

I tried this path as well, converting the draft model from BF16 to FP16 to leverage the 2080 Ti's Tensor Cores. Based on some experiments, I prefer using Q8_0; check PR #71.

  1. The 2080 Ti only has 22 GB VRAM. Q8_0 saves about 1.35 GB, which brings the memory budget in line with the main experiment platform, a 3090 with 24 GB.
  2. There is no need for BF16 on the draft model, since our main model is already quantized quite aggressively.
  3. A small perf gain compared to FP16 (8.0% on my machine).


@cubic-dev-ai cubic-dev-ai Bot left a comment


3 issues found across 18 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/qwen3_0p6b_loader.cpp">

<violation number="1" location="dflash/src/qwen3_0p6b_loader.cpp:92">
P2: BF16 capability check uses compile-time minimum SM instead of runtime GPU capability, causing false negatives on SM80+ devices in mixed-arch builds.</violation>
</file>

<file name="dflash/test/test_dflash.cpp">

<violation number="1" location="dflash/test/test_dflash.cpp:1452">
P2: Target-only/decode gating does not validate that target weights are resident, allowing generation paths to run after `park target` freed them.</violation>
</file>

<file name="dflash/test/test_flashprefill_kernels.cpp">

<violation number="1" location="dflash/test/test_flashprefill_kernels.cpp:233">
P1: Numerical validation can pass even when GPU outputs are NaN, because the max-diff accumulation ignores non-finite values.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@weicj weicj force-pushed the pflash-sm75-ttft-residency branch from 33b7869 to 4645fa8 on May 1, 2026 at 13:57
@weicj
Contributor Author

weicj commented May 1, 2026

Thanks. I pushed an update in 4645fa8.

Fixes for the three issues identified by cubic:

  • test_flashprefill_kernels: numerical validation now fails immediately on non-finite reference/output/diff values, so NaN/Inf cannot be hidden by max-diff accumulation.
  • qwen3_0p6b_loader: BF16 capability is now checked from the active CUDA device at runtime via cudaGetDeviceProperties, while DFLASH27B_DRAFT_FP16=1 still forces the FP16 path. This avoids mixed-arch false negatives on SM80+ devices.
  • test_dflash: generate now rejects requests while target weights are parked, before entering target-only or speculative decode.

I also agree Q8_0 is a good direction for the 2080 Ti path, especially for the VRAM and perf reasons in #71. I kept this PR scoped to the FP16/BF16->F16 SM75 enablement plus PFlash TTFT/residency path so it does not collide with #71; happy to either rebase onto the Q8_0 draft path after #71 lands or split a follow-up patch for that.

@cubic-dev-ai

cubic-dev-ai Bot commented May 2, 2026

@cubic-dev-ai please re-run the review/check on the current head commit.

@weicj I have started the AI code review. It will take a few minutes to complete.


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 18 files
