feat(dflash): native multi-request scheduler with batched target step#135

Open
javierpazo wants to merge 1 commit into Luce-Org:main from
javierpazo:xabicasa/dflash-multi-request-scheduler-batched-target-step

Conversation

@javierpazo (Contributor) commented on May 9, 2026

Summary

Brings concurrent multi-request execution to test_dflash on a
single GPU. Internally this is one cohesive unit; I'm happy to split
it into four sequential PRs (A / B / C / D below) if you prefer, per
CONTRIBUTING's "one concern per PR" — just let me know and I'll
re-open it as a chain. I kept it bundled because the four pieces
share the same hunks of test_dflash.cpp (~+2130 lines) and splitting
them cleanly would require careful hunk surgery, but I'm willing to
do that on request.

Pieces in this PR

A. Multi TargetCache slots

  • CLI: --target-cache-slots=N (alias --cache-slots=N)
  • prefix SLOT <id> routes commands to a specific slot
  • DaemonSlotState + RAII ActiveDaemonSlot for safe switching (a
    sketch follows this list)
  • LIST_TARGET_CACHE_SLOTS for introspection
  • all slots share target/draft weights; only KV / SSM / scratch is
    per-slot
  • create_target_cache gains an n_seqs parameter so a single
    cache can be allocated batched up front
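
Roughly, the per-slot state and the RAII switch fit together like
this (illustrative sketch only; field names and the switching
mechanics are simplified stand-ins, not the exact test_dflash.cpp
code — only DaemonSlotState, ActiveDaemonSlot and create_target_cache
are names from this PR):

```cpp
#include <cstdint>

struct TargetCache;                  // opaque; allocated by create_target_cache

struct DaemonSlotState {
    int          id    = 0;
    TargetCache* cache = nullptr;    // per-slot KV / SSM / scratch
    uint64_t     epoch = 0;          // bumped on CANCEL so stale CONTINUEs are rejected
};

// Routes subsequent commands to the requested slot and restores the
// previously active slot on scope exit, even on early return or error.
class ActiveDaemonSlot {
public:
    ActiveDaemonSlot(int& active_slot, int target_slot)
        : active_(active_slot), previous_(active_slot) {
        active_ = target_slot;
    }
    ~ActiveDaemonSlot() { active_ = previous_; }
    ActiveDaemonSlot(const ActiveDaemonSlot&) = delete;
    ActiveDaemonSlot& operator=(const ActiveDaemonSlot&) = delete;
private:
    int& active_;
    int  previous_;
};
```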

B. Tagged stream protocol (opt-in)

  • --stream-tagged emits frames [-2, request_id, token] instead
    of bare int32 tokens; sentinels -4 (CONTINUE), -1 (DONE)
  • parser recognises REQ <id> / REQUEST <id> headers
  • legacy bare-int32 streaming is unchanged when the flag is off
  • lets a client demux multiple concurrent requests over the same
    stdout (a client-side sketch follows this list)
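
For a client, demuxing the tagged stream can look roughly like this
(illustrative only; it assumes the stream is a flat run of
native-endian int32 values and that the sentinels ride in the token
position of a frame, which may differ from the exact wire layout in
test_dflash.cpp):

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

constexpr int32_t kFrameTag = -2;   // start of a [tag, request_id, token] frame
constexpr int32_t kContinue = -4;   // request yielded its quantum, more to come
constexpr int32_t kDone     = -1;   // request finished

int main() {
    std::map<int32_t, std::vector<int32_t>> tokens_by_request;
    int32_t tag = 0;
    while (std::fread(&tag, sizeof(tag), 1, stdin) == 1) {
        if (tag != kFrameTag) continue;          // ignore anything outside a frame
        int32_t request_id = 0, token = 0;
        if (std::fread(&request_id, sizeof(request_id), 1, stdin) != 1) break;
        if (std::fread(&token, sizeof(token), 1, stdin) != 1) break;
        if (token == kDone) {
            std::fprintf(stderr, "request %d done (%zu tokens)\n",
                         request_id, tokens_by_request[request_id].size());
        } else if (token != kContinue) {
            tokens_by_request[request_id].push_back(token);
        }
    }
    return 0;
}
```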

C. Native quantum scheduler

  • dispatch table for REQ/SLOT/START, SCHED_STEP,
    SCHED_DRAIN, LIST_REQUESTS
  • cursor-based fair round-robin between admitted requests
  • non-blocking reader thread admits new requests during a drain
  • PendingQuantum{slot, req, epoch, n_gen} carries the unit of
    work (a sketch follows this list)
  • CONTINUE / CONT resumes a slot without re-prefilling
  • REQ <id> CANCEL invalidates a request and bumps the slot
    epoch so a stale CONTINUE is rejected; RESTORE_CHAIN and
    legacy generate refuse to overwrite a slot that is owned by
    an active scheduler request
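
The scheduling core reduces to a cursor walk over the admitted
requests, roughly like this (illustrative sketch; PendingQuantum's
fields are from this PR, the surrounding types and the next_quantum
helper are simplified stand-ins):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

struct PendingQuantum {
    int      slot  = 0;   // which TargetCache slot this quantum runs on
    int      req   = 0;   // request id
    uint64_t epoch = 0;   // slot epoch at admission; stale epochs are rejected
    int      n_gen = 0;   // tokens to generate in this quantum
};

struct AdmittedRequest {
    int      id      = 0;
    int      slot    = 0;
    uint64_t epoch   = 0;
    int      quantum = 2;   // tokens per turn
    bool     done    = false;
};

// Picks the next runnable request after `cursor`, wrapping around, so every
// admitted request gets one turn before any request gets a second one.
std::optional<PendingQuantum> next_quantum(std::deque<AdmittedRequest>& reqs,
                                           size_t& cursor) {
    for (size_t i = 0; i < reqs.size(); ++i) {
        AdmittedRequest& r = reqs[(cursor + 1 + i) % reqs.size()];
        if (r.done) continue;
        cursor = (cursor + 1 + i) % reqs.size();
        return PendingQuantum{r.slot, r.id, r.epoch, r.quantum};
    }
    return std::nullopt;   // nothing runnable
}
```

In this sketch, SCHED_DRAIN would keep calling next_quantum until it
returns nullopt, while the reader thread appends newly admitted
requests to the deque between turns.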

D. Fused batched target step (CUDA path)

  • new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
    SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
    SCHED_BATCH_DRAIN
  • QwenGraphInputs gains n_seqs; build_delta_net_block
    accepts n_seqs > 1
  • target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
    when batched, and the chain's forward passes capture features
    per-seq (an indexing sketch follows this list)
  • rollback for partially accepted draft tokens, multi-token verify
    and parent-id propagation in the batched path are noted as
    follow-ups; today the batched step accepts the cleanest case
    and falls back to single-seq when needed
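
Indexing into the batched buffer then looks roughly like this
(illustrative; it assumes the 5*hidden feature axis is contiguous in
memory, and the view type is a stand-in — only the shape itself
comes from this PR):

```cpp
#include <cstddef>

struct TargetFeatView {
    float* data   = nullptr;
    size_t hidden = 0;   // model hidden size
    size_t cap    = 0;   // target_feat_cap: max captured positions per sequence
    size_t n_seqs = 1;

    // Feature vector (length 5*hidden) captured at position `pos` of sequence `seq`.
    float* at(size_t seq, size_t pos) const {
        return data + (seq * cap + pos) * (5 * hidden);
    }

    // Total number of floats to allocate for the batched buffer.
    size_t size() const { return 5 * hidden * cap * n_seqs; }
};
```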

Validation

Per CONTRIBUTING ("benchmark before and after on the same hardware,
same warmup"). Setup: single GPU (1× RTX 6000 Ada, sm_89), Heretic
Q4_K_M target, Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV
q4_0/q4_0:

  • Two concurrent requests (REQ 4 START SLOT 0 quantum=2 + REQ 5
    START SLOT 1 quantum=2, then SCHED_DRAIN): both close cleanly;
    slot 0 = 18.41 tok/s, slot 1 = 22.50 tok/s
  • Mid-drain admission of REQ 6: succeeds; CONTINUE on slot 0
    resumes without re-prefill
  • batch_probe_compare_ok over a 2-seq probe: mismatches = 0 vs the
    single-seq path
  • batch_tail_commit (2 completed pending quanta): 29.26 ms
  • batch_step_commit followed by SCHED_DRAIN: 29.57 ms, then reverts
    cleanly back to the DFlash single-seq path

Methodology: warmup of 1 request before measurement; same --budget
and KV-quant settings across runs; nothing else competing on the GPU
during the measurement window.

Compatibility

  • All new behaviour is opt-in. Default invocation of test_dflash
    with no scheduler flags keeps the legacy single-request path
    byte-identical.
  • Tagged streaming is gated behind --stream-tagged.
  • Multi-slot is gated behind --target-cache-slots=N (default N=1).
  • Batched target step reached only via the SCHED_BATCH_* command
    family; legacy SCHED_STEP keeps using the single-seq path.
  • Hot-loop diagnostic logs (sync_us / step_debug) are gated
    behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
    default path is unchanged (a minimal gating sketch follows).
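
The gating itself is the usual read-the-env-var-once pattern, roughly
(illustrative; only the variable names are real, the helper and log
strings are stand-ins):

```cpp
#include <cstdio>
#include <cstdlib>

static bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}

// Read once into statics so the per-step cost is a branch on a cached bool,
// not a getenv call in the hot loop.
static const bool timing_debug = env_flag("DFLASH27B_TIMING_DEBUG");
static const bool step_debug   = env_flag("DFLASH27B_STEP_DEBUG");

void log_step_diagnostics(int step, double sync_us) {
    if (timing_debug) std::fprintf(stderr, "sync_us=%.1f\n", sync_us);
    if (step_debug)   std::fprintf(stderr, "step_debug step=%d\n", step);
}
```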

Verification vs existing community PRs

  • No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
    a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
    epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse)
    and PR Luce-Org#62 (split target/draft StepGraphs); both reuse or
    split graphs, but neither exposes a multi-request slot protocol.
  • No upstream collision found for tagged stream framing or
    --target-cache-slots.

Notes

  • Diff size warning: this branch was extracted from a working tree
    that drifted from main. If a hunk fails to apply on a fresh
    rebase or you spot anything off, ping me and I'll fix on the
    spot rather than push through.
  • Companion branches with smaller follow-ups (CMake sm_89 / BSA,
    gguf_draft_loader fallback, FP16 safetensors drafter, daemon
    scripts improvements, SWA mask wiring + contract test, PFlash
    operator notes) are sitting on
    https://github.com/javierpazo/lucebox-hub. Holding off on
    opening those until this one is in a known state.

Commit message from Javier Pazó (@xabicasa) <xabicasa@gmail.com>:

This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:

1. Multi TargetCache slots
   - CLI: --target-cache-slots=N (alias --cache-slots=N)
   - prefix `SLOT <id>` routes commands to a specific slot
   - DaemonSlotState + RAII ActiveDaemonSlot for safe switching
   - LIST_TARGET_CACHE_SLOTS for introspection
   - all slots share target/draft weights; only KV/SSM/scratch is
     per-slot
   - create_target_cache gains an `n_seqs` parameter so a single
     cache can be allocated batched up front

2. Tagged stream protocol (opt-in)
   - --stream-tagged emits frames `[-2, request_id, token]` instead
     of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
   - parser recognises `REQ <id>` / `REQUEST <id>` headers
   - legacy bare-int32 streaming is unchanged when the flag is off
   - this lets a client demux multiple concurrent requests over the
     same stdout

3. Native quantum scheduler
   - dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
     LIST_REQUESTS
   - cursor-based fair round-robin between admitted requests
   - non-blocking reader thread admits new requests during a drain
   - PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
   - CONTINUE / CONT resumes a slot without re-prefilling
   - REQ <id> CANCEL invalidates a request and bumps the slot epoch
     so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
     refuse to overwrite a slot that is owned by an active scheduler
     request

4. Fused batched target step (CUDA path)
   - new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
     SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
     SCHED_BATCH_DRAIN
   - QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
     n_seqs > 1
   - target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
     when batched and the chain forwards capture features per-seq
   - batch_probe_compare_ok smoke shows mismatches=0 vs the
     single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
     pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
     next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
     req12/req13 with two batched steps each
   - rollback for partially accepted draft tokens, multi-token verify
     and parent-id propagation in the batched path are noted as
     follow-ups; today the batched step accepts the cleanest case
     and falls back to single-seq when needed

Validation (single GPU: 1× RTX 6000 Ada, sm_89; Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter; FA_WINDOW=0; KV q4_0/q4_0):

- Two concurrent requests:
    REQ 4 START SLOT 0 quantum=2
    REQ 5 START SLOT 1 quantum=2
    SCHED_DRAIN closes both cleanly.
    slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
  without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
  back to the DFlash single-seq path.

Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
  family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
  behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
  default path is unchanged.

Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
  a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
  epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
  PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
  but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
  --target-cache-slots.

Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.

Author: Javier Pazo <xabicasa@gmail.com>
javierpazo changed the title from "dflash: native multi-request scheduler with batched target step" to "feat(dflash): native multi-request scheduler with batched target step" on May 9, 2026
@davide221 (Contributor) commented:

Amazing contribution @javierpazo, thank you! Can you resolve the conflicts?
