feat(dflash): native multi-request scheduler with batched target step #135
javierpazo wants to merge 1 commit into Luce-Org:main
Conversation
This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:
1. Multi TargetCache slots
- CLI: --target-cache-slots=N (alias --cache-slots=N)
- prefix `SLOT <id>` routes commands to a specific slot
- DaemonSlotState + RAII ActiveDaemonSlot for safe switching
- LIST_TARGET_CACHE_SLOTS for introspection
- all slots share target/draft weights; only KV/SSM/scratch is
per-slot
- create_target_cache gains an `n_seqs` parameter so a single
cache can be allocated batched up front
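The RAII switching idea can be sketched as follows. `DaemonSlotState` and `ActiveDaemonSlot` are the names from this PR, but the fields, the global current-slot index, and the handler shown here are illustrative assumptions, not the actual implementation:

```cpp
#include <cassert>

// Hypothetical per-slot state: KV/SSM/scratch are per-slot;
// target/draft weights are shared and therefore not stored here.
struct DaemonSlotState {
    int id = -1;
    bool busy = false;   // owned by an active scheduler request
};

// Illustrative global: the slot the daemon is currently routed to.
static int g_active_slot = 0;

// RAII guard: switches the daemon to `slot_id` for the scope of one
// command and restores the previous slot on destruction, so an early
// return cannot leave the daemon pointing at the wrong slot.
class ActiveDaemonSlot {
public:
    explicit ActiveDaemonSlot(int slot_id) : prev_(g_active_slot) {
        g_active_slot = slot_id;
    }
    ~ActiveDaemonSlot() { g_active_slot = prev_; }
    ActiveDaemonSlot(const ActiveDaemonSlot&) = delete;
    ActiveDaemonSlot& operator=(const ActiveDaemonSlot&) = delete;
private:
    int prev_;
};

// Example handler: a `SLOT <id>`-prefixed command runs under a guard.
int run_on_slot(int slot_id) {
    ActiveDaemonSlot guard(slot_id);
    return g_active_slot;  // the command body sees the routed slot
}
```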
2. Tagged stream protocol (opt-in)
- --stream-tagged emits frames `[-2, request_id, token]` instead
of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
- parser recognises `REQ <id>` / `REQUEST <id>` headers
- legacy bare-int32 streaming is unchanged when the flag is off
- this lets a client demux multiple concurrent requests over the
same stdout
3. Native quantum scheduler
- dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
LIST_REQUESTS
- cursor-based fair round-robin between admitted requests
- non-blocking reader thread admits new requests during a drain
- PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
- CONTINUE / CONT resumes a slot without re-prefilling
- REQ <id> CANCEL invalidates a request and bumps the slot epoch
so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
refuse to overwrite a slot that is owned by an active scheduler
request
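The epoch-hardening rule can be sketched in a few lines. `PendingQuantum`'s fields match the PR description; the fixed slot count and the surrounding bookkeeping are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>

// A quantum carries the slot epoch it was admitted under.
struct PendingQuantum {
    int      slot;
    int      req;
    uint64_t epoch;   // slot epoch at admission time
    int      n_gen;   // tokens to generate in this quantum
};

struct SlotEpochs {
    uint64_t epoch[8] = {0};  // illustrative fixed slot count

    // REQ <id> CANCEL: invalidate the request by bumping the slot epoch.
    void cancel(int slot) { epoch[slot]++; }

    // CONTINUE is honoured only if the quantum's epoch is still current,
    // so a stale CONTINUE issued before a CANCEL is rejected.
    bool can_continue(const PendingQuantum& q) const {
        return q.epoch == epoch[q.slot];
    }
};
```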
4. Fused batched target step (CUDA path)
- new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
SCHED_BATCH_DRAIN
- QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
n_seqs > 1
- target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
when batched and the chain forwards capture features per-seq
- batch_probe_compare_ok smoke shows mismatches=0 vs the
single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
req12/req13 with two batched steps each
- rollback for partially accepted draft tokens, multi-token verify
and parent-id propagation in the batched path are noted as
follow-ups; today the batched step accepts the cleanest case
and falls back to single-seq when needed
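For reference, indexing into a `[5*hidden, target_feat_cap, n_seqs]` buffer could be done with a helper like the one below. The first-dimension-fastest layout (as in ggml-style tensors) is an assumption for illustration; the helper name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Flat offset into a [5*hidden, target_feat_cap, n_seqs] target_feat
// buffer, assuming the first dimension is contiguous (fastest-varying),
// then position, then sequence.
size_t target_feat_offset(size_t hidden, size_t target_feat_cap,
                          size_t feat, size_t pos, size_t seq) {
    size_t d0 = 5 * hidden;  // feature dimension
    return feat + d0 * (pos + target_feat_cap * seq);
}
```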
Validation (single GPU: 1× RTX 6000 Ada, sm_89; Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter; FA_WINDOW=0; KV q4_0/q4_0):
- Two concurrent requests:
REQ 4 START SLOT 0 quantum=2
REQ 5 START SLOT 1 quantum=2
  SCHED_DRAIN closes both cleanly.
slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
back to the DFlash single-seq path.
Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path
  byte-identical.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
default path is unchanged.
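The env-var gate for the hot-loop diagnostics can be sketched as below. The variable name matches the PR; the helper name and the read-once caching are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdlib>

// Gate hot-loop diagnostics behind an env var, read once and cached,
// so the default path pays only a predictable branch per check.
static bool timing_debug_enabled() {
    static const bool enabled =
        std::getenv("DFLASH27B_TIMING_DEBUG") != nullptr;
    return enabled;
}
```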
Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
  epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse;
  still single-seq) and PR Luce-Org#62 (split target/draft StepGraphs;
  splits but stays single-seq per spec-decode step); neither exposes a
  multi-request slot protocol.
- No upstream collision found for tagged stream framing or
--target-cache-slots.
Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.
Author: Javier Pazo <xabicasa@gmail.com>
Amazing contribution @javierpazo, thank you! Can you resolve the conflicts?
Summary
Brings concurrent multi-request execution to test_dflash on a single
GPU. Internally one cohesive unit; happy to split into four sequential
PRs (slots / tagged stream / quantum scheduler / batched target step)
if you prefer, per CONTRIBUTING's "one concern per PR" — let me know
and I'll re-open as a chain. I kept it bundled because the four pieces
share the same hunks of test_dflash.cpp (~+2130 lines) and splitting
cleanly would require careful hunk surgery; doing it on request is fine.
Validation
Per CONTRIBUTING ("benchmark before and after on the same hardware,
same warmup"). Single GPU: 1× RTX 6000 Ada (sm_89), Heretic Q4_K_M
target, Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV
q4_0/q4_0. The runs are the ones listed above: two concurrent requests
drained via SCHED_DRAIN, mid-drain admission with CONTINUE, the 2-seq
batch probe, and the batched tail/step commits.
Methodology: warmup of 1 request before measurement; same --budget and
KV-quant settings across runs; nothing else competing on the GPU
during the measurement window.
Notes
Parts of this diff drifted from main; if a hunk fails to apply on a
fresh rebase or you spot anything off, ping me and I'll fix on the
spot rather than push through.
Further follow-ups (gguf_draft_loader fallback, FP16 safetensors
drafter, daemon scripts improvements, SWA mask wiring + contract test,
PFlash operator notes) are sitting on
https://github.com/javierpazo/lucebox-hub. Holding off on opening
those until this one is in a known state.
Javier Pazó — @xabicasa — xabicasa@gmail.com