feat(dflash): native multi-request scheduler with batched target step#135

Open
javierpazo wants to merge 1 commit into Luce-Org:main from
javierpazo:xabicasa/dflash-multi-request-scheduler-batched-target-step

Conversation

@javierpazo (Contributor) commented on May 9, 2026

Summary

Brings concurrent multi-request execution to test_dflash on a
single GPU. Internally this is one cohesive unit; I'm happy to split
it into four sequential PRs (A / B / C / D below) if you prefer, per
CONTRIBUTING's "one concern per PR" — just let me know and I'll
re-open it as a chain. I kept it bundled because the four pieces
share the same hunks of test_dflash.cpp (~+2130 lines) and splitting
them cleanly would require careful hunk surgery, but I'm willing to
do that on request.

Pieces in this PR

A. Multi TargetCache slots

  • CLI: --target-cache-slots=N (alias --cache-slots=N)
  • prefix SLOT <id> routes commands to a specific slot
  • DaemonSlotState + RAII ActiveDaemonSlot for safe switching (a
    sketch follows this list)
  • LIST_TARGET_CACHE_SLOTS for introspection
  • all slots share target/draft weights; only KV / SSM / scratch is
    per-slot
  • create_target_cache gains an n_seqs parameter so a single
    cache can be allocated batched up front
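
Roughly, the per-slot state and the RAII switch fit together like
this (illustrative sketch only; field names and the switching
mechanics are simplified stand-ins, not the exact test_dflash.cpp
code — only DaemonSlotState, ActiveDaemonSlot and create_target_cache
are names from this PR):

```cpp
#include <cstdint>

struct TargetCache;                  // opaque; allocated by create_target_cache

struct DaemonSlotState {
    int          id    = 0;
    TargetCache* cache = nullptr;    // per-slot KV / SSM / scratch
    uint64_t     epoch = 0;          // bumped on CANCEL so stale CONTINUEs are rejected
};

// Routes subsequent commands to the requested slot and restores the
// previously active slot on scope exit, even on early return or error.
class ActiveDaemonSlot {
public:
    ActiveDaemonSlot(int& active_slot, int target_slot)
        : active_(active_slot), previous_(active_slot) {
        active_ = target_slot;
    }
    ~ActiveDaemonSlot() { active_ = previous_; }
    ActiveDaemonSlot(const ActiveDaemonSlot&) = delete;
    ActiveDaemonSlot& operator=(const ActiveDaemonSlot&) = delete;
private:
    int& active_;
    int  previous_;
};
```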

B. Tagged stream protocol (opt-in)

  • --stream-tagged emits frames [-2, request_id, token] instead
    of bare int32 tokens; sentinels -4 (CONTINUE), -1 (DONE)
  • parser recognises REQ <id> / REQUEST <id> headers
  • legacy bare-int32 streaming is unchanged when the flag is off
  • lets a client demux multiple concurrent requests over the same
    stdout (a client-side sketch follows this list)
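
For a client, demuxing the tagged stream can look roughly like this
(illustrative only; it assumes the stream is a flat run of
native-endian int32 values and that the sentinels ride in the token
position of a frame, which may differ from the exact wire layout in
test_dflash.cpp):

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

constexpr int32_t kFrameTag = -2;   // start of a [tag, request_id, token] frame
constexpr int32_t kContinue = -4;   // request yielded its quantum, more to come
constexpr int32_t kDone     = -1;   // request finished

int main() {
    std::map<int32_t, std::vector<int32_t>> tokens_by_request;
    int32_t tag = 0;
    while (std::fread(&tag, sizeof(tag), 1, stdin) == 1) {
        if (tag != kFrameTag) continue;          // ignore anything outside a frame
        int32_t request_id = 0, token = 0;
        if (std::fread(&request_id, sizeof(request_id), 1, stdin) != 1) break;
        if (std::fread(&token, sizeof(token), 1, stdin) != 1) break;
        if (token == kDone) {
            std::fprintf(stderr, "request %d done (%zu tokens)\n",
                         request_id, tokens_by_request[request_id].size());
        } else if (token != kContinue) {
            tokens_by_request[request_id].push_back(token);
        }
    }
    return 0;
}
```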

C. Native quantum scheduler

  • dispatch table for REQ/SLOT/START, SCHED_STEP,
    SCHED_DRAIN, LIST_REQUESTS
  • cursor-based fair round-robin between admitted requests
  • non-blocking reader thread admits new requests during a drain
  • PendingQuantum{slot, req, epoch, n_gen} carries the unit of
    work (a sketch follows this list)
  • CONTINUE / CONT resumes a slot without re-prefilling
  • REQ <id> CANCEL invalidates a request and bumps the slot
    epoch so a stale CONTINUE is rejected; RESTORE_CHAIN and
    legacy generate refuse to overwrite a slot that is owned by
    an active scheduler request
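
The scheduling core reduces to a cursor walk over the admitted
requests, roughly like this (illustrative sketch; PendingQuantum's
fields are from this PR, the surrounding types and the next_quantum
helper are simplified stand-ins):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

struct PendingQuantum {
    int      slot  = 0;   // which TargetCache slot this quantum runs on
    int      req   = 0;   // request id
    uint64_t epoch = 0;   // slot epoch at admission; stale epochs are rejected
    int      n_gen = 0;   // tokens to generate in this quantum
};

struct AdmittedRequest {
    int      id      = 0;
    int      slot    = 0;
    uint64_t epoch   = 0;
    int      quantum = 2;   // tokens per turn
    bool     done    = false;
};

// Picks the next runnable request after `cursor`, wrapping around, so every
// admitted request gets one turn before any request gets a second one.
std::optional<PendingQuantum> next_quantum(std::deque<AdmittedRequest>& reqs,
                                           size_t& cursor) {
    for (size_t i = 0; i < reqs.size(); ++i) {
        AdmittedRequest& r = reqs[(cursor + 1 + i) % reqs.size()];
        if (r.done) continue;
        cursor = (cursor + 1 + i) % reqs.size();
        return PendingQuantum{r.slot, r.id, r.epoch, r.quantum};
    }
    return std::nullopt;   // nothing runnable
}
```

In this sketch, SCHED_DRAIN would keep calling next_quantum until it
returns nullopt, while the reader thread appends newly admitted
requests to the deque between turns.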

D. Fused batched target step (CUDA path)

  • new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
    SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
    SCHED_BATCH_DRAIN
  • QwenGraphInputs gains n_seqs; build_delta_net_block
    accepts n_seqs > 1
  • target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
    when batched, and the chain's forward passes capture features
    per-seq (an indexing sketch follows this list)
  • rollback for partially accepted draft tokens, multi-token verify
    and parent-id propagation in the batched path are noted as
    follow-ups; today the batched step accepts the cleanest case
    and falls back to single-seq when needed
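
Indexing into the batched buffer then looks roughly like this
(illustrative; it assumes the 5*hidden feature axis is contiguous in
memory, and the view type is a stand-in — only the shape itself
comes from this PR):

```cpp
#include <cstddef>

struct TargetFeatView {
    float* data   = nullptr;
    size_t hidden = 0;   // model hidden size
    size_t cap    = 0;   // target_feat_cap: max captured positions per sequence
    size_t n_seqs = 1;

    // Feature vector (length 5*hidden) captured at position `pos` of sequence `seq`.
    float* at(size_t seq, size_t pos) const {
        return data + (seq * cap + pos) * (5 * hidden);
    }

    // Total number of floats to allocate for the batched buffer.
    size_t size() const { return 5 * hidden * cap * n_seqs; }
};
```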

Validation

Per CONTRIBUTING ("benchmark before and after on the same hardware,
same warmup"). Setup: single GPU (1× RTX 6000 Ada, sm_89), Heretic
Q4_K_M target, Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV
q4_0/q4_0:

  • Two concurrent requests (REQ 4 START SLOT 0 quantum=2 + REQ 5
    START SLOT 1 quantum=2, then SCHED_DRAIN): both close cleanly;
    slot 0 = 18.41 tok/s, slot 1 = 22.50 tok/s
  • Mid-drain admission of REQ 6: succeeds; CONTINUE on slot 0
    resumes without re-prefill
  • batch_probe_compare_ok over a 2-seq probe: mismatches = 0 vs the
    single-seq path
  • batch_tail_commit (2 completed pending quanta): 29.26 ms
  • batch_step_commit followed by SCHED_DRAIN: 29.57 ms, then reverts
    cleanly back to the DFlash single-seq path

Methodology: warmup of 1 request before measurement; same --budget
and KV-quant settings across runs; nothing else competing on the GPU
during the measurement window.

Compatibility

  • All new behaviour is opt-in. Default invocation of test_dflash
    with no scheduler flags keeps the legacy single-request path
    byte-identical.
  • Tagged streaming is gated behind --stream-tagged.
  • Multi-slot is gated behind --target-cache-slots=N (default N=1).
  • Batched target step reached only via the SCHED_BATCH_* command
    family; legacy SCHED_STEP keeps using the single-seq path.
  • Hot-loop diagnostic logs (sync_us / step_debug) are gated
    behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
    default path is unchanged (a minimal gating sketch follows).
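
The gating itself is the usual read-the-env-var-once pattern, roughly
(illustrative; only the variable names are real, the helper and log
strings are stand-ins):

```cpp
#include <cstdio>
#include <cstdlib>

static bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && v[0] != '0';
}

// Read once into statics so the per-step cost is a branch on a cached bool,
// not a getenv call in the hot loop.
static const bool timing_debug = env_flag("DFLASH27B_TIMING_DEBUG");
static const bool step_debug   = env_flag("DFLASH27B_STEP_DEBUG");

void log_step_diagnostics(int step, double sync_us) {
    if (timing_debug) std::fprintf(stderr, "sync_us=%.1f\n", sync_us);
    if (step_debug)   std::fprintf(stderr, "step_debug step=%d\n", step);
}
```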

Verification vs existing community PRs

  • No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
    a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
    epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse)
    and PR Luce-Org#62 (split target/draft StepGraphs); both reuse or
    split graphs, but neither exposes a multi-request slot protocol.
  • No upstream collision found for tagged stream framing or
    --target-cache-slots.

Notes

  • Diff size warning: this branch was extracted from a working tree
    that drifted from main. If a hunk fails to apply on a fresh
    rebase or you spot anything off, ping me and I'll fix on the
    spot rather than push through.
  • Companion branches with smaller follow-ups (CMake sm_89 / BSA,
    gguf_draft_loader fallback, FP16 safetensors drafter, daemon
    scripts improvements, SWA mask wiring + contract test, PFlash
    operator notes) are sitting on
    https://github.com/javierpazo/lucebox-hub. Holding off on
    opening those until this one is in a known state.

Commit message from Javier Pazó (@xabicasa) <xabicasa@gmail.com>:

This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:

1. Multi TargetCache slots
   - CLI: --target-cache-slots=N (alias --cache-slots=N)
   - prefix `SLOT <id>` routes commands to a specific slot
   - DaemonSlotState + RAII ActiveDaemonSlot for safe switching
   - LIST_TARGET_CACHE_SLOTS for introspection
   - all slots share target/draft weights; only KV/SSM/scratch is
     per-slot
   - create_target_cache gains an `n_seqs` parameter so a single
     cache can be allocated batched up front

2. Tagged stream protocol (opt-in)
   - --stream-tagged emits frames `[-2, request_id, token]` instead
     of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
   - parser recognises `REQ <id>` / `REQUEST <id>` headers
   - legacy bare-int32 streaming is unchanged when the flag is off
   - this lets a client demux multiple concurrent requests over the
     same stdout

3. Native quantum scheduler
   - dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
     LIST_REQUESTS
   - cursor-based fair round-robin between admitted requests
   - non-blocking reader thread admits new requests during a drain
   - PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
   - CONTINUE / CONT resumes a slot without re-prefilling
   - REQ <id> CANCEL invalidates a request and bumps the slot epoch
     so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
     refuse to overwrite a slot that is owned by an active scheduler
     request

4. Fused batched target step (CUDA path)
   - new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
     SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
     SCHED_BATCH_DRAIN
   - QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
     n_seqs > 1
   - target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
     when batched and the chain forwards capture features per-seq
   - batch_probe_compare_ok smoke shows mismatches=0 vs the
     single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
     pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
     next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
     req12/req13 with two batched steps each
   - rollback for partially accepted draft tokens, multi-token verify
     and parent-id propagation in the batched path are noted as
     follow-ups; today the batched step accepts the cleanest case
     and falls back to single-seq when needed

Validation (single GPU: 1× RTX 6000 Ada, sm_89; Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter; FA_WINDOW=0; KV q4_0/q4_0):

- Two concurrent requests:
    REQ 4 START SLOT 0 quantum=2
    REQ 5 START SLOT 1 quantum=2
    SCHED_DRAIN closes both cleanly.
    slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
  without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
  back to the DFlash single-seq path.

Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
  family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
  behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
  default path is unchanged.

Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
  a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
  epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
  PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
  but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
  --target-cache-slots.

Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.

Author: Javier Pazo <xabicasa@gmail.com>
javierpazo changed the title from "dflash: native multi-request scheduler with batched target step" to "feat(dflash): native multi-request scheduler with batched target step" on May 9, 2026
@davide221 (Contributor) commented:

Amazing contribution @javierpazo, thank you! Can you resolve the conflicts?
