bench(dflash,pflash): add CUDA/HIP mixed backend placement #122

Open

weicj wants to merge 2 commits into Luce-Org:main from weicj:bench-cuda-hip-mixed-backend-placement
Conversation

@weicj (Contributor) commented May 7, 2026

bench(dflash,pflash): add CUDA/HIP mixed backend placement

Summary

Add CUDA/HIP mixed backend placement support for PFlash and DFlash bench harness execution.

This PR makes the PFlash/DFlash harness split paths backend-portable instead of CUDA-only or multi-GPU-specific. The same implementation can be built as separate CUDA or HIP binaries, which the harness then combines to cover three placement modes:

  • single backend / single device, where the split path still runs inside one selected GPU backend;
  • single backend / multiple devices, where target execution may use the existing target layer split;
  • CUDA/HIP mixed backend, where PFlash phase split or DFlash draft split crosses a host-data/process boundary.

This PR keeps mixed placement in the bench/runtime harnesses first, so the multi-device and multi-backend behavior can be validated directly before the OpenAI-compatible server path is integrated in a follow-up PR.

Changes

  • Add DFLASH27B_GPU_BACKEND=cuda|hip to the DFlash CMake build.
  • Link DFlash runtime/test targets against the selected ggml backend (ggml-cuda or ggml-hip).
  • Add a small HIP compatibility shim for the current vendored ggml HIP snapshot.
  • Route non-CUDA-WMMA PFlash execution through ggml flash_attn_ext.
  • Pass the real Q/K/V ggml_type into the portable PFlash path; F16 and BF16 share a 2-byte element size, so the type cannot be inferred from byte size (see the sketch after this list).
  • Make draft projection weight type selection backend-aware instead of gating it on a CUDA-only SM macro.
  • Extend phase_split_dual_gpu.py with:
    • --pflash-backend cuda|hip
    • --pflash-visible-devices
    • --target-backend cuda|hip
    • --target-visible-devices
    • optional target generation after PFlash compression
    • separate CUDA/HIP GPU resource monitoring
  • Add DFlash draft IPC support to test_dflash:
    • --draft-ipc-daemon
    • --draft-ipc-bin
    • --draft-ipc-gpu
    • --draft-ipc-work-dir
    • --draft-ipc-ring-cap
  • Extend bench_he.py so HumanEval-style DFlash runs can pass through the draft IPC binary/options used for mixed-backend draft split validation.
  • Document CUDA/HIP mixed backend placement in dflash/docs/MIXED_BACKEND.md and keep SPEC_PREFILL.md focused on the spec-prefill daemon reference.
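
As a rough illustration of the two PFlash items above, the portable path might look like this. The helper name and parameter list are hypothetical, and it assumes the vendored ggml snapshot exposes the usual `ggml_flash_attn_ext(ctx, q, k, v, mask, scale, max_bias, logit_softcap)` signature:

```cpp
// Hypothetical sketch of the portable PFlash attention path; not the PR's
// exact code. Q/K/V are laid out as ggml expects for flash_attn_ext.
#include "ggml.h"

static struct ggml_tensor * pflash_attn_portable(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        float                 scale,
        enum   ggml_type      qkv_type) {
    // The caller passes the real Q/K/V ggml_type: F16 and BF16 are both
    // 2 bytes per element, so elem_size alone cannot tell them apart.
    GGML_ASSERT(q->type == qkv_type && k->type == qkv_type && v->type == qkv_type);

    // The selected backend (ggml-cuda or ggml-hip) supplies the kernel, so
    // the same call covers the non-CUDA-WMMA paths on both vendors.
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL, scale,
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}
```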

CUDA/HIP Mixed Backend Design

PFlash and DFlash already have flexible split flows. This PR makes those flows explicit backend-placement points across separate CUDA/HIP builds instead of tying the harness to one GPU vendor or one fixed topology.

The mixed-backend boundary stays at phase/model/process level:

  • PFlash phase split can place the PFlash drafter and target generation on different backends, or keep both on one backend.
  • DFlash draft split can place the draft model in a separate backend process and feed a target process through host IPC.
  • Target execution can remain single-device or use same-backend target layer split.

These split boundaries never move GPU activations directly across vendors: backend placement is selected through build-time backend binaries plus runtime harness placement, while cross-vendor target layer execution stays outside the design.

For PFlash, the boundary is host-side token/text data:

  1. The PFlash daemon returns compressed drafter-token IDs.
  2. The harness decodes those IDs to text.
  3. The target tokenizer re-encodes that text for target generation.
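
In harness terms, the round-trip is roughly the following (a sketch of the data flow only; both tokenizer helpers are hypothetical placeholders):

```cpp
// Sketch of the host-side PFlash boundary: only token IDs and text cross
// between the two backend builds, never GPU activations.
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical placeholders for the drafter/target tokenizer calls.
std::string          drafter_detokenize(const std::vector<int32_t> & ids);
std::vector<int32_t> target_tokenize(const std::string & text);

std::vector<int32_t> pflash_handoff(const std::vector<int32_t> & compressed_ids) {
    // 1. The PFlash daemon has already returned compressed drafter-token IDs.
    // 2. Decode those IDs to plain text on the host.
    const std::string text = drafter_detokenize(compressed_ids);
    // 3. Re-encode with the target tokenizer, so the target build never has
    //    to share a vocabulary (or a GPU backend) with the drafter build.
    return target_tokenize(text);
}
```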

For DFlash, the boundary is a separate draft process:

  1. The target split path captures target-side feature slices.
  2. The feature/noise inputs are passed through host IPC to the draft daemon.
  3. The draft daemon returns hidden states to the target process for projection and verification.

This keeps mixed backend support compatible with single-device target runs and existing same-backend target layer split behavior while avoiding cross-vendor target layer execution.
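
A rough sketch of the host-IPC shape this implies (the struct layout and field names are assumptions for illustration; the real protocol sits behind the --draft-ipc-* flags listed above):

```cpp
// Hypothetical framing for the DFlash draft-split host IPC: feature/noise
// slices flow target -> draft daemon, hidden states flow back. Everything
// lives in host memory, so each process can use a different GPU backend.
#include <cstdint>
#include <vector>

struct draft_ipc_request {            // target process -> draft daemon
    uint32_t           seq_id;        // in-flight sequence this slice belongs to
    uint32_t           n_tokens;      // tokens covered by the feature slice
    std::vector<float> features;      // captured target-side feature slice
    std::vector<float> noise;         // noise inputs for the draft model
};

struct draft_ipc_response {           // draft daemon -> target process
    uint32_t           seq_id;
    std::vector<float> hidden;        // draft hidden states; the target side
                                      // applies projection and verification
};

// Requests and responses would be queued through a fixed-capacity ring
// (cf. --draft-ipc-ring-cap) in the --draft-ipc-work-dir.
```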

Validation

Validation used mixed CUDA/HIP hardware to cover the cross-backend case. CUDA-only, HIP-only, single-device, and same-backend multi-device placements use the same CMake backend selector and harness plumbing, but the table below focuses on the cases that prove both backend directions:

  • CUDA side: RTX 2080 Ti (sm_75), NVIDIA driver 595.58.03, CUDA toolkit 12.0.140 (nvidia-smi reports CUDA driver capability 13.2).
  • HIP side: AMD Radeon Pro VII (gfx906), ROCm 7.2.1, HIP 7.2.53211.
| Path | Direction | Draft/PFlash side | Target side | Model | Test | Result |
| --- | --- | --- | --- | --- | --- | --- |
| PFlash phase split | HIP -> CUDA | HIP on 1x Pro VII (gfx906) | CUDA target split on 2x RTX 2080 Ti (sm_75) | Qwen3.6-27B-Q8_0 | NIAH 4K / 8K / 16K | Passed; key and answer retained; kept-token ratio 4.4-4.8% |
| PFlash phase split | CUDA -> HIP | CUDA on 1x RTX 2080 Ti (sm_75) | HIP target split on 2x Pro VII (gfx906) | Qwen3.6-27B-Q8_0 | NIAH 4K / 8K / 16K | Passed; key and answer retained; kept-token ratio 4.4-4.8% |
| DFlash draft split | HIP -> CUDA | HIP draft IPC daemon on 1x Pro VII (gfx906) | CUDA target split on 2x RTX 2080 Ti (sm_75) | Qwen3.6-27B-Q8_0 | HumanEval, 10 prompts, n_gen=256 | Passed; mean AL 8.88, accept 50.8%, decode 35.68 tok/s |
| DFlash draft split | CUDA -> HIP | CUDA draft IPC daemon on 1x RTX 2080 Ti (sm_75) | HIP target split on 2x Pro VII (gfx906) | Qwen3.6-27B-Q8_0 | HumanEval, 10 prompts, n_gen=256 | Passed; mean AL 8.69, accept 49.6%, decode 20.86 tok/s |

Additional checks:

  • CUDA build passed for the mixed-backend branch.
  • HIP build passed for the mixed-backend branch.
  • phase_split_dual_gpu.py was exercised in both backend directions.
  • bench_he.py was exercised with mixed-backend DFlash draft IPC in both backend directions.

@cubic-dev-ai (bot) left a comment

2 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/phase_split_dual_gpu.py">

<violation number="1" location="dflash/scripts/phase_split_dual_gpu.py:441">
P2: Failed target runs can crash the harness during output parsing before per-case failure aggregation runs.</violation>

<violation number="2" location="dflash/scripts/phase_split_dual_gpu.py:609">
P2: Legacy `resource_summary` is keyed by the logical GPU index, so visible-device remapping can make it fall back to zero samples even though monitoring data exists under the physical device key.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread: dflash/scripts/phase_split_dual_gpu.py (outdated)
#endif
}

static int run_dflash_draft_ipc_daemon(const char * draft_path,
Contributor:

Can we abstract an interface for the draft side? Then we can have in-context (same GPU), two GPUs, and two GPUs with AMD/NVIDIA, with the three implementations in three files sharing some common functions.

This file is too large to maintain now.
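
A minimal sketch (names illustrative, not from this PR) of the kind of interface this suggests:

```cpp
// Illustrative only: one draft_backend interface, three implementations,
// each in its own file with shared helpers factored out.
#include <vector>

struct draft_backend {
    virtual ~draft_backend() = default;
    // Feed captured target-side features/noise; get draft hidden states back.
    virtual std::vector<float> draft_step(const std::vector<float> & features,
                                          const std::vector<float> & noise) = 0;
};

// draft_in_context.cpp : same GPU as the target, no host copies
// draft_dual_gpu.cpp   : second device, same backend
// draft_ipc.cpp        : separate process over host IPC (CUDA/HIP mixed)
```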

Contributor (Author):
For this PR, I kept the change focused on validating the CUDA/HIP mixed backend path at the existing bench harness boundary. Pulling the draft interface refactor into the same patch would be a larger cleanup and would enlarge the validation surface. But yes, it is a good follow-up direction.

Contributor:

Suggest doing the refactoring first, then adding HIP mixed backend support.

@@ -0,0 +1,39 @@
#pragma once
Contributor:

Why not put this under hip_compat?

Contributor (Author):

I kept it outside hip_compat because it is not a HIP-only shim. The hip_compat directory is currently used for vendored ggml HIP header compatibility, such as cuda_fp16.h. This header is our own runtime compatibility layer for the harness: CUDA builds include cuda_runtime.h through it, while HIP builds map the existing cuda* runtime spellings to hip*.
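
For illustration, the pattern is roughly this (a minimal sketch; the guard macro name and the exact set of remapped calls are assumptions, not the PR's actual header):

```cpp
// Minimal sketch of the runtime-compat header described above. CUDA builds
// include cuda_runtime.h; HIP builds remap the cuda* spellings the harness
// already uses onto their hip* equivalents.
#pragma once

#if defined(GGML_USE_HIP) // assumed guard; the real header may key off the build option
    #include <hip/hip_runtime.h>
    #define cudaError_t           hipError_t
    #define cudaSuccess           hipSuccess
    #define cudaGetErrorString    hipGetErrorString
    #define cudaSetDevice         hipSetDevice
    #define cudaDeviceSynchronize hipDeviceSynchronize
    #define cudaMemcpy            hipMemcpy
    #define cudaMemcpyDefault     hipMemcpyDefault
#else
    #include <cuda_runtime.h>
#endif
```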

@weicj force-pushed the bench-cuda-hip-mixed-backend-placement branch from c024951 to c5ffdb3 (May 7, 2026 13:02)
Comment thread: dflash/docs/SPEC_PREFILL.md (outdated)
See `src/flashprefill.h` for the full list and defaults.

## Dual-GPU PFlash phase split
## Hybrid PFlash phase split
Contributor:

Why override it? Overall, though, this file is not well structured; maybe you can replace it with a good user instruction.

Contributor (Author):

Agreed. Let me move the hybrid CUDA/HIP placement instructions into a dedicated docs/MIXED_BACKEND.md and leave only a short pointer in SPEC_PREFILL.md. That should make it easier to check and follow. :)

int batch, int seq_len, int n_q_heads, int n_k_heads, int head_dim,
float scale,
int elem_size,
ggml_type qkv_type,
Contributor:

good approach!!
