bench(dflash,pflash): add CUDA/HIP mixed backend placement #122

Open

weicj wants to merge 2 commits into Luce-Org:main from weicj:bench-cuda-hip-mixed-backend-placement
Conversation

@weicj (Contributor) commented May 7, 2026

bench(dflash,pflash): add CUDA/HIP mixed backend placement

Summary

Add CUDA/HIP mixed backend placement support for PFlash and DFlash bench harness execution.

This PR makes the PFlash/DFlash harness split paths backend-portable instead of CUDA-only or multi-GPU-specific. The same implementation can be built as separate CUDA or HIP binaries, which the harness then combines to cover three placement modes:

  • single backend / single device, where the split path still runs inside one selected GPU backend;
  • single backend / multiple devices, where target execution may use the existing target layer split;
  • CUDA/HIP mixed backend, where PFlash phase split or DFlash draft split crosses a host-data/process boundary.

This PR keeps mixed placement in the bench/runtime harnesses first, so the multi-device and multi-backend behavior can be validated directly before the OpenAI-compatible server path is integrated in a follow-up PR.

Changes

  • Add DFLASH27B_GPU_BACKEND=cuda|hip to the DFlash CMake build.
  • Link DFlash runtime/test targets against the selected ggml backend (ggml-cuda or ggml-hip).
  • Add a small HIP compatibility shim for the current vendored ggml HIP snapshot.
  • Route non-CUDA-WMMA PFlash execution through ggml flash_attn_ext.
  • Pass the real Q/K/V ggml_type into the portable PFlash path; F16 and BF16 share a 2-byte element size, so the type cannot be inferred from byte size (see the sketch after this list).
  • Make draft projection weight type selection backend-aware instead of gating it on a CUDA-only SM macro.
  • Extend phase_split_dual_gpu.py with:
    • --pflash-backend cuda|hip
    • --pflash-visible-devices
    • --target-backend cuda|hip
    • --target-visible-devices
    • optional target generation after PFlash compression
    • separate CUDA/HIP GPU resource monitoring
  • Add DFlash draft IPC support to test_dflash:
    • --draft-ipc-daemon
    • --draft-ipc-bin
    • --draft-ipc-gpu
    • --draft-ipc-work-dir
    • --draft-ipc-ring-cap
  • Extend bench_he.py so HumanEval-style DFlash runs can pass through the draft IPC binary/options used for mixed-backend draft split validation.
  • Document CUDA/HIP mixed backend placement in dflash/docs/MIXED_BACKEND.md and keep SPEC_PREFILL.md focused on the spec-prefill daemon reference.
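
As a rough illustration of the two PFlash items above, the portable path might look like this. The helper name and parameter list are hypothetical, and it assumes the vendored ggml snapshot exposes the usual `ggml_flash_attn_ext(ctx, q, k, v, mask, scale, max_bias, logit_softcap)` signature:

```cpp
// Hypothetical sketch of the portable PFlash attention path; not the PR's
// exact code. Q/K/V are laid out as ggml expects for flash_attn_ext.
#include "ggml.h"

static struct ggml_tensor * pflash_attn_portable(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,
        struct ggml_tensor  * k,
        struct ggml_tensor  * v,
        float                 scale,
        enum   ggml_type      qkv_type) {
    // The caller passes the real Q/K/V ggml_type: F16 and BF16 are both
    // 2 bytes per element, so elem_size alone cannot tell them apart.
    GGML_ASSERT(q->type == qkv_type && k->type == qkv_type && v->type == qkv_type);

    // The selected backend (ggml-cuda or ggml-hip) supplies the kernel, so
    // the same call covers the non-CUDA-WMMA paths on both vendors.
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask=*/NULL, scale,
                               /*max_bias=*/0.0f, /*logit_softcap=*/0.0f);
}
```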

CUDA/HIP Mixed Backend Design

PFlash and DFlash already have flexible split flows. This PR makes those flows explicit backend-placement points across separate CUDA/HIP builds instead of tying the harness to one GPU vendor or one fixed topology.

The mixed-backend boundary stays at phase/model/process level:

  • PFlash phase split can place the PFlash drafter and target generation on different backends, or keep both on one backend.
  • DFlash draft split can place the draft model in a separate backend process and feed a target process through host IPC.
  • Target execution can remain single-device or use same-backend target layer split.

These split boundaries never move GPU activations directly across vendors: backend placement is selected through build-time backend binaries plus runtime harness placement, while cross-vendor target layer execution stays outside the design.

For PFlash, the boundary is host-side token/text data:

  1. The PFlash daemon returns compressed drafter-token IDs.
  2. The harness decodes those IDs to text.
  3. The target tokenizer re-encodes that text for target generation.
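
In harness terms, the round-trip is roughly the following (a sketch of the data flow only; both tokenizer helpers are hypothetical placeholders):

```cpp
// Sketch of the host-side PFlash boundary: only token IDs and text cross
// between the two backend builds, never GPU activations.
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical placeholders for the drafter/target tokenizer calls.
std::string          drafter_detokenize(const std::vector<int32_t> & ids);
std::vector<int32_t> target_tokenize(const std::string & text);

std::vector<int32_t> pflash_handoff(const std::vector<int32_t> & compressed_ids) {
    // 1. The PFlash daemon has already returned compressed drafter-token IDs.
    // 2. Decode those IDs to plain text on the host.
    const std::string text = drafter_detokenize(compressed_ids);
    // 3. Re-encode with the target tokenizer, so the target build never has
    //    to share a vocabulary (or a GPU backend) with the drafter build.
    return target_tokenize(text);
}
```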

For DFlash, the boundary is a separate draft process:

  1. The target split path captures target-side feature slices.
  2. The feature/noise inputs are passed through host IPC to the draft daemon.
  3. The draft daemon returns hidden states to the target process for projection and verification.

This keeps mixed backend support compatible with single-device target runs and existing same-backend target layer split behavior while avoiding cross-vendor target layer execution.
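
A rough sketch of the host-IPC shape this implies (the struct layout and field names are assumptions for illustration; the real protocol sits behind the --draft-ipc-* flags listed above):

```cpp
// Hypothetical framing for the DFlash draft-split host IPC: feature/noise
// slices flow target -> draft daemon, hidden states flow back. Everything
// lives in host memory, so each process can use a different GPU backend.
#include <cstdint>
#include <vector>

struct draft_ipc_request {            // target process -> draft daemon
    uint32_t           seq_id;        // in-flight sequence this slice belongs to
    uint32_t           n_tokens;      // tokens covered by the feature slice
    std::vector<float> features;      // captured target-side feature slice
    std::vector<float> noise;         // noise inputs for the draft model
};

struct draft_ipc_response {           // draft daemon -> target process
    uint32_t           seq_id;
    std::vector<float> hidden;        // draft hidden states; the target side
                                      // applies projection and verification
};

// Requests and responses would be queued through a fixed-capacity ring
// (cf. --draft-ipc-ring-cap) in the --draft-ipc-work-dir.
```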

Validation

Validation used mixed CUDA/HIP hardware to cover the cross-backend case. CUDA-only, HIP-only, single-device, and same-backend multi-device placements use the same CMake backend selector and harness plumbing, but the table below focuses on the cases that prove both backend directions:

  • CUDA side: RTX 2080 Ti (sm_75), NVIDIA driver 595.58.03, CUDA toolkit 12.0.140 (nvidia-smi reports CUDA driver capability 13.2).
  • HIP side: AMD Radeon Pro VII (gfx906), ROCm 7.2.1, HIP 7.2.53211.
| Path | Direction | Draft/PFlash side | Target side | Model | Test | Result |
| --- | --- | --- | --- | --- | --- | --- |
| PFlash phase split | HIP -> CUDA | HIP on 1x Pro VII (gfx906) | CUDA target split on 2x RTX 2080 Ti (sm_75) | Qwen3.6-27B-Q8_0 | NIAH 4K / 8K / 16K | Passed; key and answer retained; kept-token ratio 4.4-4.8% |
| PFlash phase split | CUDA -> HIP | CUDA on 1x RTX 2080 Ti (sm_75) | HIP target split on 2x Pro VII (gfx906) | Qwen3.6-27B-Q8_0 | NIAH 4K / 8K / 16K | Passed; key and answer retained; kept-token ratio 4.4-4.8% |
| DFlash draft split | HIP -> CUDA | HIP draft IPC daemon on 1x Pro VII (gfx906) | CUDA target split on 2x RTX 2080 Ti (sm_75) | Qwen3.6-27B-Q8_0 | HumanEval, 10 prompts, n_gen=256 | Passed; mean AL 8.88, accept 50.8%, decode 35.68 tok/s |
| DFlash draft split | CUDA -> HIP | CUDA draft IPC daemon on 1x RTX 2080 Ti (sm_75) | HIP target split on 2x Pro VII (gfx906) | Qwen3.6-27B-Q8_0 | HumanEval, 10 prompts, n_gen=256 | Passed; mean AL 8.69, accept 49.6%, decode 20.86 tok/s |

Additional checks:

  • CUDA build passed for the mixed-backend branch.
  • HIP build passed for the mixed-backend branch.
  • phase_split_dual_gpu.py was exercised in both backend directions.
  • bench_he.py was exercised with mixed-backend DFlash draft IPC in both backend directions.

@cubic-dev-ai (bot) left a comment

2 issues found across 12 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/phase_split_dual_gpu.py">

<violation number="1" location="dflash/scripts/phase_split_dual_gpu.py:441">
P2: Failed target runs can crash the harness during output parsing before per-case failure aggregation runs.</violation>

<violation number="2" location="dflash/scripts/phase_split_dual_gpu.py:609">
P2: Legacy `resource_summary` is keyed by the logical GPU index, so visible-device remapping can make it fall back to zero samples even though monitoring data exists under the physical device key.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread: dflash/scripts/phase_split_dual_gpu.py (outdated)
#endif
}

static int run_dflash_draft_ipc_daemon(const char * draft_path,
Contributor:

Can we abstract an interface for the draft side? Then we can have in-context (same GPU), two GPUs, and two GPUs with AMD/NVIDIA, with the three implementations in three files sharing some common functions.

This file is too large to maintain now.
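
A minimal sketch (names illustrative, not from this PR) of the kind of interface this suggests:

```cpp
// Illustrative only: one draft_backend interface, three implementations,
// each in its own file with shared helpers factored out.
#include <vector>

struct draft_backend {
    virtual ~draft_backend() = default;
    // Feed captured target-side features/noise; get draft hidden states back.
    virtual std::vector<float> draft_step(const std::vector<float> & features,
                                          const std::vector<float> & noise) = 0;
};

// draft_in_context.cpp : same GPU as the target, no host copies
// draft_dual_gpu.cpp   : second device, same backend
// draft_ipc.cpp        : separate process over host IPC (CUDA/HIP mixed)
```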

Contributor (Author):
For this PR, I kept the change focused on validating the CUDA/HIP mixed backend path at the existing bench harness boundary. Pulling the draft interface refactor into the same patch would be a larger cleanup and would enlarge the validation surface. But yes, it is a good follow-up direction.

Contributor:

Suggest doing the refactoring first, then adding HIP mixed backend support.

@@ -0,0 +1,39 @@
#pragma once
Contributor:

Why not put this under hip_compat?

Contributor (Author):

I kept it outside hip_compat because it is not a HIP-only shim. The hip_compat directory is currently used for vendored ggml HIP header compatibility, such as cuda_fp16.h. This header is our own runtime compatibility layer for the harness: CUDA builds include cuda_runtime.h through it, while HIP builds map the existing cuda* runtime spellings to hip*.
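
For illustration, the pattern is roughly this (a minimal sketch; the guard macro name and the exact set of remapped calls are assumptions, not the PR's actual header):

```cpp
// Minimal sketch of the runtime-compat header described above. CUDA builds
// include cuda_runtime.h; HIP builds remap the cuda* spellings the harness
// already uses onto their hip* equivalents.
#pragma once

#if defined(GGML_USE_HIP) // assumed guard; the real header may key off the build option
    #include <hip/hip_runtime.h>
    #define cudaError_t           hipError_t
    #define cudaSuccess           hipSuccess
    #define cudaGetErrorString    hipGetErrorString
    #define cudaSetDevice         hipSetDevice
    #define cudaDeviceSynchronize hipDeviceSynchronize
    #define cudaMemcpy            hipMemcpy
    #define cudaMemcpyDefault     hipMemcpyDefault
#else
    #include <cuda_runtime.h>
#endif
```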

@weicj force-pushed the bench-cuda-hip-mixed-backend-placement branch from c024951 to c5ffdb3 (May 7, 2026 13:02)
Comment thread: dflash/docs/SPEC_PREFILL.md (outdated)
See `src/flashprefill.h` for the full list and defaults.

## Dual-GPU PFlash phase split
## Hybrid PFlash phase split
Contributor:

Why override it? Overall, though, this file is not well structured; maybe you can replace it with a good user instruction.

Contributor (Author):

Agreed. Let me move the hybrid CUDA/HIP placement instructions into a dedicated docs/MIXED_BACKEND.md and leave only a short pointer in SPEC_PREFILL.md. That should make it easier to check and follow. :)

int batch, int seq_len, int n_q_heads, int n_k_heads, int head_dim,
float scale,
int elem_size,
ggml_type qkv_type,
Contributor:

good approach!!
