bench(dflash,pflash): add CUDA/HIP mixed backend placement #122
weicj wants to merge 2 commits into Luce-Org:main from
Conversation
2 issues found across 12 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/phase_split_dual_gpu.py">
<violation number="1" location="dflash/scripts/phase_split_dual_gpu.py:441">
P2: Failed target runs can crash the harness during output parsing before per-case failure aggregation runs.</violation>
<violation number="2" location="dflash/scripts/phase_split_dual_gpu.py:609">
P2: Legacy `resource_summary` is keyed by the logical GPU index, so visible-device remapping can make it fall back to zero samples even though monitoring data exists under the physical device key.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
#endif
}

static int run_dflash_draft_ipc_daemon(const char * draft_path,
Can we abstract an interface for the draft, so we can have in-context (same GPU), two GPUs, and two GPUs with AMD/NVIDIA? Put the three different implementations into three files with some common/shared functions (see the illustrative sketch below).
This file is too large to maintain now.
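Purely as an illustration (the interface name, signature, and file names below are hypothetical, not existing code):

```cpp
// Hypothetical shape of the suggested split: one small interface, three
// implementations in separate translation units, shared helpers factored out.
#include <cstdint>

struct dflash_draft_runner {
    virtual ~dflash_draft_runner() = default;
    // Produce up to n_draft candidate tokens for the current target context;
    // placeholder signature, not the existing harness API.
    virtual int draft_tokens(const int32_t * ctx_tokens, int n_ctx,
                             int32_t * out_tokens, int n_draft) = 0;
};

// draft_in_context.cpp    : draft runs in-process on the same GPU as the target
// draft_dual_gpu.cpp      : draft runs on a second GPU with the same backend
// draft_mixed_backend.cpp : draft runs in a separate CUDA/HIP process over IPC
```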
For this PR, I kept the change focused on validating the CUDA/HIP mixed backend path at the existing bench harness boundary. Pulling the draft interface refactor into the same patch would be a larger cleanup and would make the validation surface larger. But yes it is a good follow-up direction.
I suggest doing the refactoring first, then adding the HIP mixed backend support.
@@ -0,0 +1,39 @@
#pragma once
why not put under hip_compat?
I kept it outside hip_compat because it is not a HIP-only shim. The hip_compat directory is currently used for vendored ggml HIP header compatibility, such as cuda_fp16.h. This header is our own runtime compatibility layer for the harness: CUDA builds include cuda_runtime.h through it, while HIP builds map the existing cuda* runtime spellings to hip*.
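Roughly, the shape is the usual spelling-mapping pattern; the sketch below is illustrative only (the guard macro and the exact list of mapped symbols are placeholders, not the actual header contents):

```cpp
// Illustrative only: one include point that lets the harness keep cuda*
// spellings. CUDA builds pull in the real runtime header; HIP builds map the
// same spellings onto the hip* runtime. The guard macro is a placeholder.
#pragma once

#if defined(HARNESS_USE_HIP)               // placeholder build flag
    #include <hip/hip_runtime.h>
    #define cudaError_t              hipError_t
    #define cudaSuccess              hipSuccess
    #define cudaGetErrorString       hipGetErrorString
    #define cudaSetDevice            hipSetDevice
    #define cudaDeviceSynchronize    hipDeviceSynchronize
    #define cudaMalloc               hipMalloc
    #define cudaFree                 hipFree
    #define cudaMemcpy               hipMemcpy
    #define cudaMemcpyHostToDevice   hipMemcpyHostToDevice
    #define cudaMemcpyDeviceToHost   hipMemcpyDeviceToHost
#else
    #include <cuda_runtime.h>
#endif
```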
See `src/flashprefill.h` for the full list and defaults.

- ## Dual-GPU PFlash phase split
+ ## Hybrid PFlash phase split
Why override the existing section? Overall this file is not well structured; maybe you can replace it with a good set of user instructions.
Agreed. Let me move the hybrid CUDA/HIP placement instructions into a dedicated docs/MIXED_BACKEND.md and leave only a short pointer from SPEC_PREFILL.md. That should make it easier to check and follow. :)
int batch, int seq_len, int n_q_heads, int n_k_heads, int head_dim,
float scale,
int elem_size,
ggml_type qkv_type,
bench(dflash,pflash): add CUDA/HIP mixed backend placement
Summary
Add CUDA/HIP mixed backend placement support for PFlash and DFlash bench harness execution.
This PR makes the PFlash/DFlash harness split paths backend-portable instead of CUDA-only or multi-GPU-specific. The same implementation can be built as separate CUDA or HIP binaries, then combined by the harness when CUDA/HIP mixed backend placement is needed.
This PR keeps mixed placement in the bench/runtime harnesses first, so the multi-device and multi-backend behavior can be validated directly before the OpenAI-compatible server path is integrated in a follow-up PR.
Changes
- Add `DFLASH27B_GPU_BACKEND=cuda|hip` to the DFlash CMake build.
- The backend selector chooses which ggml backend library the build links against (`ggml-cuda` or `ggml-hip`).
- PFlash attention goes through `flash_attn_ext`.
- Pass `ggml_type` into the portable PFlash path so F16 and BF16 are not inferred from byte size (see the sketch after this list).
- Extend `phase_split_dual_gpu.py` with:
  - `--pflash-backend cuda|hip`
  - `--pflash-visible-devices`
  - `--target-backend cuda|hip`
  - `--target-visible-devices`
- Add draft IPC options to `test_dflash`:
  - `--draft-ipc-daemon`
  - `--draft-ipc-bin`
  - `--draft-ipc-gpu`
  - `--draft-ipc-work-dir`
  - `--draft-ipc-ring-cap`
- Extend `bench_he.py` so HumanEval-style DFlash runs can pass through the draft IPC binary/options used for mixed-backend draft split validation.
- Move the hybrid placement instructions into `dflash/docs/MIXED_BACKEND.md` and keep `SPEC_PREFILL.md` focused on the spec-prefill daemon reference.
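On the `ggml_type` plumbing above: byte size alone cannot distinguish F16 from BF16, which is why the portable path takes the enum explicitly. A minimal illustration (not the PR's code; it only uses the public `ggml_type_size` helper):

```cpp
// Minimal, self-contained illustration: F16 and BF16 report the same element
// size, so an elem_size parameter cannot pick between them on its own.
#include <cstdio>
#include "ggml.h"

int main() {
    // Both lines print 2 bytes; only the ggml_type enum disambiguates.
    printf("GGML_TYPE_F16  size: %zu\n", ggml_type_size(GGML_TYPE_F16));
    printf("GGML_TYPE_BF16 size: %zu\n", ggml_type_size(GGML_TYPE_BF16));
    return 0;
}
```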
CUDA/HIP Mixed Backend Design

PFlash and DFlash already have flexible split flows. This PR makes those flows explicit backend-placement points across separate CUDA/HIP builds instead of tying the harness to one GPU vendor or one fixed topology.
The mixed-backend boundary stays at the phase/model/process level.
These split boundaries do not require moving GPU activations directly across vendors, so backend placement can be selected through build-time backend binaries plus runtime harness placement, while cross-vendor target layer execution stays outside the design.
For PFlash, the boundary is host-side token/text data.
For DFlash, the boundary is a separate draft process.
This keeps mixed backend support compatible with single-device target runs and existing same-backend target layer split behavior while avoiding cross-vendor target layer execution.
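A hedged sketch of that process-level boundary (not the harness's actual code; the binary path and device strings are placeholders): the draft binary is built against one backend and launched as its own process, with device visibility scoped per process, so no activations cross vendors.

```cpp
// Illustrative only: launch a separately built draft binary with the chosen
// backend's visible-device variable set for that child process alone. The
// target process keeps its own backend and device set. Assumes a POSIX shell.
#include <cstdlib>
#include <string>

static int launch_draft_process(const std::string & draft_bin,       // placeholder path
                                const std::string & backend,         // "cuda" or "hip"
                                const std::string & visible_devices) // e.g. "0"
{
    const char * env_var = (backend == "hip") ? "HIP_VISIBLE_DEVICES"
                                              : "CUDA_VISIBLE_DEVICES";
    // Scope device visibility to the child process only, then start the draft
    // binary; the IPC protocol between draft and target is out of scope here.
    const std::string cmd = std::string(env_var) + "=" + visible_devices + " " + draft_bin;
    return std::system(cmd.c_str());
}

int main() {
    // Hypothetical placement: HIP draft on device 0, while the target process
    // (started elsewhere) stays on the CUDA device(s).
    return launch_draft_process("./dflash-draft-hip", "hip", "0");
}
```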
Validation
Validation used mixed CUDA/HIP hardware to cover the cross-backend case. CUDA-only, HIP-only, single-device, and same-backend multi-device placements use the same CMake backend selector and harness plumbing, but the table below focuses on the cases that prove both backend directions:
- NVIDIA GPU (`sm_75`), NVIDIA driver `595.58.03`, CUDA toolkit `12.0.140` (`nvidia-smi` reports CUDA driver capability `13.2`).
- AMD GPU (`gfx906`), ROCm `7.2.1`, HIP `7.2.53211`.

The placement table pairs the `gfx906` (HIP) and `sm_75` (CUDA) devices in both directions for the PFlash and DFlash cases.

Additional checks:
- `phase_split_dual_gpu.py` was exercised in both backend directions.
- `bench_he.py` was exercised with mixed-backend DFlash draft IPC in both backend directions.