Add HIP/ROCm support for Strix Halo (gfx1151) #119

smpurkis wants to merge 7 commits into Luce-Org:main from
Conversation
This branch should only add HIP/ROCm support for the dflash C++ exact path (drafter scoring + qwen35-0.8b dispatch). The Python approx path (PromptCompressor + llama.cpp CLI wrapper + RULER comparison harness) landed in 28fd493 as part of "strix halo rocm compatibility" but is unrelated to the HIP port and is being removed.

- git rm --cached the 8 approx-only files (kept on disk via .gitignore)
- revert pflash/__init__.py, pyproject.toml, README.md to main (all diffs in those three were approx-related)
- add pflash/.gitignore so the local files stay untracked
- keep tracked: dflash_client.py (drafter_arch + ROCm VRAM telemetry), platform.py, bench_niah_cpp.py (--drafter-arch flag), all dflash/src
The Strix Halo branch should minimize the diff against main: only HIP/ROCm code support, no doc churn or branch-marker prose.

- Untrack dflash/docs/STRIX_HALO_PFLASH.md via dflash/.gitignore (kept on disk; restore via `git show 388867a:dflash/docs/STRIX_HALO_PFLASH.md`)
- Revert dflash/README.md, README.md, pflash/README.md to origin/main (the additions were Strix Halo prose; not load-bearing for the build)
Strip extras that landed in 28fd493 / 4a93505 alongside the actual HIP/ROCm port. Keeps the same end-to-end smoke test passing (qwen3-0.6B drafter + 27B target on gfx1151, NIAH single-needle).

- pflash/pflash/platform.py removed; the only consumer was dflash_client's boot wait. Replaced with an inline _query_nvidia_vram_mib() that silently falls back to a 5s time-based check when nvidia-smi is absent (Strix Halo / any non-NVIDIA box).
- dflash/src/flashprefill.cpp: keep only the cuda_runtime.h -> device_runtime.h header swap. Drop the per-device cudaMalloc scratch cache (a CUDA-only perf optimization; the HIP path uses flash_prefill_forward_q8, so this code never runs on HIP).
- dflash/src/qwen3_0p6b_loader.cpp: drop the src-vs-dst type/byte defensive checks (unrelated to HIP).
- dflash/include/dflash27b.h: drop the QWEN35_9B macro alt-config block (macro-guarded, no internal callers).
- dflash/test/test_flashprefill_kernels.cpp: fully revert to origin/main. Gate the test build on NOT DFLASH27B_USE_HIP in CMakeLists; this test exercises BF16 WMMA kernels that the HIP path force-disables anyway.
- root .gitignore: drop the dflash/build-hip and pflash/models entries.

Net diff vs origin/main: 28 -> 20 files, +2376/-613 -> +1734/-587.
9 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/prefix_cache.cpp">
<violation number="1" location="dflash/src/prefix_cache.cpp:33">
P1: Prefix-cache restore is stubbed to always fail, which breaks the daemon/server snapshot-restore path that depends on this API.</violation>
</file>
<file name="pflash/.gitignore">
<violation number="1" location="pflash/.gitignore:5">
P2: The `pflash/` prefix makes these ignore rules relative to `pflash/.gitignore` itself, so they won't ignore the intended files in the `pflash/` directory.</violation>
</file>
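For illustration of this finding: git resolves `.gitignore` patterns relative to the directory containing the file, so entries inside `pflash/.gitignore` must not repeat the `pflash/` prefix. A hedged before/after sketch (file names are illustrative, not the actual tracked list):

```
# pflash/.gitignore
# Wrong: relative to this file, "pflash/models/" matches pflash/pflash/models/
pflash/models/

# Right: patterns here are already resolved relative to pflash/
models/
```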
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:96">
P2: Architecture override values are captured but not consumed; user-supplied CUDA/HIP arch settings can be ignored in favor of defaults.</violation>
</file>
<file name="dflash/src/qwen3_drafter.cpp">
<violation number="1" location="dflash/src/qwen3_drafter.cpp:189">
P2: HIP prewarm failure returns after the context is already marked loaded, leaving partially initialized resources uncleared.</violation>
<violation number="2" location="dflash/src/qwen3_drafter.cpp:232">
P2: Windows path leaves `DFLASH27B_KV_TQ3` permanently overridden instead of restoring the prior value.</violation>
<violation number="3" location="dflash/src/qwen3_drafter.cpp:615">
P2: Qwen3.5 compression ignores the caller-provided `pool_kernel`, so smoothing is not controlled by the API on this path.</violation>
</file>
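On the env-override finding, one portable shape for the fix is a scope guard that snapshots the prior value and restores it (or unsets it) on exit. This is a hedged sketch using POSIX `setenv`/`unsetenv`, not the project's actual code; per the follow-up commit, the Windows path would substitute `_dupenv_s`/`_putenv_s`:

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Sketch: override an environment variable for the current scope and
// restore the previous value (or unset it) in the destructor, so a
// temporary override like DFLASH27B_KV_TQ3 is never left behind.
class ScopedEnv {
public:
    ScopedEnv(const char* name, const char* value) : name_(name) {
        if (const char* old = std::getenv(name)) prev_ = old;
        setenv(name, value, /*overwrite=*/1);
    }
    ~ScopedEnv() {
        if (prev_) setenv(name_.c_str(), prev_->c_str(), 1);
        else unsetenv(name_.c_str());
    }
private:
    std::string name_;
    std::optional<std::string> prev_;
};
```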
<file name="dflash/src/qwen3_0p6b_graph.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:631">
P0: HIP Graph-B normalizes from `h_after` before `gf_proj_add` has produced the current chunk's value, so FFN inputs are stale/incorrect.</violation>
</file>
<file name="dflash/src/gguf_target_loader.cpp">
<violation number="1" location="dflash/src/gguf_target_loader.cpp:534">
P2: `out.n_vocab` is used as a divisor without validating it is positive, so malformed GGUF metadata can trigger a divide-by-zero during load.</violation>
</file>
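A minimal sketch of the kind of guard this finding asks for. The function name and shape are hypothetical; the real loader reads `n_vocab` from GGUF metadata before using it as a divisor:

```cpp
#include <cstdint>
#include <stdexcept>

// Validate metadata before dividing by it: malformed GGUF could yield
// n_vocab == 0 and a divide-by-zero during load.
inline int64_t rows_per_token(int64_t total_weights, int64_t n_vocab) {
    if (n_vocab <= 0)
        throw std::runtime_error("gguf: invalid n_vocab in metadata");
    return total_weights / n_vocab;
}
```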
<file name="dflash/src/qwen35_target_graph.cpp">
<violation number="1" location="dflash/src/qwen35_target_graph.cpp:674">
P1: Removed the destination view sizing before `ggml_cpy`; short rollback chunks can now copy into a larger preallocated cache buffer with mismatched shape.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM / select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON. Phase 1 (default) keeps the ggml q8 fallback unchanged. On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with sequence length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases, output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP in qwen3_0p6b_graph.cpp, so the new kernels were linked but never reached. The gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses the cubic-dev-ai review on PR Luce-Org#119:

- P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading h_after)
- P1: restore prefix-cache impls in qwen35_target_graph; drop the stub prefix_cache.cpp
- P1: cap->conv_input view sizing
- P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
- P2: n_vocab guard in gguf_target_loader
- P2: free_drafter on prewarm fail; portable env restore via lambda (Windows _dupenv_s/_putenv_s); thread pool_kernel through qwen35_score_and_compress
- P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@smpurkis thanks for the great contribution! We can't wait to integrate AMD support. Can you check and fix the P0 and P1 issues from cubic?
Thanks for the review! All P0–P2 items from cubic are addressed in commit f285bff.
Re-validated post-fix on gfx1151 / ROCm 7.2:
Nice work pushing the ROCm/HIP path forward. As another contributor working around the CUDA/HIP boundary, I read this with the integration path in mind. I am currently working on CUDA/HIP mixed-backend placement on the bench side, so there are a few places where this native HIP path and the mixed-backend path can be made compatible and eventually integrated cleanly. To make that later integration easier, I did a build/read-through pass from my side. In a local build-only check with ROCm 7.2.1, HIP Phase 1 configured and
I locally reproduced a HIP build failure here. With DFLASH27B_USE_HIP=ON, CMAKE_HIP_ARCHITECTURES=gfx906, and DFLASH27B_HIP_SM80_EQUIV=OFF, configure succeeds and test_dflash builds, but pflash_daemon fails at link time because this target still pulls ggml-cuda:
```
/usr/bin/ld: cannot find -lggml-cuda: No such file or directory
```
Could this link against the selected backend target, e.g. ${_dflash27b_ggml_backend_lib}, or be guarded as CUDA-only? spike_thin_copy appears to have the same direct ggml-cuda link.
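A sketch of the suggested guard, assuming a `_dflash27b_ggml_backend_lib` variable selected at configure time (the exact variable name and option names are taken from this thread and may differ from the actual CMakeLists):

```cmake
# Link whichever ggml backend was selected at configure time instead of
# hardcoding ggml-cuda, which fails to link on HIP-only builds.
if(DFLASH27B_USE_HIP)
  set(_dflash27b_ggml_backend_lib ggml-hip)
else()
  set(_dflash27b_ggml_backend_lib ggml-cuda)
endif()

target_link_libraries(pflash_daemon   PRIVATE ${_dflash27b_ggml_backend_lib})
target_link_libraries(spike_thin_copy PRIVATE ${_dflash27b_ggml_backend_lib})
```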
```cpp
// rope dimension_sections (array of 4 uint32)
int rope_sections[4] = {0, 0, 0, 0};
```
This still appears to weaken target-loader validation. Missing/short qwen35.rope.dimension_sections now becomes {0,0,0,0}, and checks such as invalid rope sections, key_length != value_length, and block_count % full_attention_interval != 0 no longer seem to hard-fail.
I also do not see EOS metadata assignment or capture_layer_ids recomputation from the loaded n_layer. For normal target generation, output.weight also still seems required when plan.load_output is true. Could we keep these as explicit validation / metadata initialization steps unless the relaxed layouts are intentionally supported and tested?
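To make the hard-fail style concrete, here is a hedged sketch of a rope-sections check; the exact constraints belong to the real loader, and the `head_dim` bound here is only illustrative:

```cpp
#include <stdexcept>

// Hard-fail on malformed rope metadata instead of silently defaulting
// to {0,0,0,0}: reject negative sections, an all-zero array, and
// sections that exceed the head dimension.
inline void validate_rope_sections(const int rope_sections[4], int head_dim) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        if (rope_sections[i] < 0)
            throw std::runtime_error("gguf: negative rope dimension section");
        sum += rope_sections[i];
    }
    if (sum == 0 || sum > head_dim)
        throw std::runtime_error("gguf: invalid qwen35.rope.dimension_sections");
}
```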
```cpp
if (kv_bytes > kv_buf_cap) {
    if (kv_buf_K) hipFree(kv_buf_K);
    if (kv_buf_V) hipFree(kv_buf_V);
    hipMalloc(&kv_buf_K, kv_bytes);
```
These hipMalloc calls are unchecked; if either allocation fails, kv_buf_cap is still updated and later kernels may run with null or stale buffers while this function returns 0.
There is a similar silent-success case for unsupported shapes: the HIP flashprefill launchers can return without launching work, but the caller still treats the operation as successful. Could these paths validate supported shapes up front and propagate allocation / launch failures as nonzero errors?
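The checked-allocation pattern being asked for can be sketched as below, with `malloc` standing in for `hipMalloc` so the sketch runs anywhere; the shape (commit capacity only after both allocations succeed, return nonzero on failure) is the point, not the exact names:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-in KV buffer state (the real code holds device pointers).
static void*  kv_buf_K   = nullptr;
static void*  kv_buf_V   = nullptr;
static size_t kv_buf_cap = 0;

// Grow the KV buffers to at least kv_bytes. Returns 0 on success,
// nonzero on allocation failure; kv_buf_cap is updated only after
// BOTH allocations succeed, so later kernels never see null or
// stale buffers behind a "success" return.
int ensure_kv_capacity(size_t kv_bytes) {
    if (kv_bytes <= kv_buf_cap) return 0;  // existing buffers suffice
    std::free(kv_buf_K);
    std::free(kv_buf_V);
    kv_buf_K = std::malloc(kv_bytes);
    kv_buf_V = std::malloc(kv_bytes);
    if (!kv_buf_K || !kv_buf_V) {          // allocation failed: reset state
        std::free(kv_buf_K);
        std::free(kv_buf_V);
        kv_buf_K = kv_buf_V = nullptr;
        kv_buf_cap = 0;
        return 1;                          // propagate as an error
    }
    kv_buf_cap = kv_bytes;                 // commit only on success
    return 0;
}
```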
Summary
HIP/ROCm compatibility for Strix Halo (AMD Ryzen AI MAX+ 395 / gfx1151) so dflash and pflash run on ROCm 7.2. Pure code support: no CUDA behavior changes.
- Branched from `main`, 6 commits. Build with `-DDFLASH27B_USE_HIP=ON`.
- `dflash/src/qwen3_0p6b_graph.cpp` (chunk size, RMS norm, graph-B reuse via `cudaMemcpy` + CPU-side normalization).
- `dflash/src/qwen35_target_graph.cpp` generalized to back the 0.8B drafter as well as the 27B target; hardcoded constants removed.
- `dflash/src/device_runtime.h` aliases CUDA→HIP types/symbols at the dflash layer.
- `.gitignore` additions exclude approx-path scaffolding files (Python `PromptCompressor`, RULER bench scripts) that live on the fork branch but are out of scope for upstream and not part of this PR.

Llama.cpp dependency
The submodule needs three HIP fixes that are not yet on `Luce-Org/llama.cpp-dflash-ggml:luce-dflash`. Companion PR: Luce-Org/llama.cpp-dflash-ggml#8 (cublas/cudaStream aliases + TQ3_0 FA guard). Until that lands, this branch's `.gitmodules` points at `smpurkis/llama.cpp:master`, which carries the fix commits on top of the upstream tip (706cd1f6b). After #8 is merged, `.gitmodules` can be repointed at `Luce-Org/llama.cpp-dflash-ggml:luce-dflash` in a one-line follow-up.

Build (Strix Halo / gfx1151 / ROCm 7.2)
Run
Verified
Test plan
`-DDFLASH27B_USE_HIP=ON` against the HIP-fixed submodule

Pre-existing CMake note (not in scope here)
Two `dflash/CMakeLists.txt` targets on `main` hardcode `ggml-cuda` and break a wide `cmake --build .` on HIP:

- `pflash_daemon` (de31881)
- `spike_thin_copy` (b833dce)

Not introduced by this PR; the canonical `--target ...` set above sidesteps both. Worth a follow-up cleanup PR to use `${_dflash27b_ggml_backend_lib}`.