Add HIP/ROCm support for Strix Halo (gfx1151) #119

smpurkis wants to merge 7 commits into Luce-Org:main from
Conversation
This branch should only add HIP/ROCm support for the dflash C++ exact path (drafter scoring + qwen35-0.8b dispatch). The Python approx path (PromptCompressor + llama.cpp CLI wrapper + RULER comparison harness) landed in 28fd493 as part of "strix halo rocm compatibility" but is unrelated to the HIP port and is being removed.

- git rm --cached the 8 approx-only files (kept on disk via .gitignore)
- revert pflash/__init__.py, pyproject.toml, README.md to main (all diffs in those three were approx-related)
- add pflash/.gitignore so the local files stay untracked
- keep tracked: dflash_client.py (drafter_arch + ROCm VRAM telemetry), platform.py, bench_niah_cpp.py (--drafter-arch flag), all dflash/src
The Strix Halo branch should minimize the diff against main: only HIP/ROCm code support, no doc churn or branch-marker prose.

- Untrack dflash/docs/STRIX_HALO_PFLASH.md via dflash/.gitignore (kept on disk; restore via `git show 388867a:dflash/docs/STRIX_HALO_PFLASH.md`)
- Revert dflash/README.md, README.md, pflash/README.md to origin/main (the additions were Strix Halo prose; not load-bearing for the build)
Strip extras that landed in 28fd493 / 4a93505 alongside the actual HIP/ROCm port. Keeps the same end-to-end smoke test passing (qwen3-0.6B drafter + 27B target on gfx1151, NIAH single-needle).

- pflash/pflash/platform.py removed; the only consumer was dflash_client's boot wait. Replaced with an inline _query_nvidia_vram_mib() that silently falls back to a 5s time-based check when nvidia-smi is absent (Strix Halo / any non-NVIDIA box).
- dflash/src/flashprefill.cpp: keep only the cuda_runtime.h -> device_runtime.h header swap. Drop the per-device cudaMalloc scratch cache (a CUDA-only perf optimization; the HIP path uses flash_prefill_forward_q8, so this code never runs on HIP).
- dflash/src/qwen3_0p6b_loader.cpp: drop the src-vs-dst type/byte defensive checks (unrelated to HIP).
- dflash/include/dflash27b.h: drop the QWEN35_9B macro alt-config block (macro-guarded, no internal callers).
- dflash/test/test_flashprefill_kernels.cpp: fully revert to origin/main. Gate the test build on NOT DFLASH27B_USE_HIP in CMakeLists; this test exercises BF16 WMMA kernels that the HIP path force-disables anyway.
- root .gitignore: drop the dflash/build-hip and pflash/models entries.

Net diff vs origin/main: 28 -> 20 files, +2376/-613 -> +1734/-587.
9 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/prefix_cache.cpp">
<violation number="1" location="dflash/src/prefix_cache.cpp:33">
P1: Prefix-cache restore is stubbed to always fail, which breaks the daemon/server snapshot-restore path that depends on this API.</violation>
</file>
<file name="pflash/.gitignore">
<violation number="1" location="pflash/.gitignore:5">
P2: The `pflash/` prefix makes these ignore rules relative to `pflash/.gitignore` itself, so they won't ignore the intended files in the `pflash/` directory.</violation>
</file>
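For illustration of this finding: git resolves `.gitignore` patterns relative to the directory containing the file, so entries inside `pflash/.gitignore` must not repeat the `pflash/` prefix. A hedged before/after sketch (file names are illustrative, not the actual tracked list):

```
# pflash/.gitignore
# Wrong: relative to this file, "pflash/models/" matches pflash/pflash/models/
pflash/models/

# Right: patterns here are already resolved relative to pflash/
models/
```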
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:96">
P2: Architecture override values are captured but not consumed; user-supplied CUDA/HIP arch settings can be ignored in favor of defaults.</violation>
</file>
<file name="dflash/src/qwen3_drafter.cpp">
<violation number="1" location="dflash/src/qwen3_drafter.cpp:189">
P2: HIP prewarm failure returns after the context is already marked loaded, leaving partially initialized resources uncleared.</violation>
<violation number="2" location="dflash/src/qwen3_drafter.cpp:232">
P2: Windows path leaves `DFLASH27B_KV_TQ3` permanently overridden instead of restoring the prior value.</violation>
<violation number="3" location="dflash/src/qwen3_drafter.cpp:615">
P2: Qwen3.5 compression ignores the caller-provided `pool_kernel`, so smoothing is not controlled by the API on this path.</violation>
</file>
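On the env-override finding, one portable shape for the fix is a scope guard that snapshots the prior value and restores it (or unsets it) on exit. This is a hedged sketch using POSIX `setenv`/`unsetenv`, not the project's actual code; per the follow-up commit, the Windows path would substitute `_dupenv_s`/`_putenv_s`:

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Sketch: override an environment variable for the current scope and
// restore the previous value (or unset it) in the destructor, so a
// temporary override like DFLASH27B_KV_TQ3 is never left behind.
class ScopedEnv {
public:
    ScopedEnv(const char* name, const char* value) : name_(name) {
        if (const char* old = std::getenv(name)) prev_ = old;
        setenv(name, value, /*overwrite=*/1);
    }
    ~ScopedEnv() {
        if (prev_) setenv(name_.c_str(), prev_->c_str(), 1);
        else unsetenv(name_.c_str());
    }
private:
    std::string name_;
    std::optional<std::string> prev_;
};
```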
<file name="dflash/src/qwen3_0p6b_graph.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:631">
P0: HIP Graph-B normalizes from `h_after` before `gf_proj_add` has produced the current chunk's value, so FFN inputs are stale/incorrect.</violation>
</file>
<file name="dflash/src/gguf_target_loader.cpp">
<violation number="1" location="dflash/src/gguf_target_loader.cpp:534">
P2: `out.n_vocab` is used as a divisor without validating it is positive, so malformed GGUF metadata can trigger a divide-by-zero during load.</violation>
</file>
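A minimal sketch of the kind of guard this finding asks for. The function name and shape are hypothetical; the real loader reads `n_vocab` from GGUF metadata before using it as a divisor:

```cpp
#include <cstdint>
#include <stdexcept>

// Validate metadata before dividing by it: malformed GGUF could yield
// n_vocab == 0 and a divide-by-zero during load.
inline int64_t rows_per_token(int64_t total_weights, int64_t n_vocab) {
    if (n_vocab <= 0)
        throw std::runtime_error("gguf: invalid n_vocab in metadata");
    return total_weights / n_vocab;
}
```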
<file name="dflash/src/qwen35_target_graph.cpp">
<violation number="1" location="dflash/src/qwen35_target_graph.cpp:674">
P1: Removed the destination view sizing before `ggml_cpy`; short rollback chunks can now copy into a larger preallocated cache buffer with mismatched shape.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM / select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON. Phase 1 (default) keeps the ggml q8 fallback unchanged. On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with sequence length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases, output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP in qwen3_0p6b_graph.cpp, so the new kernels were linked but never reached. The gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses the cubic-dev-ai review on PR Luce-Org#119:

- P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading h_after)
- P1: restore prefix-cache impls in qwen35_target_graph; drop the stub prefix_cache.cpp
- P1: cap->conv_input view sizing
- P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
- P2: n_vocab guard in gguf_target_loader
- P2: free_drafter on prewarm fail; portable env restore via lambda (Windows _dupenv_s/_putenv_s); thread pool_kernel through qwen35_score_and_compress
- P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@smpurkis thanks for the great contribution! We can't wait to integrate AMD support. Can you check and fix the P0 and P1 issues from cubic?
Thanks for the review! All P0–P2 items from cubic are addressed in commit f285bff.
Re-validated post-fix on gfx1151 / ROCm 7.2:
Nice work pushing the ROCm/HIP path forward. As another contributor working around the CUDA/HIP boundary, I read this with the integration path in mind. I am currently working on CUDA/HIP mixed-backend placement on the bench side, so there are a few places where this native HIP path and the mixed-backend path can be made compatible and eventually integrated cleanly. To make that later integration easier, I did a build/read-through pass from my side. In a local build-only check with ROCm 7.2.1, HIP Phase 1 configured and
I locally reproduced a HIP build failure here. With DFLASH27B_USE_HIP=ON, CMAKE_HIP_ARCHITECTURES=gfx906, and DFLASH27B_HIP_SM80_EQUIV=OFF, configure succeeds and test_dflash builds, but pflash_daemon fails at link time because this target still pulls ggml-cuda:
```
/usr/bin/ld: cannot find -lggml-cuda: No such file or directory
```
Could this link against the selected backend target, e.g. ${_dflash27b_ggml_backend_lib}, or be guarded as CUDA-only? spike_thin_copy appears to have the same direct ggml-cuda link.
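A sketch of the suggested guard, assuming a `_dflash27b_ggml_backend_lib` variable selected at configure time (the exact variable name and option names are taken from this thread and may differ from the actual CMakeLists):

```cmake
# Link whichever ggml backend was selected at configure time instead of
# hardcoding ggml-cuda, which fails to link on HIP-only builds.
if(DFLASH27B_USE_HIP)
  set(_dflash27b_ggml_backend_lib ggml-hip)
else()
  set(_dflash27b_ggml_backend_lib ggml-cuda)
endif()

target_link_libraries(pflash_daemon   PRIVATE ${_dflash27b_ggml_backend_lib})
target_link_libraries(spike_thin_copy PRIVATE ${_dflash27b_ggml_backend_lib})
```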
```cpp
// rope dimension_sections (array of 4 uint32)
int rope_sections[4] = {0, 0, 0, 0};
```
This still appears to weaken target-loader validation. Missing/short qwen35.rope.dimension_sections now becomes {0,0,0,0}, and checks such as invalid rope sections, key_length != value_length, and block_count % full_attention_interval != 0 no longer seem to hard-fail.
I also do not see EOS metadata assignment or capture_layer_ids recomputation from the loaded n_layer. For normal target generation, output.weight also still seems required when plan.load_output is true. Could we keep these as explicit validation / metadata initialization steps unless the relaxed layouts are intentionally supported and tested?
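To make the hard-fail style concrete, here is a hedged sketch of a rope-sections check; the exact constraints belong to the real loader, and the `head_dim` bound here is only illustrative:

```cpp
#include <stdexcept>

// Hard-fail on malformed rope metadata instead of silently defaulting
// to {0,0,0,0}: reject negative sections, an all-zero array, and
// sections that exceed the head dimension.
inline void validate_rope_sections(const int rope_sections[4], int head_dim) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        if (rope_sections[i] < 0)
            throw std::runtime_error("gguf: negative rope dimension section");
        sum += rope_sections[i];
    }
    if (sum == 0 || sum > head_dim)
        throw std::runtime_error("gguf: invalid qwen35.rope.dimension_sections");
}
```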
```cpp
if (kv_bytes > kv_buf_cap) {
    if (kv_buf_K) hipFree(kv_buf_K);
    if (kv_buf_V) hipFree(kv_buf_V);
    hipMalloc(&kv_buf_K, kv_bytes);
```
These hipMalloc calls are unchecked; if either allocation fails, kv_buf_cap is still updated and later kernels may run with null or stale buffers while this function returns 0.
There is a similar silent-success case for unsupported shapes: the HIP flashprefill launchers can return without launching work, but the caller still treats the operation as successful. Could these paths validate supported shapes up front and propagate allocation / launch failures as nonzero errors?
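The checked-allocation pattern being asked for can be sketched as below, with `malloc` standing in for `hipMalloc` so the sketch runs anywhere; the shape (commit capacity only after both allocations succeed, return nonzero on failure) is the point, not the exact names:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-in KV buffer state (the real code holds device pointers).
static void*  kv_buf_K   = nullptr;
static void*  kv_buf_V   = nullptr;
static size_t kv_buf_cap = 0;

// Grow the KV buffers to at least kv_bytes. Returns 0 on success,
// nonzero on allocation failure; kv_buf_cap is updated only after
// BOTH allocations succeed, so later kernels never see null or
// stale buffers behind a "success" return.
int ensure_kv_capacity(size_t kv_bytes) {
    if (kv_bytes <= kv_buf_cap) return 0;  // existing buffers suffice
    std::free(kv_buf_K);
    std::free(kv_buf_V);
    kv_buf_K = std::malloc(kv_bytes);
    kv_buf_V = std::malloc(kv_bytes);
    if (!kv_buf_K || !kv_buf_V) {          // allocation failed: reset state
        std::free(kv_buf_K);
        std::free(kv_buf_V);
        kv_buf_K = kv_buf_V = nullptr;
        kv_buf_cap = 0;
        return 1;                          // propagate as an error
    }
    kv_buf_cap = kv_bytes;                 // commit only on success
    return 0;
}
```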
Summary
HIP/ROCm compatibility for Strix Halo (AMD Ryzen AI MAX+ 395 / gfx1151) so dflash and pflash run on ROCm 7.2. Pure code support: no CUDA behavior changes.
- Branched from `main`, 6 commits. Build with `-DDFLASH27B_USE_HIP=ON`.
- `dflash/src/qwen3_0p6b_graph.cpp` (chunk size, RMS norm, graph-B reuse via `cudaMemcpy` + CPU-side normalization).
- `dflash/src/qwen35_target_graph.cpp` generalized to back the 0.8B drafter as well as the 27B target; hardcoded constants removed.
- `dflash/src/device_runtime.h` aliases CUDA→HIP types/symbols at the dflash layer.
- `.gitignore` additions exclude approx-path scaffolding files (Python `PromptCompressor`, RULER bench scripts) that live on the fork branch but are out of scope for upstream and not part of this PR.

Llama.cpp dependency
The submodule needs three HIP fixes that are not yet on `Luce-Org/llama.cpp-dflash-ggml:luce-dflash`. Companion PR: Luce-Org/llama.cpp-dflash-ggml#8 (cublas/cudaStream aliases + TQ3_0 FA guard). Until that lands, this branch's `.gitmodules` points at `smpurkis/llama.cpp:master`, which carries the fix commits on top of the upstream tip (706cd1f6b). After #8 is merged, `.gitmodules` can be repointed at `Luce-Org/llama.cpp-dflash-ggml:luce-dflash` in a one-line follow-up.

Build (Strix Halo / gfx1151 / ROCm 7.2)
Run
Verified
Test plan
`-DDFLASH27B_USE_HIP=ON` against the HIP-fixed submodule

Pre-existing CMake note (not in scope here)
Two `dflash/CMakeLists.txt` targets on `main` hardcode `ggml-cuda` and break a wide `cmake --build .` on HIP:

- `pflash_daemon` (de31881)
- `spike_thin_copy` (b833dce)

Not introduced by this PR; the canonical `--target ...` set above sidesteps both. Worth a follow-up cleanup PR to use `${_dflash27b_ggml_backend_lib}`.