Initial support for gfx1151/strix halo #117
Conversation
Phase 1-9 port of dflash/pflash from NVIDIA CUDA to AMD Strix Halo.

New files:
- dflash/src/flashprefill_kernels.hip.cu: rocWMMA sparse FA kernels (k1-k4 + transpose pre-pass)
- dflash/src/bsa_launcher_hip.cu: BSA launcher wrapping kernel 4, with K/V transpose opt
- dflash/src/f16_convert.hip.cu: BF16 conversion kernel
- dflash/hip_compat/: CUDA→HIP shim headers
- dflash/test/bench_score.cpp: multi-seq-length kernel benchmark

Modified:
- dflash/CMakeLists.txt: HIP build target, rocWMMA, GGML-HIP integration
- dflash/src/flashprefill.cpp: BSA runtime toggle (DFLASH_FP_USE_BSA)
- dflash/src/qwen3_0p6b_drafter.h: norm_scores out-param for hidden-state scoring
- dflash/src/qwen3_0p6b_graph.cpp: ||h||² norm scoring block, NoPE tail-score fix
- dflash/src/qwen3_drafter.cpp: DFLASH_FP_NORM_WEIGHT blend, chunk debug logging
- pflash/pflash/dflash_client.py: boot_vram_mib tuned for 14.91 GiB model
- pflash/tests/bench_niah_cpp.py: enable_thinking=False fix, corrected model paths

Performance (Strix Halo, gfx1151, 256 GB/s bandwidth):
- Decode: 11.7 AR tok/s → 29 tok/s with DFlash spec decode (2.5×)
- PFlash compress at 32K: ~14.5s → ~9.8s (Phase 9 K/V transpose fix)
- PFlash compress at 128K: ~135s, NIAH 2/2 at keep=0.05
- bench_score BSA+transpose speedup: 3.1× at 32K, 4.2× at 64K vs dense

Docs and scripts:
- docs/BENCHMARKS.md: Phase 1-9 timing results, NIAH accuracy, HumanEval numbers
- docs/PROGRESS.md: phase-by-phase implementation log
- docs/SERVING.md: OpenAI server setup, keep_ratio tuning, troubleshooting
- scripts/start_server.sh: one-command server launch for Qwen3.6-27B + PFlash
Adds ThinkSplitter, a stateful streaming parser that routes tokens into reasoning_content vs content as soon as the opening <think> tag is detected — no buffering of the full thinking block required. OpenAI path: streaming emits delta.reasoning_content / delta.content; non-streaming adds reasoning_content to message. Anthropic path: thinking becomes a type:thinking content block (index 0), answer is type:text (index 1), matching the Extended Thinking wire format. If no <think> tag is present everything flows as plain content unchanged. Flag is on by default in start_server.sh.
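For illustration only: the splitting logic described above is a small state machine. The actual ThinkSplitter in this PR lives in the Python server; the sketch below renders the same idea in C++ with invented names, just to make the routing behavior concrete.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Illustrative sketch only, not the PR's code: route a streamed chunk into
// (reasoning_delta, content_delta) around a leading <think>...</think> block.
class ThinkSplitterSketch {
public:
    std::pair<std::string, std::string> feed(const std::string & chunk) {
        buf_ += chunk;
        std::string reasoning, content;
        for (;;) {
            if (state_ == State::Before) {
                static const std::string open = "<think>";
                if (buf_.size() >= open.size()) {
                    if (buf_.compare(0, open.size(), open) == 0) {
                        buf_.erase(0, open.size());
                        state_ = State::Inside;
                    } else {
                        state_ = State::After;           // no <think>: plain content stream
                    }
                    continue;
                }
                if (open.compare(0, buf_.size(), buf_) == 0) break;  // might still be <think>
                state_ = State::After;
            } else if (state_ == State::Inside) {
                static const std::string close = "</think>";
                const auto pos = buf_.find(close);
                if (pos == std::string::npos) {
                    // Hold back a possible partial closing tag; emit the rest as reasoning.
                    const size_t keep = std::min(buf_.size(), close.size() - 1);
                    reasoning += buf_.substr(0, buf_.size() - keep);
                    buf_.erase(0, buf_.size() - keep);
                    break;
                }
                reasoning += buf_.substr(0, pos);
                buf_.erase(0, pos + close.size());
                state_ = State::After;
            } else {                                     // State::After: answer tokens
                content += buf_;
                buf_.clear();
                break;
            }
        }
        return {reasoning, content};
    }

private:
    enum class State { Before, Inside, After };
    State state_ = State::Before;
    std::string buf_;
};
```

Per chunk, the two deltas map straight onto delta.reasoning_content / delta.content (OpenAI path) or the type:thinking / type:text content blocks (Anthropic path).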
…er_tools.py Qwen3.6 with enable_thinking=True reasons freely inside <think> and then generates tool calls in arbitrary XML (e.g. <Bash>) rather than the <tool_call> format the chat template requires. Force enable_thinking=False whenever tools are in the request; callers can override via chat_template_kwargs if needed. Add DFLASH_DEBUG_PROMPT env var to log rendered prompts and received tool names to stderr for debugging. Switch start_server.sh from server.py to server_tools.py so tool calling is actually active. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
You beat me to it! I just made a similar PR, #119
Phase 9 (K/V transpose) measured on Strix Halo:
- 64K: TTFT 73s→43s (1.7×), NIAH now 2/2 ✅ (was 0/2)
- 128K: TTFT 173s→96s (1.8×), NIAH 2/2 ✅

Gap vs RTX 3090 cut from ~5–7× to ~3–4×, near the 256/936 GB/s bandwidth-ratio floor. Also adds an estimated llama.cpp baseline comparison (~7× speedup at 64K/128K).
I think these numbers are in line with the 3090, as the Strix Halo has about 3.7x slower bandwidth.
11 issues found across 28 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/start_server.sh">
<violation number="1" location="scripts/start_server.sh:17">
P2: Hardcoded absolute paths make this startup script fail on hosts that do not match the author's local directory layout.</violation>
</file>
<file name="dflash/hip_compat/cuda_bf16.h">
<violation number="1" location="dflash/hip_compat/cuda_bf16.h:64">
P2: Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</violation>
</file>
<file name="docs/PROGRESS.md">
<violation number="1" location="docs/PROGRESS.md:90">
P2: The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</violation>
</file>
<file name="dflash/scripts/server_tools.py">
<violation number="1" location="dflash/scripts/server_tools.py:478">
P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</violation>
</file>
<file name="dflash/src/qwen3_drafter.cpp">
<violation number="1" location="dflash/src/qwen3_drafter.cpp:106">
P2: Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</violation>
</file>
<file name="dflash/src/qwen3_0p6b_graph.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:170">
P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</violation>
<violation number="2" location="dflash/src/qwen3_0p6b_graph.cpp:443">
P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</violation>
</file>
<file name="dflash/src/flashprefill_kernels.hip.cu">
<violation number="1" location="dflash/src/flashprefill_kernels.hip.cu:73">
P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</violation>
<violation number="2" location="dflash/src/flashprefill_kernels.hip.cu:108">
P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</violation>
</file>
<file name="dflash/src/bsa_launcher_hip.cu">
<violation number="1" location="dflash/src/bsa_launcher_hip.cu:77">
P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</violation>
</file>
<file name="dflash/test/bench_score.cpp">
<violation number="1" location="dflash/test/bench_score.cpp:10">
P1: Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
-    if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
-        !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+    if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
+        !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||
P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 170:
<comment>Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</comment>
<file context>
@@ -163,10 +166,11 @@ bool forward_qwen3_0p6b_drafter(
- if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
- !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+ if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
+ !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||
+ !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
!make_pers(w.backend, GGML_TYPE_F32, 3, d_ql, Q_last_v[il])) {
</file context>
@@ -0,0 +1,704 @@
// ROCm/HIP port of flashprefill_kernels.cu — all 4 kernels.
P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 73:
<comment>Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</comment>
<file context>
@@ -0,0 +1,704 @@
+ const int n_k_blocks = (seq_len + block_size - 1) / block_size;
+ dim3 grid(n_k_blocks, batch * n_kv_heads, 1);
+ dim3 block(head_dim, 1, 1);
+ if (head_dim == 128 && block_size == 128)
+ compute_mean_vector_kernel_bf16<128, 128><<<grid, block, 0, stream>>>(
+ (const hip_bfloat16 *)K, (hip_bfloat16 *)mean_K,
</file context>
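If the diagnostic holds, one option is to make the wrapper report unsupported shapes and propagate a launch status instead of silently returning. The sketch below shows the pattern only; the wrapper name, argument list, and the stub kernel are assumptions, not the PR's actual API.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_bfloat16.h>
#include <cstdio>

// Stand-in for compute_mean_vector_kernel_bf16; the real kernel lives in the PR.
template <int HEAD_DIM, int BLOCK>
__global__ void mean_vector_stub(const hip_bfloat16 * K, hip_bfloat16 * mean_K, int seq_len) {
    // ... real reduction omitted ...
}

// Sketch: validate the shape, launch only when supported, and return a hipError_t so
// callers cannot silently receive untouched or uninitialized output buffers.
hipError_t launch_mean_vector(const void * K, void * mean_K,
                              int batch, int n_kv_heads, int seq_len,
                              int head_dim, int block_size, hipStream_t stream) {
    if (head_dim != 128 || block_size != 128) {
        std::fprintf(stderr, "flashprefill: unsupported shape head_dim=%d block_size=%d\n",
                     head_dim, block_size);
        return hipErrorInvalidValue;   // surface the failure instead of a silent no-op
    }
    const int n_k_blocks = (seq_len + block_size - 1) / block_size;
    dim3 grid(n_k_blocks, batch * n_kv_heads, 1);
    dim3 block(head_dim, 1, 1);
    mean_vector_stub<128, 128><<<grid, block, 0, stream>>>(
        (const hip_bfloat16 *)K, (hip_bfloat16 *)mean_K, seq_len);
    return hipGetLastError();          // also catches bad launch configurations
}
```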
const int kh = qh * n_k_heads / n_q_heads;
const int tid = threadIdx.x;
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;
P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 108:
<comment>Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</comment>
<file context>
@@ -0,0 +1,704 @@
+ const int kh = qh * n_k_heads / n_q_heads;
+ const int tid = threadIdx.x;
+ const int q_row_global = q_block_idx * BLOCK + tid;
+ if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;
+
+ const hip_bfloat16 * Qp = Q + (size_t)b * s_Q_b
</file context>
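If this is valid, the usual fix is to clamp the last partial block against the true number of Q rows rather than the block-rounded count. A minimal sketch; it assumes `seq_len` is (or would be added as) a kernel argument:

```cpp
// Sketch: bound tail rows by the real sequence length, not q_block_idx_max * BLOCK,
// so the last partial block never reads past the end of Q.
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= seq_len) return;
```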
@@ -0,0 +1,104 @@
// HIP Phase 3: BSA-compatible launcher wrapping our rocWMMA sparse FA kernel.
P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/bsa_launcher_hip.cu, line 77:
<comment>Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</comment>
<file context>
@@ -0,0 +1,104 @@
+ if (kv_bytes > kv_buf_cap) {
+ if (kv_buf_K) hipFree(kv_buf_K);
+ if (kv_buf_V) hipFree(kv_buf_V);
+ hipMalloc(&kv_buf_K, kv_bytes);
+ hipMalloc(&kv_buf_V, kv_bytes);
+ kv_buf_cap = kv_bytes;
</file context>
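One way to address this, sketched with the same variable names as the excerpt; the surrounding function is assumed to return a success flag to its caller:

```cpp
// Sketch: grow the persistent transpose buffers, but fail loudly if allocation fails
// instead of caching dangling pointers and a stale capacity.
if (kv_bytes > kv_buf_cap) {
    if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }
    if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
    kv_buf_cap = 0;
    if (hipMalloc(&kv_buf_K, kv_bytes) != hipSuccess ||
        hipMalloc(&kv_buf_V, kv_bytes) != hipSuccess) {
        if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }
        if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
        return false;               // propagate the OOM instead of reporting success
    }
    kv_buf_cap = kv_bytes;
}
```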
@@ -0,0 +1,58 @@
// Standalone timing of flash_prefill_forward_bf16 at multiple seq lengths.
P1: Local API re-declaration uses a different FlashPrefillConfig layout than the real header, so the benchmark passes an incompatible config object to flash_prefill_forward_bf16.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/bench_score.cpp, line 10:
<comment>Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</comment>
<file context>
@@ -0,0 +1,58 @@
+#include <hip/hip_runtime.h>
+#include <hip/hip_bfloat16.h>
+
+namespace dflash27b { namespace flashprefill {
+struct FlashPrefillConfig {
+ int block_size = 128;
</file context>
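If the layouts really do differ, the safest fix is for the benchmark to include the project's real header rather than re-declaring the struct. The header name below is an assumption; use whichever header actually declares flash_prefill_forward_bf16.

```cpp
// Sketch: pull the real declaration instead of duplicating it in bench_score.cpp.
#include "flashprefill.h"   // assumed header name

// With the duplicate FlashPrefillConfig removed, any field reordering in the real header
// is caught at compile time instead of silently corrupting the config passed to the kernel.
```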
    return r;
}
inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return __float2bfloat16(x);
P2: Host-only __float2bfloat16_rn incorrectly truncates instead of rounding to nearest-even.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/hip_compat/cuda_bf16.h, line 64:
<comment>Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</comment>
<file context>
@@ -0,0 +1,66 @@
+ return r;
+}
+inline hip_bfloat16 __float2bfloat16_rn(float x) {
+ return __float2bfloat16(x);
+}
+#endif
</file context>
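For reference, a host-side round-to-nearest-even conversion can be done on the raw bits. This is a generic sketch, not the PR's code; it assumes hip_bfloat16 exposes its 16-bit storage as a public `data` member, as in ROCm's hip_bfloat16.h.

```cpp
#include <hip/hip_bfloat16.h>
#include <cstdint>
#include <cstring>

// Sketch: float -> bf16 with round-to-nearest-even on the host.
inline hip_bfloat16 float2bfloat16_rn_host(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    hip_bfloat16 out;
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate but force a mantissa bit so the result stays a NaN.
        out.data = static_cast<uint16_t>((bits >> 16) | 0x0040u);
        return out;
    }
    // Round to nearest, ties to even: add 0x7fff plus the LSB of the part we keep.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    out.data = static_cast<uint16_t>(bits >> 16);
    return out;
}
```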
### Phase 4 build (complete — use this)
```bash
podman start vllm
ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
P2: The Phase 4 build command sets env vars before podman exec instead of passing them into the container, so bash -c/cmake will not reliably see ROCM_PATH, CC, or CXX.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/PROGRESS.md, line 90:
<comment>The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</comment>
<file context>
@@ -0,0 +1,153 @@
+### Phase 4 build (complete — use this)
+```bash
+podman start vllm
+ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+cmake -S . -B build-hip-phase2 -DDFLASH27B_USE_HIP=ON -DDFLASH27B_HIP_SM80_EQUIV=ON \
+ -DCMAKE_HIP_COMPILER=/opt/venv/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/bin/clang++
</file context>
Suggested change:
-ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+podman exec -e ROCM_PATH=/opt/rocm -e CC=gcc -e CXX=g++ -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
    # (e.g. <Bash>) instead of the <tool_call> format the template requires.
    # Force thinking off whenever tools are present unless the caller
    # explicitly opts back in via chat_template_kwargs.
    kwargs["enable_thinking"] = False
P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-<think> tool responses can be misclassified as reasoning and lose assistant content.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server_tools.py, line 478:
<comment>Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</comment>
<file context>
@@ -470,10 +470,18 @@ def _tokenize_prompt(req: ChatRequest) -> tuple[Path, bool]:
+ # (e.g. <Bash>) instead of the <tool_call> format the template requires.
+ # Force thinking off whenever tools are present unless the caller
+ # explicitly opts back in via chat_template_kwargs.
+ kwargs["enable_thinking"] = False
# Per-request chat template knobs (e.g. enable_thinking, preserve_thinking).
if req.chat_template_kwargs:
</file context>
     std::vector<float> running_max;
-    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+    std::vector<float> norm_scores;
+    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {
P2: Unconditionally computing norm_scores adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_drafter.cpp, line 106:
<comment>Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</comment>
<file context>
@@ -102,7 +102,8 @@ std::vector<int32_t> drafter_score_and_compress(
std::vector<float> running_max;
- if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+ std::vector<float> norm_scores;
+ if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {
return {};
}
</file context>
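If this is worth avoiding, the blend weight already comes from DFLASH_FP_NORM_WEIGHT, so the call site could request norm scores only when the weight is non-zero. Rough sketch under the excerpt's names; the optional out-parameter is a hypothetical change to forward_qwen3_0p6b_drafter, not its current signature:

```cpp
// Sketch: only ask the drafter graph for ||h||^2 scores when the blend is actually enabled,
// so the norm-score subgraph and its GPU->CPU copy are skipped otherwise.
const char * nw_env = std::getenv("DFLASH_FP_NORM_WEIGHT");
const float norm_weight = nw_env ? std::strtof(nw_env, nullptr) : 0.0f;
const bool  want_norm_scores = norm_weight > 0.0f;

std::vector<float> running_max;
std::vector<float> norm_scores;
if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max,
                                want_norm_scores ? &norm_scores : nullptr)) {
    return {};
}
```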
    // ggml_sum_rows reduces the hidden dimension → [1, S].
    // Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
    {
        ggml_init_params ip{};
P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 443:
<comment>Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</comment>
<file context>
@@ -426,6 +436,30 @@ bool forward_qwen3_0p6b_drafter(
+ // ggml_sum_rows reduces the hidden dimension → [1, S].
+ // Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
+ {
+ ggml_init_params ip{};
+ ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
+ ip.no_alloc = true;
</file context>
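If allocation failures should propagate, a minimal sketch of the guard follows. ggml_init returns nullptr on failure; the allocator check is indicated in a comment because the surrounding graph-building code isn't shown here.

```cpp
// Sketch: bail out of forward_qwen3_0p6b_drafter instead of feeding a null context
// into graph building when the small norm-score graph cannot be set up.
ggml_init_params ip{};
ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
ip.no_alloc = true;
ggml_context * nctx = ggml_init(ip);
if (!nctx) {
    return false;                       // OOM: surface it to the caller
}
// ... build the sum_rows graph in nctx, then check the allocator the same way:
// if (!ggml_gallocr_alloc_graph(galloc, gf)) { ggml_free(nctx); return false; }
```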
Timing is insane, nice to see the work you put in, looking forward to support :D
@a-huk thanks for the great contribution! Can you check Cubic P1 diagnostics and rebase to avoid conflict?
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM / select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON. Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases, output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP in qwen3_0p6b_graph.cpp, so the new kernels were linked but never reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
- P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading h_after)
- P1: restore prefix-cache impls in qwen35_target_graph; drop the stub prefix_cache.cpp
- P1: cap->conv_input view sizing
- P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
- P2: n_vocab guard in gguf_target_loader
- P2: free_drafter on prewarm fail; portable env restore via lambda (Windows _dupenv_s/_putenv_s); thread pool_kernel through qwen35_score_and_compress
- P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi guys,
I saw this about a week ago on /r/LocalLLaMA and found it super cool, really getting the most out of the hardware.
The reason I was interested is that I have a Strix Halo machine to run local LLMs, but prompt processing on long prompts really kills it, taking dozens of minutes to finish prefill. So I thought: could I port it over with Claude?
I am a software dev, but with no GPU background, so this was rather new territory; still, I think the result is something workable. I hope it can be useful, or at least speed up the actual port to Strix Halo.
I have tested with Unsloth's Qwen3.6-27B-UD-Q4_K_XL.gguf, z-lab/Qwen3.6-27B-DFlash and Qwen3-0.6B-BF16.gguf.
As I say, really vibecoded, but voila, let me attach the benchmarks that Claude did :D