
Initial support for gfx1151/strix halo #117

Open
a-huk wants to merge 5 commits into Luce-Org:main from a-huk:main

Conversation

@a-huk

@a-huk a-huk commented May 7, 2026

Hi guys,

I saw this about a week ago on /r/LocalLLaMA and found it super cool, really getting the most out of the hardware.

The reason I was interested is that I have a Strix Halo machine to run local LLMs, but prompt processing at long prompts really kills it, taking dozens of minutes to finish prefill. So I thought: could I port it over with Claude?

I am a software dev, but I have nothing to do with GPUs, so this was rather new territory for me, but I think the result is something workable. I hope that it can be useful or at least speed up the actual port to Strix Halo.

I have tested with Unsloth's Qwen3.6-27B-UD-Q4_K_XL.gguf, z-lab/Qwen3.6-27B-DFlash and Qwen3-0.6B-BF16.gguf.

As I said, it is really vibecoded, but voilà, let me attach the benchmarks that Claude did :D

a-huk and others added 4 commits May 7, 2026 10:27
Phase 1-9 port of dflash/pflash from NVIDIA CUDA to AMD Strix Halo.

New files:
- dflash/src/flashprefill_kernels.hip.cu   rocWMMA sparse FA kernels (k1-k4 + transpose pre-pass)
- dflash/src/bsa_launcher_hip.cu           BSA launcher wrapping kernel 4, with K/V transpose opt
- dflash/src/f16_convert.hip.cu            BF16 conversion kernel
- dflash/hip_compat/                       CUDA→HIP shim headers
- dflash/test/bench_score.cpp              Multi-seq-length kernel benchmark

Modified:
- dflash/CMakeLists.txt                    HIP build target, rocWMMA, GGML-HIP integration
- dflash/src/flashprefill.cpp              BSA runtime toggle (DFLASH_FP_USE_BSA)
- dflash/src/qwen3_0p6b_drafter.h         norm_scores out-param for hidden-state scoring
- dflash/src/qwen3_0p6b_graph.cpp         ||h||² norm scoring block, NoPE tail-score fix
- dflash/src/qwen3_drafter.cpp            DFLASH_FP_NORM_WEIGHT blend, chunk debug logging
- pflash/pflash/dflash_client.py           boot_vram_mib tuned for 14.91 GiB model
- pflash/tests/bench_niah_cpp.py           enable_thinking=False fix, corrected model paths

Performance (Strix Halo, gfx1151, 256 GB/s bandwidth):
- Decode: 11.7 AR tok/s → 29 tok/s with DFlash spec decode (2.5×)
- PFlash compress at 32K: ~14.5s → ~9.8s (Phase 9 K/V transpose fix)
- PFlash compress at 128K: ~135s, NIAH 2/2 at keep=0.05
- bench_score BSA+transpose speedup: 3.1× at 32K, 4.2× at 64K vs dense
- docs/BENCHMARKS.md  Phase 1-9 timing results, NIAH accuracy, HumanEval numbers
- docs/PROGRESS.md    Phase-by-phase implementation log
- docs/SERVING.md     OpenAI server setup, keep_ratio tuning, troubleshooting
- scripts/start_server.sh  One-command server launch for Qwen3.6-27B + PFlash
Adds ThinkSplitter, a stateful streaming parser that routes tokens into
reasoning_content vs content as soon as the opening <think> tag is
detected — no buffering of the full thinking block required.

OpenAI path: streaming emits delta.reasoning_content / delta.content;
non-streaming adds reasoning_content to message.
Anthropic path: thinking becomes a type:thinking content block (index 0),
answer is type:text (index 1), matching the Extended Thinking wire format.
If no <think> tag is present everything flows as plain content unchanged.

Flag is on by default in start_server.sh.
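
For illustration, the streaming split described above boils down to a small state machine. The real ThinkSplitter is Python server code; this is just a C++ sketch with hypothetical names, to make the no-buffering routing concrete:

```cpp
#include <algorithm>
#include <string>

// Per-chunk routing result: what goes to reasoning_content vs content this step.
struct ThinkSplit { std::string reasoning; std::string content; };

// Streams text into the two channels as soon as <think> / </think> are seen,
// keeping only a possible partial tag buffered between calls.
class ThinkSplitter {
    enum class State { Plain, Thinking } state_ = State::Plain;
    std::string pending_;   // partial "<think>" / "</think>" split across tokens
public:
    ThinkSplit feed(const std::string & token) {
        ThinkSplit out;
        pending_ += token;
        for (;;) {
            const std::string tag = (state_ == State::Plain) ? "<think>" : "</think>";
            const size_t pos = pending_.find(tag);
            if (pos == std::string::npos) break;
            (state_ == State::Plain ? out.content : out.reasoning) += pending_.substr(0, pos);
            pending_.erase(0, pos + tag.size());
            state_ = (state_ == State::Plain) ? State::Thinking : State::Plain;
        }
        // Flush everything except a suffix that could still grow into the next tag.
        const std::string tag = (state_ == State::Plain) ? "<think>" : "</think>";
        size_t keep = 0;
        for (size_t n = std::min(pending_.size(), tag.size() - 1); n > 0; --n) {
            if (pending_.compare(pending_.size() - n, n, tag, 0, n) == 0) { keep = n; break; }
        }
        (state_ == State::Plain ? out.content : out.reasoning)
            += pending_.substr(0, pending_.size() - keep);
        pending_.erase(0, pending_.size() - keep);
        return out;
    }
};
```

Each feed() call returns that chunk's reasoning/content deltas, which map directly onto delta.reasoning_content / delta.content in the OpenAI streaming path above.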
…er_tools.py

Qwen3.6 with enable_thinking=True reasons freely inside <think> and then
generates tool calls in arbitrary XML (e.g. <Bash>) rather than the
<tool_call> format the chat template requires. Force enable_thinking=False
whenever tools are in the request; callers can override via
chat_template_kwargs if needed.

Add DFLASH_DEBUG_PROMPT env var to log rendered prompts and received tool
names to stderr for debugging.

Switch start_server.sh from server.py to server_tools.py so tool calling
is actually active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 28 files

@smpurkis

smpurkis commented May 7, 2026

You beat me to it! I just made a similar PR, #119

Phase 9 (K/V transpose) measured on Strix Halo:
- 64K: TTFT 73s→43s (1.7×), NIAH now 2/2 ✅ (was 0/2)
- 128K: TTFT 173s→96s (1.8×), NIAH 2/2 ✅

Gap vs RTX 3090 cut from ~5–7× to ~3–4×, near the
256/936 GB/s bandwidth-ratio floor. Also adds estimated
llama.cpp baseline comparison (~7× speedup at 64K/128K).
@a-huk
Author

a-huk commented May 7, 2026

| Metric | 32K | 64K | 128K |
|---|---|---|---|
| FP step (drafter forward) | ~5.0 s | ~19.5 s | ~49–50 s |
| Total compress | ~9.8 s | ~23.7 s | ~59 s |
| Tail-score | | ~8.5 s | |
| Target prefill | | ~15.7 s (3298 tok) | ~33 s (6539 tok) |
| TTFT | | ~43 s | ~95–96 s |
| NIAH | | 2/2 ✅ | 2/2 ✅ |
| vs RTX 3090 | | 3.2× slower (ref 13.5 s) | 3.9× slower (ref 24.8 s) |

I think these numbers are in line with the 3090, as Strix Halo has about 3.7× lower memory bandwidth (256 vs 936 GB/s).


@cubic-dev-ai cubic-dev-ai Bot left a comment


11 issues found across 28 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/start_server.sh">

<violation number="1" location="scripts/start_server.sh:17">
P2: Hardcoded absolute paths make this startup script fail on hosts that do not match the author's local directory layout.</violation>
</file>

<file name="dflash/hip_compat/cuda_bf16.h">

<violation number="1" location="dflash/hip_compat/cuda_bf16.h:64">
P2: Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</violation>
</file>

<file name="docs/PROGRESS.md">

<violation number="1" location="docs/PROGRESS.md:90">
P2: The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</violation>
</file>

<file name="dflash/scripts/server_tools.py">

<violation number="1" location="dflash/scripts/server_tools.py:478">
P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</violation>
</file>

<file name="dflash/src/qwen3_drafter.cpp">

<violation number="1" location="dflash/src/qwen3_drafter.cpp:106">
P2: Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</violation>
</file>

<file name="dflash/src/qwen3_0p6b_graph.cpp">

<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:170">
P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</violation>

<violation number="2" location="dflash/src/qwen3_0p6b_graph.cpp:443">
P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</violation>
</file>

<file name="dflash/src/flashprefill_kernels.hip.cu">

<violation number="1" location="dflash/src/flashprefill_kernels.hip.cu:73">
P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</violation>

<violation number="2" location="dflash/src/flashprefill_kernels.hip.cu:108">
P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</violation>
</file>

<file name="dflash/src/bsa_launcher_hip.cu">

<violation number="1" location="dflash/src/bsa_launcher_hip.cu:77">
P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</violation>
</file>

<file name="dflash/test/bench_score.cpp">

<violation number="1" location="dflash/test/bench_score.cpp:10">
P1: Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

-if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
-    !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
+    !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||


P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 170:

<comment>Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</comment>

<file context>
@@ -163,10 +166,11 @@ bool forward_qwen3_0p6b_drafter(
-            if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
-                !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+            if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il])   ||
+                !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||
+                !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il])   ||
                 !make_pers(w.backend, GGML_TYPE_F32, 3, d_ql, Q_last_v[il])) {
</file context>

@@ -0,0 +1,704 @@
// ROCm/HIP port of flashprefill_kernels.cu — all 4 kernels.


P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 73:

<comment>Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</comment>

<file context>
@@ -0,0 +1,704 @@
+    const int n_k_blocks = (seq_len + block_size - 1) / block_size;
+    dim3 grid(n_k_blocks, batch * n_kv_heads, 1);
+    dim3 block(head_dim, 1, 1);
+    if (head_dim == 128 && block_size == 128)
+        compute_mean_vector_kernel_bf16<128, 128><<<grid, block, 0, stream>>>(
+            (const hip_bfloat16 *)K, (hip_bfloat16 *)mean_K,
</file context>
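
A hedged sketch of what the review is asking for: have the wrapper reject unsupported shapes loudly rather than silently skipping the launch (the wrapper signature and message below are illustrative, not the PR's actual code):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Illustrative wrapper shape; the real launcher lives in flashprefill_kernels.hip.cu.
hipError_t launch_compute_mean_vector_bf16(/* ...tensor args..., */ int head_dim,
                                           int block_size, hipStream_t stream) {
    if (head_dim == 128 && block_size == 128) {
        // compute_mean_vector_kernel_bf16<128, 128><<<grid, block, 0, stream>>>(...);
        return hipGetLastError();          // also surface kernel-launch failures
    }
    std::fprintf(stderr, "flashprefill: unsupported head_dim=%d block_size=%d\n",
                 head_dim, block_size);
    return hipErrorInvalidValue;           // caller can bail out instead of reading stale output
}
```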

const int kh = qh * n_k_heads / n_q_heads;
const int tid = threadIdx.x;
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;


P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 108:

<comment>Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</comment>

<file context>
@@ -0,0 +1,704 @@
+    const int kh           = qh * n_k_heads / n_q_heads;
+    const int tid          = threadIdx.x;
+    const int q_row_global = q_block_idx * BLOCK + tid;
+    if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;
+
+    const hip_bfloat16 * Qp = Q + (size_t)b * s_Q_b
</file context>
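
A sketch of the guard the comment suggests, assuming the kernel receives (or can receive) the true seq_len alongside the block count:

```cpp
// Bound the tail against the real sequence length, not the rounded-up block count,
// so the last partial block never reads past the end of Q.
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= seq_len) return;   // seq_len assumed passed as a kernel arg
```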

@@ -0,0 +1,104 @@
// HIP Phase 3: BSA-compatible launcher wrapping our rocWMMA sparse FA kernel.


P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/bsa_launcher_hip.cu, line 77:

<comment>Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</comment>

<file context>
@@ -0,0 +1,104 @@
+    if (kv_bytes > kv_buf_cap) {
+        if (kv_buf_K) hipFree(kv_buf_K);
+        if (kv_buf_V) hipFree(kv_buf_V);
+        hipMalloc(&kv_buf_K, kv_bytes);
+        hipMalloc(&kv_buf_V, kv_bytes);
+        kv_buf_cap = kv_bytes;
</file context>
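
One way to make the grow path fail loudly, sketched against the buffer names shown above (the error handling is illustrative, not the PR's code):

```cpp
#include <hip/hip_runtime.h>

// Persistent transpose buffers (names mirror the snippet above).
static void * kv_buf_K   = nullptr;
static void * kv_buf_V   = nullptr;
static size_t kv_buf_cap = 0;

// Grow the buffers; on failure, leave them empty and report it instead of returning success.
static bool ensure_kv_buffers(size_t kv_bytes) {
    if (kv_bytes <= kv_buf_cap) return true;
    if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }
    if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
    kv_buf_cap = 0;
    if (hipMalloc(&kv_buf_K, kv_bytes) != hipSuccess ||
        hipMalloc(&kv_buf_V, kv_bytes) != hipSuccess) {
        if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }   // free whichever one succeeded
        if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
        return false;                                              // caller must propagate the failure
    }
    kv_buf_cap = kv_bytes;
    return true;
}
```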

@@ -0,0 +1,58 @@
// Standalone timing of flash_prefill_forward_bf16 at multiple seq lengths.


P1: Local API re-declaration uses a different FlashPrefillConfig layout than the real header, so the benchmark passes an incompatible config object to flash_prefill_forward_bf16.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/bench_score.cpp, line 10:

<comment>Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</comment>

<file context>
@@ -0,0 +1,58 @@
+#include <hip/hip_runtime.h>
+#include <hip/hip_bfloat16.h>
+
+namespace dflash27b { namespace flashprefill {
+struct FlashPrefillConfig {
+    int block_size = 128;
</file context>

return r;
}
inline hip_bfloat16 __float2bfloat16_rn(float x) {
return __float2bfloat16(x);


P2: Host-only __float2bfloat16_rn incorrectly truncates instead of rounding to nearest-even.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/hip_compat/cuda_bf16.h, line 64:

<comment>Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</comment>

<file context>
@@ -0,0 +1,66 @@
+    return r;
+}
+inline hip_bfloat16 __float2bfloat16_rn(float x) {
+    return __float2bfloat16(x);
+}
+#endif
</file context>
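
For reference, a minimal host-side float to bf16 conversion with round-to-nearest-even, which is the behaviour the review expects from __float2bfloat16_rn (standalone sketch; the shim's actual types may differ):

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even float -> bf16, returned as the raw 16-bit pattern.
static inline uint16_t float_to_bf16_rne(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u)         // NaN: keep it a NaN after truncation
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7fffu + ((bits >> 16) & 1u);          // add 0x7fff plus the kept LSB: ties go to even
    return (uint16_t)(bits >> 16);
}
```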

Comment thread docs/PROGRESS.md
### Phase 4 build (complete — use this)
```bash
podman start vllm
ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
```


P2: The Phase 4 build command sets env vars before podman exec instead of passing them into the container, so bash -c/cmake will not reliably see ROCM_PATH, CC, or CXX.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/PROGRESS.md, line 90:

<comment>The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</comment>

<file context>
@@ -0,0 +1,153 @@
+### Phase 4 build (complete — use this)
+```bash
+podman start vllm
+ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+cmake -S . -B build-hip-phase2 -DDFLASH27B_USE_HIP=ON -DDFLASH27B_HIP_SM80_EQUIV=ON \
+  -DCMAKE_HIP_COMPILER=/opt/venv/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/bin/clang++
</file context>
Suggested change
- ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+ podman exec -e ROCM_PATH=/opt/rocm -e CC=gcc -e CXX=g++ -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "

# (e.g. <Bash>) instead of the <tool_call> format the template requires.
# Force thinking off whenever tools are present unless the caller
# explicitly opts back in via chat_template_kwargs.
kwargs["enable_thinking"] = False


P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-<think> tool responses can be misclassified as reasoning and lose assistant content.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server_tools.py, line 478:

<comment>Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</comment>

<file context>
@@ -470,10 +470,18 @@ def _tokenize_prompt(req: ChatRequest) -> tuple[Path, bool]:
+            # (e.g. <Bash>) instead of the <tool_call> format the template requires.
+            # Force thinking off whenever tools are present unless the caller
+            # explicitly opts back in via chat_template_kwargs.
+            kwargs["enable_thinking"] = False
         # Per-request chat template knobs (e.g. enable_thinking, preserve_thinking).
         if req.chat_template_kwargs:
</file context>

 std::vector<float> running_max;
-if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+std::vector<float> norm_scores;
+if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {


P2: Unconditionally computing norm_scores adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_drafter.cpp, line 106:

<comment>Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</comment>

<file context>
@@ -102,7 +102,8 @@ std::vector<int32_t> drafter_score_and_compress(
     std::vector<float> running_max;
-    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+    std::vector<float> norm_scores;
+    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {
         return {};
     }
</file context>
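
A sketch of the gating the review suggests: read the blend weight once and only ask the drafter for norm scores when it is non-zero (the env-var name comes from the commit message; the pointer-taking overload is hypothetical, not the PR's current signature):

```cpp
#include <cstdlib>
#include <vector>

// Blend weight for the ||h||^2 norm scoring; 0 / unset means the feature is off.
static float dflash_norm_weight() {
    const char * v = std::getenv("DFLASH_FP_NORM_WEIGHT");
    return v ? std::strtof(v, nullptr) : 0.0f;
}

// Call-site shape (names from the snippet above; the nullable out-param is an assumption):
//   std::vector<float> running_max, norm_scores;
//   const bool want_norm = dflash_norm_weight() > 0.0f;
//   if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max,
//                                   want_norm ? &norm_scores : nullptr)) {
//       return {};
//   }
```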

// ggml_sum_rows reduces the hidden dimension → [1, S].
// Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
{
ggml_init_params ip{};


P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 443:

<comment>Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</comment>

<file context>
@@ -426,6 +436,30 @@ bool forward_qwen3_0p6b_drafter(
+    // ggml_sum_rows reduces the hidden dimension → [1, S].
+    // Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
+    {
+        ggml_init_params ip{};
+        ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
+        ip.no_alloc = true;
</file context>
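
A sketch of the checks the review wants around the side-graph setup (variable names such as gf_norm are assumptions; the real code builds the graph inline as shown above):

```cpp
// Report OOM from the norm-score side graph as `false` instead of computing on bad state.
ggml_init_params ip{};
ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
ip.no_alloc = true;
ggml_context * nctx = ggml_init(ip);
if (!nctx) return false;                                      // metadata context failed

// ... build the sum-rows graph `gf_norm` in nctx ...

ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(w.backend));
if (!galloc || !ggml_gallocr_alloc_graph(galloc, gf_norm)) {  // backend buffer allocation can fail
    if (galloc) ggml_gallocr_free(galloc);
    ggml_free(nctx);
    return false;
}
if (ggml_backend_graph_compute(w.backend, gf_norm) != GGML_STATUS_SUCCESS) {
    ggml_gallocr_free(galloc);
    ggml_free(nctx);
    return false;
}
```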

@a-huk
Author

a-huk commented May 7, 2026

> You beat me to it! I just made a similar PR, #119

Timing is insane, nice to see the work you put in, looking forward to support :D

@davide221
Contributor

@a-huk thanks for the great contribution! Can you check the Cubic P1 diagnostics and rebase to avoid conflicts?

smpurkis added a commit to smpurkis/lucebox-hub that referenced this pull request May 7, 2026
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM /
select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON.
Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq
length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at
S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases,
output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP
in qwen3_0p6b_graph.cpp, so the new kernels were linked but never
reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
  - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading
        h_after)
  - P1: restore prefix-cache impls in qwen35_target_graph; drop the
        stub prefix_cache.cpp
  - P1: cap->conv_input view sizing
  - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
  - P2: n_vocab guard in gguf_target_loader
  - P2: free_drafter on prewarm fail; portable env restore via lambda
        (Windows _dupenv_s/_putenv_s); thread pool_kernel through
        qwen35_score_and_compress
  - P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>