Initial support for gfx1151/strix halo #117
Conversation
Phase 1-9 port of dflash/pflash from NVIDIA CUDA to AMD Strix Halo.

New files:
- dflash/src/flashprefill_kernels.hip.cu: rocWMMA sparse FA kernels (k1-k4 + transpose pre-pass)
- dflash/src/bsa_launcher_hip.cu: BSA launcher wrapping kernel 4, with K/V transpose opt
- dflash/src/f16_convert.hip.cu: BF16 conversion kernel
- dflash/hip_compat/: CUDA→HIP shim headers
- dflash/test/bench_score.cpp: multi-seq-length kernel benchmark

Modified:
- dflash/CMakeLists.txt: HIP build target, rocWMMA, GGML-HIP integration
- dflash/src/flashprefill.cpp: BSA runtime toggle (DFLASH_FP_USE_BSA)
- dflash/src/qwen3_0p6b_drafter.h: norm_scores out-param for hidden-state scoring
- dflash/src/qwen3_0p6b_graph.cpp: ||h||² norm scoring block, NoPE tail-score fix
- dflash/src/qwen3_drafter.cpp: DFLASH_FP_NORM_WEIGHT blend, chunk debug logging
- pflash/pflash/dflash_client.py: boot_vram_mib tuned for 14.91 GiB model
- pflash/tests/bench_niah_cpp.py: enable_thinking=False fix, corrected model paths

Performance (Strix Halo, gfx1151, 256 GB/s bandwidth):
- Decode: 11.7 AR tok/s → 29 tok/s with DFlash spec decode (2.5×)
- PFlash compress at 32K: ~14.5s → ~9.8s (Phase 9 K/V transpose fix)
- PFlash compress at 128K: ~135s, NIAH 2/2 at keep=0.05
- bench_score BSA+transpose speedup: 3.1× at 32K, 4.2× at 64K vs dense

Docs and scripts:
- docs/BENCHMARKS.md: Phase 1-9 timing results, NIAH accuracy, HumanEval numbers
- docs/PROGRESS.md: phase-by-phase implementation log
- docs/SERVING.md: OpenAI server setup, keep_ratio tuning, troubleshooting
- scripts/start_server.sh: one-command server launch for Qwen3.6-27B + PFlash
Adds ThinkSplitter, a stateful streaming parser that routes tokens into reasoning_content vs content as soon as the opening <think> tag is detected — no buffering of the full thinking block required. OpenAI path: streaming emits delta.reasoning_content / delta.content; non-streaming adds reasoning_content to message. Anthropic path: thinking becomes a type:thinking content block (index 0), answer is type:text (index 1), matching the Extended Thinking wire format. If no <think> tag is present everything flows as plain content unchanged. Flag is on by default in start_server.sh.
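For illustration only: the splitting logic described above is a small state machine. The actual ThinkSplitter in this PR lives in the Python server; the sketch below renders the same idea in C++ with invented names, just to make the routing behavior concrete.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Illustrative sketch only, not the PR's code: route a streamed chunk into
// (reasoning_delta, content_delta) around a leading <think>...</think> block.
class ThinkSplitterSketch {
public:
    std::pair<std::string, std::string> feed(const std::string & chunk) {
        buf_ += chunk;
        std::string reasoning, content;
        for (;;) {
            if (state_ == State::Before) {
                static const std::string open = "<think>";
                if (buf_.size() >= open.size()) {
                    if (buf_.compare(0, open.size(), open) == 0) {
                        buf_.erase(0, open.size());
                        state_ = State::Inside;
                    } else {
                        state_ = State::After;           // no <think>: plain content stream
                    }
                    continue;
                }
                if (open.compare(0, buf_.size(), buf_) == 0) break;  // might still be <think>
                state_ = State::After;
            } else if (state_ == State::Inside) {
                static const std::string close = "</think>";
                const auto pos = buf_.find(close);
                if (pos == std::string::npos) {
                    // Hold back a possible partial closing tag; emit the rest as reasoning.
                    const size_t keep = std::min(buf_.size(), close.size() - 1);
                    reasoning += buf_.substr(0, buf_.size() - keep);
                    buf_.erase(0, buf_.size() - keep);
                    break;
                }
                reasoning += buf_.substr(0, pos);
                buf_.erase(0, pos + close.size());
                state_ = State::After;
            } else {                                     // State::After: answer tokens
                content += buf_;
                buf_.clear();
                break;
            }
        }
        return {reasoning, content};
    }

private:
    enum class State { Before, Inside, After };
    State state_ = State::Before;
    std::string buf_;
};
```

Per chunk, the two deltas map straight onto delta.reasoning_content / delta.content (OpenAI path) or the type:thinking / type:text content blocks (Anthropic path).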
…er_tools.py Qwen3.6 with enable_thinking=True reasons freely inside <think> and then generates tool calls in arbitrary XML (e.g. <Bash>) rather than the <tool_call> format the chat template requires. Force enable_thinking=False whenever tools are in the request; callers can override via chat_template_kwargs if needed. Add DFLASH_DEBUG_PROMPT env var to log rendered prompts and received tool names to stderr for debugging. Switch start_server.sh from server.py to server_tools.py so tool calling is actually active. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
You beat me to it! I just made a similar PR, #119
Phase 9 (K/V transpose) measured on Strix Halo:
- 64K: TTFT 73s→43s (1.7×), NIAH now 2/2 ✅ (was 0/2)
- 128K: TTFT 173s→96s (1.8×), NIAH 2/2 ✅

Gap vs RTX 3090 cut from ~5–7× to ~3–4×, near the 256/936 GB/s bandwidth-ratio floor. Also adds an estimated llama.cpp baseline comparison (~7× speedup at 64K/128K).
I think these numbers are in line with the 3090, as the Strix Halo has about 3.7x slower bandwidth.
11 issues found across 28 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/start_server.sh">
<violation number="1" location="scripts/start_server.sh:17">
P2: Hardcoded absolute paths make this startup script fail on hosts that do not match the author's local directory layout.</violation>
</file>
<file name="dflash/hip_compat/cuda_bf16.h">
<violation number="1" location="dflash/hip_compat/cuda_bf16.h:64">
P2: Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</violation>
</file>
<file name="docs/PROGRESS.md">
<violation number="1" location="docs/PROGRESS.md:90">
P2: The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</violation>
</file>
<file name="dflash/scripts/server_tools.py">
<violation number="1" location="dflash/scripts/server_tools.py:478">
P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</violation>
</file>
<file name="dflash/src/qwen3_drafter.cpp">
<violation number="1" location="dflash/src/qwen3_drafter.cpp:106">
P2: Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</violation>
</file>
<file name="dflash/src/qwen3_0p6b_graph.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:170">
P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</violation>
<violation number="2" location="dflash/src/qwen3_0p6b_graph.cpp:443">
P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</violation>
</file>
<file name="dflash/src/flashprefill_kernels.hip.cu">
<violation number="1" location="dflash/src/flashprefill_kernels.hip.cu:73">
P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</violation>
<violation number="2" location="dflash/src/flashprefill_kernels.hip.cu:108">
P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</violation>
</file>
<file name="dflash/src/bsa_launcher_hip.cu">
<violation number="1" location="dflash/src/bsa_launcher_hip.cu:77">
P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</violation>
</file>
<file name="dflash/test/bench_score.cpp">
<violation number="1" location="dflash/test/bench_score.cpp:10">
P1: Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
-    if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
-        !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+    if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
+        !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||
P1: Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 170:
<comment>Adds a second full per-layer K cache, pushing the long-context memory footprint beyond the documented 24GB budget.</comment>
<file context>
@@ -163,10 +166,11 @@ bool forward_qwen3_0p6b_drafter(
- if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
- !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
+ if (!make_pers(w.backend, half_type, 3, d_kv, K_curr_v[il]) ||
+ !make_pers(w.backend, half_type, 3, d_kv, K_norope_v[il]) ||
+ !make_pers(w.backend, half_type, 3, d_kv, V_curr_v[il]) ||
!make_pers(w.backend, GGML_TYPE_F32, 3, d_ql, Q_last_v[il])) {
</file context>
@@ -0,0 +1,704 @@
// ROCm/HIP port of flashprefill_kernels.cu — all 4 kernels.
P1: Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 73:
<comment>Public launch wrappers silently no-op on unsupported shapes instead of reporting an error, which can leave the attention pipeline with untouched/uninitialized outputs.</comment>
<file context>
@@ -0,0 +1,704 @@
+ const int n_k_blocks = (seq_len + block_size - 1) / block_size;
+ dim3 grid(n_k_blocks, batch * n_kv_heads, 1);
+ dim3 block(head_dim, 1, 1);
+ if (head_dim == 128 && block_size == 128)
+ compute_mean_vector_kernel_bf16<128, 128><<<grid, block, 0, stream>>>(
+ (const hip_bfloat16 *)K, (hip_bfloat16 *)mean_K,
</file context>
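If the diagnostic holds, one option is to make the wrapper report unsupported shapes and propagate a launch status instead of silently returning. The sketch below shows the pattern only; the wrapper name, argument list, and the stub kernel are assumptions, not the PR's actual API.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_bfloat16.h>
#include <cstdio>

// Stand-in for compute_mean_vector_kernel_bf16; the real kernel lives in the PR.
template <int HEAD_DIM, int BLOCK>
__global__ void mean_vector_stub(const hip_bfloat16 * K, hip_bfloat16 * mean_K, int seq_len) {
    // ... real reduction omitted ...
}

// Sketch: validate the shape, launch only when supported, and return a hipError_t so
// callers cannot silently receive untouched or uninitialized output buffers.
hipError_t launch_mean_vector(const void * K, void * mean_K,
                              int batch, int n_kv_heads, int seq_len,
                              int head_dim, int block_size, hipStream_t stream) {
    if (head_dim != 128 || block_size != 128) {
        std::fprintf(stderr, "flashprefill: unsupported shape head_dim=%d block_size=%d\n",
                     head_dim, block_size);
        return hipErrorInvalidValue;   // surface the failure instead of a silent no-op
    }
    const int n_k_blocks = (seq_len + block_size - 1) / block_size;
    dim3 grid(n_k_blocks, batch * n_kv_heads, 1);
    dim3 block(head_dim, 1, 1);
    mean_vector_stub<128, 128><<<grid, block, 0, stream>>>(
        (const hip_bfloat16 *)K, (hip_bfloat16 *)mean_K, seq_len);
    return hipGetLastError();          // also catches bad launch configurations
}
```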
const int kh = qh * n_k_heads / n_q_heads;
const int tid = threadIdx.x;
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;
P1: Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/flashprefill_kernels.hip.cu, line 108:
<comment>Tail rows are bounded by block count, not seq_len, so the last partial block can read past Q.</comment>
<file context>
@@ -0,0 +1,704 @@
+ const int kh = qh * n_k_heads / n_q_heads;
+ const int tid = threadIdx.x;
+ const int q_row_global = q_block_idx * BLOCK + tid;
+ if (tid >= BLOCK || q_row_global >= q_block_idx_max * BLOCK) return;
+
+ const hip_bfloat16 * Qp = Q + (size_t)b * s_Q_b
</file context>
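If this is valid, the usual fix is to clamp the last partial block against the true number of Q rows rather than the block-rounded count. A minimal sketch; it assumes `seq_len` is (or would be added as) a kernel argument:

```cpp
// Sketch: bound tail rows by the real sequence length, not q_block_idx_max * BLOCK,
// so the last partial block never reads past the end of Q.
const int q_row_global = q_block_idx * BLOCK + tid;
if (tid >= BLOCK || q_row_global >= seq_len) return;
```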
@@ -0,0 +1,104 @@
// HIP Phase 3: BSA-compatible launcher wrapping our rocWMMA sparse FA kernel.
P1: Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/bsa_launcher_hip.cu, line 77:
<comment>Unchecked hipMalloc failures can leave the persistent transpose buffers invalid while the function still returns success.</comment>
<file context>
@@ -0,0 +1,104 @@
+ if (kv_bytes > kv_buf_cap) {
+ if (kv_buf_K) hipFree(kv_buf_K);
+ if (kv_buf_V) hipFree(kv_buf_V);
+ hipMalloc(&kv_buf_K, kv_bytes);
+ hipMalloc(&kv_buf_V, kv_bytes);
+ kv_buf_cap = kv_bytes;
</file context>
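One way to address this, sketched with the same variable names as the excerpt; the surrounding function is assumed to return a success flag to its caller:

```cpp
// Sketch: grow the persistent transpose buffers, but fail loudly if allocation fails
// instead of caching dangling pointers and a stale capacity.
if (kv_bytes > kv_buf_cap) {
    if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }
    if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
    kv_buf_cap = 0;
    if (hipMalloc(&kv_buf_K, kv_bytes) != hipSuccess ||
        hipMalloc(&kv_buf_V, kv_bytes) != hipSuccess) {
        if (kv_buf_K) { hipFree(kv_buf_K); kv_buf_K = nullptr; }
        if (kv_buf_V) { hipFree(kv_buf_V); kv_buf_V = nullptr; }
        return false;               // propagate the OOM instead of reporting success
    }
    kv_buf_cap = kv_bytes;
}
```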
@@ -0,0 +1,58 @@
// Standalone timing of flash_prefill_forward_bf16 at multiple seq lengths.
P1: Local API re-declaration uses a different FlashPrefillConfig layout than the real header, so the benchmark passes an incompatible config object to flash_prefill_forward_bf16.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/test/bench_score.cpp, line 10:
<comment>Local API re-declaration uses a different `FlashPrefillConfig` layout than the real header, so the benchmark passes an incompatible config object to `flash_prefill_forward_bf16`.</comment>
<file context>
@@ -0,0 +1,58 @@
+#include <hip/hip_runtime.h>
+#include <hip/hip_bfloat16.h>
+
+namespace dflash27b { namespace flashprefill {
+struct FlashPrefillConfig {
+ int block_size = 128;
</file context>
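If the layouts really do differ, the safest fix is for the benchmark to include the project's real header rather than re-declaring the struct. The header name below is an assumption; use whichever header actually declares flash_prefill_forward_bf16.

```cpp
// Sketch: pull the real declaration instead of duplicating it in bench_score.cpp.
#include "flashprefill.h"   // assumed header name

// With the duplicate FlashPrefillConfig removed, any field reordering in the real header
// is caught at compile time instead of silently corrupting the config passed to the kernel.
```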
    return r;
}
inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return __float2bfloat16(x);
P2: Host-only __float2bfloat16_rn incorrectly truncates instead of rounding to nearest-even.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/hip_compat/cuda_bf16.h, line 64:
<comment>Host-only `__float2bfloat16_rn` incorrectly truncates instead of rounding to nearest-even.</comment>
<file context>
@@ -0,0 +1,66 @@
+ return r;
+}
+inline hip_bfloat16 __float2bfloat16_rn(float x) {
+ return __float2bfloat16(x);
+}
+#endif
</file context>
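For reference, a host-side round-to-nearest-even conversion can be done on the raw bits. This is a generic sketch, not the PR's code; it assumes hip_bfloat16 exposes its 16-bit storage as a public `data` member, as in ROCm's hip_bfloat16.h.

```cpp
#include <hip/hip_bfloat16.h>
#include <cstdint>
#include <cstring>

// Sketch: float -> bf16 with round-to-nearest-even on the host.
inline hip_bfloat16 float2bfloat16_rn_host(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    hip_bfloat16 out;
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate but force a mantissa bit so the result stays a NaN.
        out.data = static_cast<uint16_t>((bits >> 16) | 0x0040u);
        return out;
    }
    // Round to nearest, ties to even: add 0x7fff plus the LSB of the part we keep.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    out.data = static_cast<uint16_t>(bits >> 16);
    return out;
}
```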
### Phase 4 build (complete — use this)
```bash
podman start vllm
ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
P2: The Phase 4 build command sets env vars before podman exec instead of passing them into the container, so bash -c/cmake will not reliably see ROCM_PATH, CC, or CXX.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/PROGRESS.md, line 90:
<comment>The Phase 4 build command sets env vars before `podman exec` instead of passing them into the container, so `bash -c`/`cmake` will not reliably see `ROCM_PATH`, `CC`, or `CXX`.</comment>
<file context>
@@ -0,0 +1,153 @@
+### Phase 4 build (complete — use this)
+```bash
+podman start vllm
+ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+cmake -S . -B build-hip-phase2 -DDFLASH27B_USE_HIP=ON -DDFLASH27B_HIP_SM80_EQUIV=ON \
+ -DCMAKE_HIP_COMPILER=/opt/venv/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/bin/clang++
</file context>
Suggested change:
-ROCM_PATH=/opt/rocm CC=gcc CXX=g++ podman exec -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
+podman exec -e ROCM_PATH=/opt/rocm -e CC=gcc -e CXX=g++ -w /home/hukad/specprefill/lucebox-hub/dflash vllm bash -c "
    # (e.g. <Bash>) instead of the <tool_call> format the template requires.
    # Force thinking off whenever tools are present unless the caller
    # explicitly opts back in via chat_template_kwargs.
    kwargs["enable_thinking"] = False
P2: Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-<think> tool responses can be misclassified as reasoning and lose assistant content.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server_tools.py, line 478:
<comment>Tool prompts are rendered with thinking disabled, but response parsing still defaults to thinking-enabled, so non-`<think>` tool responses can be misclassified as reasoning and lose assistant content.</comment>
<file context>
@@ -470,10 +470,18 @@ def _tokenize_prompt(req: ChatRequest) -> tuple[Path, bool]:
+ # (e.g. <Bash>) instead of the <tool_call> format the template requires.
+ # Force thinking off whenever tools are present unless the caller
+ # explicitly opts back in via chat_template_kwargs.
+ kwargs["enable_thinking"] = False
# Per-request chat template knobs (e.g. enable_thinking, preserve_thinking).
if req.chat_template_kwargs:
</file context>
     std::vector<float> running_max;
-    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+    std::vector<float> norm_scores;
+    if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {
P2: Unconditionally computing norm_scores adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_drafter.cpp, line 106:
<comment>Unconditionally computing `norm_scores` adds avoidable GPU work and host transfer on every drafter call, even when norm blending is disabled.</comment>
<file context>
@@ -102,7 +102,8 @@ std::vector<int32_t> drafter_score_and_compress(
std::vector<float> running_max;
- if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max)) {
+ std::vector<float> norm_scores;
+ if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max, norm_scores)) {
return {};
}
</file context>
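If this is worth avoiding, the blend weight already comes from DFLASH_FP_NORM_WEIGHT, so the call site could request norm scores only when the weight is non-zero. Rough sketch under the excerpt's names; the optional out-parameter is a hypothetical change to forward_qwen3_0p6b_drafter, not its current signature:

```cpp
// Sketch: only ask the drafter graph for ||h||^2 scores when the blend is actually enabled,
// so the norm-score subgraph and its GPU->CPU copy are skipped otherwise.
const char * nw_env = std::getenv("DFLASH_FP_NORM_WEIGHT");
const float norm_weight = nw_env ? std::strtof(nw_env, nullptr) : 0.0f;
const bool  want_norm_scores = norm_weight > 0.0f;

std::vector<float> running_max;
std::vector<float> norm_scores;
if (!forward_qwen3_0p6b_drafter(ctx.weights, ids, n_lookahead, running_max,
                                want_norm_scores ? &norm_scores : nullptr)) {
    return {};
}
```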
    // ggml_sum_rows reduces the hidden dimension → [1, S].
    // Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
    {
        ggml_init_params ip{};
P2: Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/src/qwen3_0p6b_graph.cpp, line 443:
<comment>Norm-score graph allocation/init failures are unchecked, so OOM can flow into compute/get calls instead of returning false.</comment>
<file context>
@@ -426,6 +436,30 @@ bool forward_qwen3_0p6b_drafter(
+ // ggml_sum_rows reduces the hidden dimension → [1, S].
+ // Cheap: S*4 bytes GPU→CPU (512KB at 128K tokens).
+ {
+ ggml_init_params ip{};
+ ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
+ ip.no_alloc = true;
</file context>
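If allocation failures should propagate, a minimal sketch of the guard follows. ggml_init returns nullptr on failure; the allocator check is indicated in a comment because the surrounding graph-building code isn't shown here.

```cpp
// Sketch: bail out of forward_qwen3_0p6b_drafter instead of feeding a null context
// into graph building when the small norm-score graph cannot be set up.
ggml_init_params ip{};
ip.mem_size = ggml_tensor_overhead() * 8 + ggml_graph_overhead();
ip.no_alloc = true;
ggml_context * nctx = ggml_init(ip);
if (!nctx) {
    return false;                       // OOM: surface it to the caller
}
// ... build the sum_rows graph in nctx, then check the allocator the same way:
// if (!ggml_gallocr_alloc_graph(galloc, gf)) { ggml_free(nctx); return false; }
```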
Timing is insane, nice to see the work you put in, looking forward to support :D
@a-huk thanks for the great contribution! Can you check Cubic P1 diagnostics and rebase to avoid conflict?
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM / select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON. Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases, output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP in qwen3_0p6b_graph.cpp, so the new kernels were linked but never reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
- P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading h_after)
- P1: restore prefix-cache impls in qwen35_target_graph; drop the stub prefix_cache.cpp
- P1: cap->conv_input view sizing
- P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
- P2: n_vocab guard in gguf_target_loader
- P2: free_drafter on prewarm fail; portable env restore via lambda (Windows _dupenv_s/_putenv_s); thread pool_kernel through qwen35_score_and_compress
- P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi guys,
I saw this about a week ago on /r/LocalLLaMA and found it super cool, really getting the most out of the hardware.
The reason I was interested is that I have a Strix Halo machine to run local LLMs, but prompt processing on long prompts really kills it, taking dozens of minutes to finish prefill. So I thought: could I port it over with Claude?
I am a software dev, but with no GPU background, so this was rather new territory; still, I think the result is something workable. I hope it can be useful, or at least speed up the actual port to Strix Halo.
I have tested with Unsloth's Qwen3.6-27B-UD-Q4_K_XL.gguf, z-lab/Qwen3.6-27B-DFlash and Qwen3-0.6B-BF16.gguf.
As I say, really vibecoded, but voila, let me attach the benchmarks that Claude did :D