feat(audio-car-cockpit): Improved ROCm GPU automatic detection and added CPU override #74

Open
ThomasGmeinder wants to merge 9 commits into Liquid4All:main from ThomasGmeinder:ROCm_support

@ThomasGmeinder
Contributor

Summary

  • Auto-detect AMD ROCm GPUs via rocm-smi and build llama.cpp with HIP acceleration when available, falling back to CPU-only build otherwise
  • Auto-detect GPU architecture (e.g. gfx1151) from rocm-smi instead of hardcoding gfx1150, and pass --n-gpu-layers 9999 to offload model layers to the GPU
  • Fix audio streaming to handle both the ROCm-built audio server (delta.audio, int16 PCM) and the pre-built CPU binary (delta.audio_chunk, float32 PCM)
  • Add CPU=1 make option to force a CPU-only build even when ROCm is available
  • Add make clean target and check_system.sh diagnostic script

Test report

  • Tested on AMD Ryzen with Radeon iGPU (gfx1151) using ROCm 7.2
  • Tested CPU-only build with make CPU=1 audioserver && make CPU=1 serve
  • Audio playback verified on both ROCm and CPU paths

Made with Cursor

Thomas Gmeinder and others added 9 commits March 13, 2026 20:25
…ers for ROCm builds

Auto-detect HIP_ARCH from rocminfo instead of hardcoding gfx1150, with
fallback if detection fails. Pass --n-gpu-layers 9999 to the audio server
when ROCm is present so model layers are offloaded to the GPU. Also add
a clean target and a check_system.sh helper script.

Made-with: Cursor
… server

The audio server built from PR #18641 (ROCm path) returns the audio
field as delta.audio with int16 PCM format, while the pre-built CPU
binary uses delta.audio_chunk with float32 PCM. Handle both field names
in server.py and auto-detect the PCM format (int16 vs float32) in the
browser for correct playback on both build paths.

Made-with: Cursor
…ng HIP build

Replace the /opt/rocm directory check with rocm-smi --showproductname
to verify a GPU is actually present and the driver is loaded. Falls back
to CPU build if ROCm is installed but not functional. Also use rocm-smi
for HIP_ARCH detection instead of rocminfo.

Made-with: Cursor
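
The detection flow described in this commit could be sketched in Python roughly as follows. This is an illustration only: the PR implements the check in the Makefile, and the success criterion (zero exit code with non-empty output), the regular expression, and the `gfx1150` fallback value are assumptions, as are the helper names `rocm_gpu_present` and `parse_gfx_arch`.

```python
import re
import shutil
import subprocess

# Assumption: fall back to the previously hardcoded architecture.
FALLBACK_ARCH = "gfx1150"


def rocm_gpu_present(timeout_s: float = 10.0) -> bool:
    """Return True only if rocm-smi is installed and runs successfully.

    A missing binary, a non-zero exit code, a timeout, or empty output
    are all treated as "no usable GPU", which selects the CPU build.
    """
    if shutil.which("rocm-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["rocm-smi", "--showproductname"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


def parse_gfx_arch(text: str, fallback: str = FALLBACK_ARCH) -> str:
    """Extract the first gfx architecture token (e.g. gfx1151) from text."""
    match = re.search(r"\bgfx[0-9a-f]{3,5}\b", text)
    return match.group(0) if match else fallback
```

Keeping the parsing step separate from the `rocm-smi` call makes it testable on machines without ROCm installed.
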
The ROCm-built audio server (PR #18641) returns delta.audio with int16
PCM, while the pre-built CPU binary returns delta.audio_chunk with
float32 PCM. Use the field name to set the format explicitly instead of
fragile auto-detection probing that failed on quiet audio chunks.

Made-with: Cursor
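
A minimal sketch of the field-name approach, assuming the streamed delta arrives as a dictionary and treating the audio payload as opaque. The helper name `extract_audio` and the exact delta shape are hypothetical; only the two field names and their PCM formats come from the commit message.

```python
from typing import Optional, Tuple

# Field name -> PCM sample format, as stated in the commit message.
FIELD_TO_FORMAT = {
    "audio": "int16",          # ROCm-built audio server
    "audio_chunk": "float32",  # pre-built CPU binary
}


def extract_audio(delta: dict) -> Optional[Tuple[str, str]]:
    """Return (payload, pcm_format) for the first audio field present.

    Returns None when the delta carries no audio, or only an empty
    payload, so the caller can skip it.
    """
    for field, pcm_format in FIELD_TO_FORMAT.items():
        payload = delta.get(field)
        if payload:
            return payload, pcm_format
    return None
```

Because the format follows from the field name, quiet chunks no longer matter: nothing inspects the sample values.
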
Allow users to bypass ROCm GPU detection and force a CPU-only build
with make CPU=1. Useful for testing or when the GPU build is unwanted.

Tested on ROCm (gfx1151) and on CPU with CPU=1 override.

Made-with: Cursor
…at via env

Separate build directories (llama.cpp-rocm, llama.cpp-cpu) and binaries
(llama-server-rocm, llama-server-cpu) so switching between ROCm and CPU
no longer requires make clean. Both builds coexist on disk.

The Makefile passes AUDIO_PCM_FORMAT (int16 for ROCm, float32 for CPU)
to server.py via env var. server.py sends it to the browser as a config
message on websocket connect. The JS uses it directly — no data probing.

This resolves the audio distortion issues caused by the two audio server
binary versions returning different PCM encodings (int16 vs float32)
while both reporting format as "pcm".

Tested on ROCm (gfx1151) and CPU (make CPU=1) — audio works on both.

Made-with: Cursor
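
The per-backend settings and the format handshake from this commit might look like the sketch below. The directory, binary, and `AUDIO_PCM_FORMAT` names come from the commit message; the `float32` default, the config message shape, the little-endian byte order, and the function names are assumptions. In the PR the decoding happens in browser JavaScript; the Python `decode_pcm` here only illustrates why the two encodings must not be confused.

```python
import json
import os
import struct
from typing import List, Mapping

# Per-backend settings named in the commit message.
BACKENDS = {
    "rocm": {"build_dir": "llama.cpp-rocm", "binary": "llama-server-rocm",
             "pcm_format": "int16"},
    "cpu": {"build_dir": "llama.cpp-cpu", "binary": "llama-server-cpu",
            "pcm_format": "float32"},
}


def select_backend(cpu_override: bool, gpu_detected: bool) -> str:
    """CPU=1 wins over detection; otherwise use ROCm only if a GPU was found."""
    return "cpu" if cpu_override or not gpu_detected else "rocm"


def pcm_format_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Read AUDIO_PCM_FORMAT; the float32 default is an assumption."""
    fmt = env.get("AUDIO_PCM_FORMAT", "float32")
    if fmt not in ("int16", "float32"):
        raise ValueError(f"unsupported AUDIO_PCM_FORMAT: {fmt!r}")
    return fmt


def config_message(fmt: str) -> str:
    """Hypothetical shape of the config message sent on websocket connect."""
    return json.dumps({"type": "config", "pcm_format": fmt})


def decode_pcm(data: bytes, fmt: str) -> List[float]:
    """Convert raw little-endian PCM bytes to floats in [-1.0, 1.0)."""
    if fmt == "int16":
        count = len(data) // 2
        samples = struct.unpack(f"<{count}h", data[: count * 2])
        return [s / 32768.0 for s in samples]
    if fmt == "float32":
        count = len(data) // 4
        return list(struct.unpack(f"<{count}f", data[: count * 4]))
    raise ValueError(f"unsupported PCM format: {fmt!r}")
```

Decoding int16 bytes as float32 (or the reverse) yields values that are still numbers but bear no relation to the waveform, which is consistent with the distortion described above.
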
Server-side: log time-to-first-audio-byte (ASR + tool calling + TTS
first decode) and total end-to-end latency from ASR receive to last
TTS chunk.

Client-side: log TTFA in the browser console measuring from button
release to first audio chunk received, capturing the full user-perceived
latency including client overhead.

Also skip empty audio chunks to avoid createBuffer errors.

Made-with: Cursor
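
One way to structure the server-side measurement is a small tracker that is created when the ASR request is received and updated on every TTS chunk. The class name `LatencyTracker` and its methods are hypothetical, and skipping empty chunks on the server is an assumption (the commit mentions the skip in the context of browser `createBuffer` errors).

```python
import time
from typing import Callable, Optional


class LatencyTracker:
    """Track time-to-first-audio and total latency for one request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self.start = clock()  # moment the ASR input was received
        self.first_audio: Optional[float] = None
        self.last_audio: Optional[float] = None

    def on_audio_chunk(self, chunk: bytes) -> bool:
        """Record timing for a chunk; return False if it should be skipped."""
        if not chunk:
            return False  # empty chunk: nothing to play, nothing to time
        now = self._clock()
        if self.first_audio is None:
            self.first_audio = now
        self.last_audio = now
        return True

    def ttfa(self) -> Optional[float]:
        """Seconds from start to the first non-empty audio chunk."""
        if self.first_audio is None:
            return None
        return self.first_audio - self.start

    def total(self) -> Optional[float]:
        """Seconds from start to the last non-empty audio chunk."""
        if self.last_audio is None:
            return None
        return self.last_audio - self.start
```

Injecting the clock keeps the tracker deterministic under test; in production the default `time.perf_counter` is a monotonic clock suited to interval measurement.
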
ROCm <= 7.2 ships rocBLAS Tensile kernels for gfx1100/1101/1102/1150/
1151/1200/1201 but NOT gfx1153 (Ryzen AI 7 / Krackan). The audio
server's multimodal warmup (mmproj + vocoder + speaker tokenizer)
dispatches GEMM shapes that have no matching gfx1153 kernel and
segfaults at `common_init_from_params: warming up the model`.

Set HSA_OVERRIDE_GFX_VERSION=11.5.0 (gfx1150 — binary-compatible RDNA
3.5) as a recipe-line prefix on the `audioserver` target only. Applying
it globally is not an option: the same override crashes the tool model
(llama-server-rocm) instead, so the env var is intentionally NOT
exported and does not leak via `serve` to spawn_server's child process.

A `$(wildcard /opt/rocm*/lib/rocblas/library/*gfx1153*)` check makes
the override self-disable once a future ROCm release ships the missing
kernels — the audio server will then run with native gfx1153 dispatch
without further Makefile edits.

Made-with: Cursor
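
The same scoping idea, expressed in Python for illustration: build a private environment for the audio server process only, and add the override only while no gfx1153 rocBLAS kernels are found on disk. The PR does this in the Makefile; the function name `audio_server_env` and the explicit architecture check are assumptions.

```python
import glob
import os
from typing import Dict, Mapping

OVERRIDE_ARCH = "gfx1153"
OVERRIDE_VALUE = "11.5.0"  # gfx1150, per the commit message
KERNEL_GLOB = "/opt/rocm*/lib/rocblas/library/*gfx1153*"


def audio_server_env(
    arch: str,
    base_env: Mapping[str, str],
    kernel_glob: str = KERNEL_GLOB,
) -> Dict[str, str]:
    """Return a copy of base_env for the audio server process only.

    The override is added when the GPU is gfx1153 and no matching
    rocBLAS kernel files exist. base_env itself is never modified, so
    other child processes (such as the tool model) do not inherit it.
    """
    env = dict(base_env)
    if arch == OVERRIDE_ARCH and not glob.glob(kernel_glob):
        env["HSA_OVERRIDE_GFX_VERSION"] = OVERRIDE_VALUE
    return env
```

The returned dictionary would be passed as the `env` argument of `subprocess.Popen` when launching the audio server, leaving `os.environ` untouched for every other process.
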
@Paulescu
Collaborator

Hi @ThomasGmeinder,

What is the main intent of this PR?

Was the code not working on your AMD machine, so you needed to fix it? Or are these second-order optimizations to speed up inference?

Pau

@ThomasGmeinder
Contributor Author

ThomasGmeinder commented Apr 30, 2026

Hi Pau,
I added support for ROCm and AMD GPUs in PR #41, which was merged on 25 Feb 2026. The intent of this PR is to add improvements on the same branch, ROCm_support.

I have tested this on a wide range of Ryzen AI devices:
Krackan2e (4 CU iGPU), Strix (16 CU iGPU), and Strix Halo (40 CU iGPU).

Kind Regards, Thomas
