gfx1151: I have pytorch, pytorch-vision, and hipBLASLt working well enough to run GPT2 #244

scottt · 2025-03-23T07:49:50Z

scottt
Mar 23, 2025

I see others like https://github.com/jammm are also working on gfx1151

I have pytorch, pytorch-vision, and hipBLASLt working well enough to run GPT2 on the Strix Halo in an Asus Z13 here and thought others might want to take a look https://github.com/scottt/rocm-TheRock/commits/gfx1151/

(I'm running Bazzite Linux and the distro kernel)

To download the image from Dockerhub:

podman pull docker.io/scottt/therock:pytorch-vision-dev-f41
podman tag docker.io/scottt/therock:pytorch-vision-dev-f41 pytorch-vision-dev-f41
toolbox create -i pytorch-vision-dev-f41
toolbox enter pytorch-vision-dev-f41

Alternatively, to build the images from source:

git clone [email protected]:scottt/rocm-TheRock.git
cd rocm-TheRock
git switch gfx1151
podman build -t rocm-dev-f41 -f ./dockerfiles/pytorch-dev/rocm_fedora.Dockerfile .
podman build -t pytorch-dev-f41 -f dockerfiles/pytorch-dev/pytorch_dev_fedora.Dockerfile .
podman build -t pytorch-vision-dev-f41 -f dockerfiles/pytorch-dev/pytorch_vision_dev_fedora.Dockerfile .
toolbox create -i pytorch-vision-dev-f41
toolbox enter pytorch-vision-dev-f41

Running GPT2 through huggingface/transformers then appears to work:

pytorch-vision-dev-f41 $ uv pip install --system transformers
pytorch-vision-dev-f41 $ python gpt2.py

With the content of gpt2.py being:

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained("gpt2") #, torch_dtype=torch.float16, attn_implementation="flash_attention_2")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "def hello_world():"
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)
model.to(device)
generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)

r = tokenizer.batch_decode(generated_ids)[0]
print(r)

stellaraccident · 2025-03-23T15:33:57Z

stellaraccident
Mar 23, 2025
Maintainer

This is great! I'll have a closer look in the coming days and help you get this landed and in the ci.

1 reply

scottt Mar 23, 2025
Author

When you look at these Dockerfiles, you'll find that they abuse a type=cache mount and places the ROCm, pytorch source and build trees in there.

I did that due to the docker build cache invalidation problem mentioned in [Discussion] current source layout frequently invalidates docker build cache #243
I started out with ROCm, pytorch, and pytorch-vision building a in a multi-stage Dockerfile in the style of pytorch_dev_ubuntu_24.04.Dockerfile but added the cache mount and separated the Dockerfiles after building the ROCm stack one too many times on an Asus Z13 ;)

jammm · 2025-03-23T18:27:36Z

jammm
Mar 23, 2025
Collaborator

Awesome ! I’ll try this tonight

2 replies

jammm Mar 23, 2025
Collaborator

BTW you may have tagged the wrong jamm in your initial post ;)

scottt Mar 23, 2025
Author

Fixed by editing! ;)

scottt · 2025-03-23T20:15:29Z

scottt
Mar 23, 2025
Author

@jammm, compared to the mainline-snapshot-2025-03-13 tarball and current main branch, my build contains the following changes:

Build hipBLASLt with Enable gfx1103, gfx1150 and gfx1151 hipBLASLt#1766 applied (scottt@2059658)
- Out of curiosity, @powderluv, is this patch from @amd-mtrifuno the same as yours or are the two independently developed?
Patch and build Pytorch to allow using hipBLASLt on gfx1151 (scottt@ea1500c)
Fix build errors encountered along the way to building pytorch-vision
The Python wheels are in pytorch-vision-dev-f41:/opt
As the images are based on Fedora 41, the binaries won't run as-is on Ubuntu 22.04

2 replies

jammm Mar 23, 2025
Collaborator

Thanks! This is very useful.

powderluv Mar 24, 2025
Collaborator

Out of curiosity, @powderluv, is this patch from @amd-mtrifuno the same as yours or are the two independently developed?

yeah same. I just was getting help from them to land it cleanly.

scottt · 2025-03-23T20:51:25Z

scottt
Mar 23, 2025
Author

I've pushed my container images to Dockerhub at https://hub.docker.com/r/scottt/therock/tags and editing the post to reflect that.

0 replies

scottt · 2025-03-24T02:22:44Z

scottt
Mar 24, 2025
Author

I've pushed some Dockerfile failed to build from a clean slate fixes previous masked by my placing the source trees in a cache mount.

This version of my prebuild container image has a Pytorch wheel that uses hipBLASLt on gfx1151 https://hub.docker.com/repository/docker/scottt/therock/tags/pytorch-vision-dev-f41/sha256:78216545c4639935d1f22dcd80b7e60ef06488f66d87b3c35350e68f91922fa6

0 replies

jammm · 2025-03-25T22:37:53Z

jammm
Mar 25, 2025
Collaborator

@scottt I tried your container but the result of gpt2 script always ends up giving:

Memory access fault by GPU node-1 (Agent handle: 0x5a8227bfd5e0) on address 0x76517b820000. Reason: Page not present or supervisor privilege.

Oddly enough, if I run the script with AMD_LOG_LEVEL=3 or even 7, it gives one of the following outputs:
e.g., AMD_LOG_LEVEL=7 python gpt2.py 2> out.txt:

def hello_world():<|endoftext|>
or,
!!!!!!!!

And running something simple like torch.randn(10, device="cuda") gives all zeros. Something is definitely odd.. have to debug more

6 replies

jammm Mar 26, 2025
Collaborator

This could be likely. I'm on ubuntu 24.04.02 with kernel 6.11. I'll try upgrading to 6.14 and see if it works and report back.

jammm Mar 26, 2025
Collaborator

@scottt btw did you also install any amdgpu drivers? Or did you rely on the kernel+distro to provide one out of the box?

jammm Mar 26, 2025
Collaborator

@scottt it turns out the issue was the amdgpu kernel mode drivers (amdgpu-dkms-firmware and dkms) that were bundled with ubuntu 24.04.02 didn't work well with gfx1151.
I installed the latest public drivers by using amdgpu-install --usecase=dkms and that did the trick! gpt2.py works now. Will try other models on this thing.

scottt Mar 26, 2025
Author

I was just going to reply that I only used the distro kernel out of the box :)

jammm Mar 27, 2025
Collaborator

@scottt interesting. I'm not sure why I had to install the drivers myself in Ubuntu 24.04 case then.. weird

scottt · 2025-03-26T05:01:50Z

scottt
Mar 26, 2025
Author

I'v found that MIOpen built with current main branch options on gfx1151 would not work with torch.nn.Conv2d. I encountered this while trying:

stable-diffusion
language models with visual abilities on Huggingface
(or likely any neural network image processing code using Pytorch)

I have two patches that appears to successfully workaround the problem:

MIOpen: build without AI_KERNEL_TUNING and AI_IMMED_MODE_FALLBACK this overcomes MIOpen Error: tensor_shape_variable needs to be an array RuntimeError: miopenStatusUnknownError
- I cargo culted the build option changes from frugally-deep 0.16.0 appears to break kernel/model files MIOpen#3588 (comment) based on Googling
After this, torch.nn.Conv2d would work on float32 but produce an illegal opcode error for float16, which leads me to patching MIOpen: disable solver/conv_wino_fury_RxS for gfx115x to disable the conv_wino_fury_RxS assembly kernel
- MIOpen would then choose the naive_conv_ab_nonpacked_fwd_nchw_half_double_half kernel, which appears to work. Short torch.nn.Conv2d programs to reproduce the problem are in the linked patches

With these changes, I can now run AUTOMATIC1111/stable-diffusion-webui

A prompt of "Generate two tuxedo cats pushing an Asus Z13 laptop to off the table while it's compiling the MIOpen library" produces:

2 replies

jammm Mar 26, 2025
Collaborator

@powderluv @stellaraccident we probably need to poke MIOpen folks to look into enabling the missing conv2d kernels and AI Tuning for OOTB gfx1151 support.

jammm Mar 26, 2025
Collaborator

I can confirm these patches also help with running the SF3D model:

But the performance is quite slow, and VRAM usage is almost 3x than what it should be (6gb vs 17gb)
This is probably because of the disabling of the optimized MIOpen kernels due to lack of gfx1151 support.

scottt · 2025-03-26T07:35:47Z

scottt
Mar 26, 2025
Author

This image has my MIOpen fixes: https://hub.docker.com/layers/scottt/therock/pytorch-vision-dev-f41/images/sha256-3e80ce26fa02bf4a566ed693e1d6741cf8aaa742319f688e7c584d5a83bb47c8

0 replies

jammm · 2025-03-27T02:01:26Z

jammm
Mar 27, 2025
Collaborator

Couple TODOs:

CK should use wmma for gfx1151 https://github.com/ROCm/composable_kernel/blob/441343a23dbd92a7fe81a30a1e3ad39bc81b11c2/include/ck/utility/amd_wmma.hpp#L12-L13
Avoid MIOpen workarounds for better performance gfx1151: I have pytorch, pytorch-vision, and hipBLASLt working well enough to run GPT2 #244 (comment)
enable gfx1151 for aotriton (for flash attention impls)

2 replies

scottt Mar 27, 2025
Author

I'll work on Triton next.

jammm Mar 28, 2025
Collaborator

Re. MIOpen+CK - Apparently it fails to build with linker errors if you don't also compile with a gfx940/gfx942 in the arch flags alongside gfx1151 ROCm/ROCm#4030 (comment)

jammm · 2025-03-30T06:47:29Z

jammm
Mar 30, 2025
Collaborator

I got some massive perf improvements after compiling aotriton for gfx1151 so it uses the flash-attention/memory-efficient backends when running scaled_dot_product_attention:

before:

Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...
Average time: 180.45 ms
Average peak memory: 4728.06 MB
Average throughput: 761.92 GFLOP/s

after:

Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...                                                                                                                                 
Average time: 11.26 ms                                                                                                                                                                        
Average peak memory: 32.25 MB                                                                                                                                                                 
Average throughput: 12214.04 GFLOP/s

above benchmark script:
https://termbin.com/98rt

Using a couple real-world models:
stable-fast-3d image-to-mesh generation time: From 41 seconds to 1.4 seconds
sd3.5: (14-15)s/it to 2.17s/it (this is still affected by the conv fury stuff @scottt found.. need to fix)

29 replies

scottt Apr 12, 2025
Author

@antiagainst , Thanks for reaching out! Will definitely be sending Triton patches and questions your way.

scottt Apr 13, 2025
Author

@stellaraccident , I've successfully finished a partialbuild on Windows as documented here #409

scottt Apr 23, 2025
Author

@jammm , do you still have the scaled dot-product benchmark script from https://termbin.com/98rt? I wanted to recreate your measurement today but found that termbin.com had deleted its copy.

jammm Apr 23, 2025
Collaborator

Ah I might have lost that one :/ I do have an older version though. Modifying this should help:

import torch
import time
from torch.nn.functional import scaled_dot_product_attention, SDPBackend

###############################################################################
# Check for GPU
###############################################################################
if not torch.cuda.is_available():
    raise SystemExit("CUDA GPU is not available. Please run on a CUDA-enabled device.")

device = torch.device("cuda")
torch.cuda.init()  # Initialize CUDA context (optional, helps measure baseline)

###############################################################################
# Helper function for measuring one run
###############################################################################
def measure_op(op_func, warmup=3, total_runs=10):
    """
    op_func: a callable that runs the operation (including memory measurement)
             and returns (time_ms, peak_mem_MB, gflops_s).
    warmup: number of warm-up runs to discard.
    total_runs: total runs to do, including warmup.
    Returns: average_time_ms, average_peak_mem_MB, average_gflops_s over the runs after warm-up.
    """
    times = []
    mems = []
    flops = []

    for run_idx in range(total_runs):
        # Reset peak memory stats at the start of each run
        torch.cuda.reset_peak_memory_stats(device)
        
        t_ms, peak_mb, gf_s = op_func()
        
        if run_idx >= warmup:
            times.append(t_ms)
            mems.append(peak_mb)
            flops.append(gf_s)

    avg_time_ms = sum(times) / len(times)
    avg_mem_mb = sum(mems) / len(mems)
    avg_gf_s = sum(flops) / len(flops)
    return avg_time_ms, avg_mem_mb, avg_gf_s


###############################################################################
# 1) Define the Scaled Dot-Product Attention test
###############################################################################
def run_sdpa():
    # Configuration
    B, heads = 1, 8
    L = 8192
    E = 64
    S = L

    # Create random Q, K, V in half precision
    # We place them inside the function so each run re-allocates 
    # new memory (to measure peak memory usage properly).
    q = torch.randn(B, heads, L, E, device=device, dtype=torch.float16)
    k = torch.randn(B, heads, S, E, device=device, dtype=torch.float16)
    v = torch.randn(B, heads, S, E, device=device, dtype=torch.float16)

    # Start timing
    torch.cuda.synchronize()
    start_time = time.time()

    # Run scaled dot-product attention (Flash Attention backend)
    out = scaled_dot_product_attention(q, k, v, backend=SDPBackend.FLASH_ATTENTION)

    # Synchronize & end timing
    torch.cuda.synchronize()
    end_time = time.time()

    # Measure time
    time_ms = (end_time - start_time) * 1000.0

    # Peak memory usage (MB)
    peak_mem_bytes = torch.cuda.max_memory_allocated(device)
    peak_mem_mb = peak_mem_bytes / (1024**2)

    # Compute FLOPs for scaled dot-product attention:
    # Q*K^T -> 2 * B * heads * L * S * E
    # Attn*V -> 2 * B * heads * L * S * E
    # Total = 4 * B * heads * L * S * E
    flops = 4.0 * B * heads * L * S * E
    # Convert to GFLOPs/s
    flops_s = flops / (end_time - start_time)
    gflops_s = flops_s / 1e9

    return time_ms, peak_mem_mb, gflops_s

###############################################################################
# 2) Define the Conv2d test
###############################################################################
def run_conv2d():
    # Configuration
    N, Cin, Cout = 1, 3, 64
    H, W = 4096, 4096
    kernel_size = 7
    stride = 1
    padding = 3

    # Create Conv2d layer in half precision
    conv = torch.nn.Conv2d(Cin, Cout, kernel_size=kernel_size, 
                           stride=stride, padding=padding).to(device, dtype=torch.float16)
    x = torch.randn(N, Cin, H, W, device=device, dtype=torch.float16)

    # Start timing
    torch.cuda.synchronize()
    start_time = time.time()

    # Forward pass
    y = conv(x)

    # Synchronize & end timing
    torch.cuda.synchronize()
    end_time = time.time()

    # Measure time
    time_ms = (end_time - start_time) * 1000.0

    # Peak memory usage (MB)
    peak_mem_bytes = torch.cuda.max_memory_allocated(device)
    peak_mem_mb = peak_mem_bytes / (1024**2)

    # Compute FLOPs for Conv2d:
    # 2 * N * Cout * H_out * W_out * Cin * kH * kW
    N_out, Cout_out, H_out, W_out = y.shape
    Cin_out = conv.in_channels
    kH, kW = conv.kernel_size if isinstance(conv.kernel_size, tuple) else (conv.kernel_size, conv.kernel_size)
    flops = 2.0 * N_out * Cout_out * H_out * W_out * Cin_out * kH * kW
    flops_s = flops / (end_time - start_time)
    gflops_s = flops_s / 1e9

    return time_ms, peak_mem_mb, gflops_s

###############################################################################
# Run the measurements
###############################################################################
print("Benchmarking Scaled Dot-Product Attention (Flash) in FP16 ...")
sdpa_time, sdpa_mem, sdpa_gflops = measure_op(run_sdpa, warmup=3, total_runs=10)
print(f"Average time: {sdpa_time:.2f} ms")
print(f"Average peak memory: {sdpa_mem:.2f} MB")
print(f"Average throughput: {sdpa_gflops:.2f} GFLOP/s\n")

print("Benchmarking Conv2d in FP16 ...")
conv_time, conv_mem, conv_gflops = measure_op(run_conv2d, warmup=3, total_runs=10)
print(f"Average time: {conv_time:.2f} ms")
print(f"Average peak memory: {conv_mem:.2f} MB")
print(f"Average throughput: {conv_gflops:.2f} GFLOP/s")

scottt Apr 23, 2025
Author

Thanks @jammm!

jammm · 2025-04-03T01:13:04Z

jammm
Apr 3, 2025
Collaborator

Continuing the results from #244 (reply in thread),
Got stable-virtual-camera running (with torch.compile as well!)

Input image (from the sd3.5 output on gfx1151)

Output video (dolly-zoom-out camera trajectory, had to downscale the gif to be under the 10mb attachment limit)

mp4:
https://github.com/user-attachments/assets/016acef8-8856-4d15-975c-89219a06a3fa

Just had to make sure pytorch was compiled with LAPACK by installing libopenblas-dev before the pytorch build. Also had to build triton itself. For using torch.compile, I set this if to True https://github.com/Stability-AI/stable-virtual-camera/blob/main/demo_gr.py#L100

There's one caveat though - multiple times the GPU would crash out with segfaults like Memory access fault by GPU node-1 (Agent handle: 0x3ddbb700) on address 0x75b019362000. Reason: Page not present or supervisor privilege. - this would be completely random, so I suspect this is related to the amdgpu kernel driver being unstable on my end due to an older kernel or mangled driver install. I had to re-run multiple times until it generated a result till completion.

0 replies

scottt · 2025-04-07T22:00:34Z

scottt
Apr 7, 2025
Author

@stellaraccident , for the first part of the gfx1151 enablement work, could you take a look at merging #357 ?

I bumped the hipBLASLt submodule to include the gfx1151 support from upstream but had to update the patches in TheRock for how hipBLASLt finds tools, which was moderately painful.

5 replies

stellaraccident Apr 8, 2025
Maintainer

Thanks. I'll handle it tomorrow pdt unless if Marius gets to it first.

jammm Apr 10, 2025
Collaborator

@scottt thanks! looks like this is merged.
Will there be another PR for the rest of the changes including bumping pytorch to 2.7rc3 w/ aotriton?
Currently pytorch is already at 2.7.0rc8, and a full release is scheduled on April 23rd. Perhaps we could wait until stable 2.7.0 is cut and then land that instead.. though I may have to rework the patches.. problem is I switched away to Windows already :x

It shouldn't be too difficult to rebase those new patches though. Hopefully it might just work as-is. In the worst case I can try to do this from Windows once the closed bits land and we have the dependencies compiling.

scottt Apr 10, 2025
Author

@jammm , Yes! I've just filed the MIOpen one here #392

scottt Apr 10, 2025
Author

@jammm , I'm working on the pytorch pull request that should get TheRock to roughly the equivalent of your work enabling flash attention via aotriton.

Please write some notes on your Windows setup so I can crib from you :)

jammm Apr 10, 2025
Collaborator

@scottt thanks! and yes will definitely write up some notes here as I go through the setup. Waiting for the closed stuff to be opened before proceeding.

scottt · 2025-04-10T16:16:23Z

scottt
Apr 10, 2025
Author

@stellaraccident , could you take at merging #392 for the MIOpen portion of the gfx1151 work?

7 replies

jammm Apr 12, 2025
Collaborator

btw you answered my query on a different sub-thread at #244 (reply in thread) but no worries. I also missed that when I replied above XD

scottt Apr 12, 2025
Author

@stellaraccident , Downloading now. Will setup my Windows machine, try the native build and report back.

stellaraccident Apr 12, 2025
Maintainer

Cool. We'll put more time into it next week. It's just a toy without the blas libraries. We've got patches scattered around and this stuff has built in windows before internally... But it's a bit of a tangle we need to straighten. Anyway, photo finish getting it this far this week

stellaraccident Apr 14, 2025
Maintainer

This thread was about getting Linux working. There is another thread for native windows. I have a to-do on the AMD side to track down WSL support status.

stellaraccident Apr 17, 2025
Maintainer

I've sorted #392 and gotten it landed locally in TheRock and upstream PRs staged. The MIOpen team may want to discuss the ones that directly impacted gfx1151 support. I've coordinated offline with them on testing strategy so we get infra set up so this doesn't regress from here.

scottt · 2025-04-23T01:35:52Z

scottt
Apr 23, 2025
Author

@stellaraccident , @jammm , I've filed the PR for gfx1151 Pytorch work here: #449

6 replies

stellaraccident Apr 23, 2025
Maintainer

None yet. Sorry :/

jammm Apr 23, 2025
Collaborator

npnp, I'll be focusing on Windows for now anyway.

scottt Apr 24, 2025
Author

@jammm , now that #449 has landed (Yes!!!), I'll work on aotriton testing and the gfx1151 aotriton upstream patch ROCm/aotriton#91 a bit more before switching to Triton and aotriton on Windows.

scottt Apr 24, 2025
Author

The gfx1151 aotriton patch has been merged ROCm/aotriton#91

jammm Apr 24, 2025
Collaborator

Great work @scottt ! Thank you for pushing our patches there and testing ^^

stellaraccident · 2025-05-01T04:09:42Z

stellaraccident
May 1, 2025
Maintainer

FYI, we've identified a root cause for ROCm/MIOpen#3685 but don't have a fix yet

3 replies

scottt May 1, 2025
Author

Cool! Will definitely want to read up on that.

jammm May 1, 2025
Collaborator

Curious to know as well. Sounds like it's something related to the wrong kernel being executed maybe?

stellaraccident May 1, 2025
Maintainer

There's an incorrect hardware configuration limit being set in a low level library that is impacting the ability to launch workgroups that occupy the large number of CUs/VGPRs on this chip. Still working out specifics that can be shared.

erman-gurses · 2025-05-02T20:32:16Z

erman-gurses
May 2, 2025
Maintainer

While I am running the command below for gfx1151 target

sudo docker buildx build  --build-arg AMDGPU_TARGETS=gfx1151 --file dockerfiles/pytorch-dev/pytorch_dev_ubuntu_24.04.Dockerfile .

I got the error below:
error-05022025.txt

Let me know if there is a patch for that.

3 replies

stellaraccident May 2, 2025
Maintainer

Always good to reduce clicks and inline the key error:

/therock/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing.hpp:47:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
1367.6    47 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;

I think @scottt did some patches/work for the 3rd word stuff. Possibly something new or a patch missing.

scottt May 2, 2025
Author

@erman-gurses, add --pytorch-ref v2.7.0 to the options passed to ptbuild.py checkout locally for now. The gfx1151 patches are only merged for Pytorch 2.7 and ptbuild.py still checks out 2.6 by default.

erman-gurses May 2, 2025
Maintainer

@scottt The problem is solved by adding the flag - thank you!

bugbuster-dev · 2025-05-22T22:55:41Z

bugbuster-dev
May 22, 2025

any docs to integrate this to get it working with vllm?

8 replies

scottt May 23, 2025
Author

@bugbuster-dev , for ROCm, I'd use the pre-built tarballs here or build TheRock's main branch from source (larger time commitment). Then copy the rocm directory into a docker image and build pytorch, pytorch-vision, pytorch-audio from there.

ChiweiDing Jun 13, 2025

Hi @scottt I’m currently trying to build PyTorch with the ROCm tarball you provided:
therock-dist-linux-gfx1151-7.0.0rc20250612.tar.gz
My setup:

Platform: Ubuntu 24.04
GPU: AMD Strix Halo (gfx1151)
ROCm path: manually extracted to ~/projects/rocm-gfx1151
Environment: local build using PyTorch release/2.7 branch

I’m currently encountering multiple build issues during compilation (especially Gloo HIP related), and I wonder:
Which specific versions (or git commits) of the following packages are known to be compatible with this tarball?

pytorch
torchvision
torchaudio

Is there an official branch or example setup that you used to validate this tarball?
If this tarball is not yet meant for full training workloads, could you share its intended use case (inference only, specific models, etc)?

Thanks for your support!

stellaraccident Jun 13, 2025
Maintainer

There are two tracks for pytorch currently.

The work that @scottt and the community are doing (on this thread) is running pretty far ahead feature wise of what our official pytorch builds have enabled. From what I've seen, they are prioritizing a variety of genai inference cases.

The AMD side is working a bit more methodically in terms of locking in releasable assets with full CI (what is documented in RELEASES.md). In the fullness of time, both will be one, but we are taking the extra time for rocm library and CI enablement at each step. The current source of truth for the release pytorch builds is here:

TheRock/external-builds/pytorch/build_prod_wheels.py

Line 268 in 2d22ba3

"USE_GLOO": "OFF",

Specific to your question, we have an engineer working now to enable GLOO.

Additionally, you should be aware that the adjacent patches directory is kept up to date for the 2.7 backport patches we apply. Presently, pytorch head does not require any patches but we are quitting to stand up nightly builds off it in CI on all platforms in order to lock that in. Goal is to have patch free, full functionality for pytorch 2.8 and a backport set of patches for 2.7 for a time (at all times, we'd like to have official support for building head and latest stable).

I suspect that the non official builds are being done with the patch stack referenced above against 2.7, but they are moving really fast, and @scottt is the best source of information for the bleeding edge rig.

medioxor Aug 15, 2025

@scottt @bugbuster-dev any luck with getting vllm working?

lhl Sep 3, 2025

@medioxor if you're interested, I got a free weekend and got vLLM at least nominally up and running. It does require a custom torch build (with USE_DISTRIBUTED and USE_GLOO, disabling amdsmi and RSMI and some other animal sacrifices) atm: https://github.com/lhl/strix-halo-testing/tree/main/vllm

It may be easier in the near future (aotriton w/ gfx1151 compatibility requires torch 2.8+ and USE_DISTRIBUTED will also be in future TheRock pytorch builds). You might want to switch from nightly to developer builds.

bugbuster-dev · 2025-08-15T20:47:48Z

bugbuster-dev
Aug 15, 2025

i installed rocm therock nightly tarball therock-dist-linux-gfx1151-7.0.0rc20250714.tar.gz and rocm pytorch from https://d2awnip2yjpvqn.cloudfront.net/v2/gfx1151/ and tried to compile https://github.com/ROCm/vllm/ repo but build failed. when amd has added the gfx1151 support i assume they will provide the docker image on https://hub.docker.com/r/rocm/vllm-dev

…

On Fri, Aug 15, 2025 at 3:21 PM medioxor ***@***.***> wrote: @scottt <https://github.com/scottt> @bugbuster-dev <https://github.com/bugbuster-dev> any luck with getting vllm working? — Reply to this email directly, view it on GitHub <#244 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ARD6KQCSAJ3KFPUKEIXR2ML3NXNEXAVCNFSM6AAAAABZSYJAR6VHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTIMJRGY3TENQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

gfx1151: I have pytorch, pytorch-vision, and hipBLASLt working well enough to run GPT2 #244

Uh oh!

Uh oh!

scottt Mar 23, 2025

Replies: 18 comments · 76 replies

Uh oh!

stellaraccident Mar 23, 2025 Maintainer

Uh oh!

Uh oh!

scottt Mar 23, 2025 Author

Uh oh!

jammm Mar 23, 2025 Collaborator

Uh oh!

jammm Mar 23, 2025 Collaborator

Uh oh!

scottt Mar 23, 2025 Author

Uh oh!

Uh oh!

scottt Mar 23, 2025 Author

Uh oh!

jammm Mar 23, 2025 Collaborator

Uh oh!

powderluv Mar 24, 2025 Collaborator

Uh oh!

scottt Mar 23, 2025 Author

Uh oh!

scottt Mar 24, 2025 Author

Uh oh!

Uh oh!

jammm Mar 25, 2025 Collaborator

Uh oh!

Uh oh!

jammm Mar 26, 2025 Collaborator

Uh oh!

jammm Mar 26, 2025 Collaborator

Uh oh!

jammm Mar 26, 2025 Collaborator

Uh oh!

scottt Mar 26, 2025 Author

Uh oh!

jammm Mar 27, 2025 Collaborator

Uh oh!

Uh oh!

scottt Mar 26, 2025 Author

Uh oh!

jammm Mar 26, 2025 Collaborator

Uh oh!

jammm Mar 26, 2025 Collaborator

Uh oh!

scottt Mar 26, 2025 Author

Uh oh!

jammm Mar 27, 2025 Collaborator

Uh oh!

scottt Mar 27, 2025 Author

Uh oh!

jammm Mar 28, 2025 Collaborator

Uh oh!

Uh oh!

jammm Mar 30, 2025 Collaborator

Uh oh!

Uh oh!

scottt Apr 12, 2025 Author

Uh oh!

scottt Apr 13, 2025 Author

Uh oh!

scottt
Mar 23, 2025

Replies: 18 comments 76 replies

stellaraccident
Mar 23, 2025
Maintainer

scottt Mar 23, 2025
Author

jammm
Mar 23, 2025
Collaborator

jammm Mar 23, 2025
Collaborator

scottt Mar 23, 2025
Author

scottt
Mar 23, 2025
Author

jammm Mar 23, 2025
Collaborator

powderluv Mar 24, 2025
Collaborator

scottt
Mar 23, 2025
Author

scottt
Mar 24, 2025
Author

jammm
Mar 25, 2025
Collaborator

jammm Mar 26, 2025
Collaborator

jammm Mar 26, 2025
Collaborator

jammm Mar 26, 2025
Collaborator

scottt Mar 26, 2025
Author

jammm Mar 27, 2025
Collaborator

scottt
Mar 26, 2025
Author

jammm Mar 26, 2025
Collaborator

jammm Mar 26, 2025
Collaborator

scottt
Mar 26, 2025
Author

jammm
Mar 27, 2025
Collaborator

scottt Mar 27, 2025
Author

jammm Mar 28, 2025
Collaborator

jammm
Mar 30, 2025
Collaborator

scottt Apr 12, 2025
Author

scottt Apr 13, 2025
Author