
[AMDGPU] Document "relaxed buffer OOB mode", update HSA default #134734

Open: wants to merge 1 commit into `main`
35 changes: 35 additions & 0 deletions llvm/docs/AMDGPUUsage.rst
@@ -1136,6 +1136,41 @@ is conservatively correct for OpenCL.
other operations within the same address space.
======================= ===================================================

Relaxed Buffer OOB (Out Of Bounds) Mode
---------------------------------------

Instructions that load from or store to buffer resources (and thus, by extension,
buffer fat pointers and buffer strided pointers) generally implement handling for
out-of-bounds (OOB) memory accesses, including those that are partially OOB,
if the buffer resource has the required flags set.

When operating on more than 32 bits of data, the ``voffset`` used for the access
is range-checked for each 32-bit word independently. This check uses saturating
arithmetic and interprets the offset as an unsigned value.
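A minimal C++ sketch of the per-word check, assuming a word counts as in bounds
only when all four of its bytes fall below the buffer's size in bytes. The
function names and the ``bufferSize`` parameter are hypothetical, and the model
ignores descriptor details such as the stride; it only shows how unsigned,
saturating offset arithmetic behaves.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Saturating unsigned 32-bit addition: clamps to UINT32_MAX instead of
// wrapping around to a small value.
uint32_t saturatingAdd(uint32_t a, uint32_t b) {
  uint32_t sum = a + b;  // unsigned overflow wraps (well-defined in C++)
  return sum < a ? std::numeric_limits<uint32_t>::max() : sum;
}

// Hypothetical model of the per-word check: element i of the result is true
// when 32-bit word i of a `numWords`-word access starting at byte offset
// `voffset` lies entirely inside a buffer of `bufferSize` bytes.
// `voffset` is unsigned, so an offset of -4 arrives as 0xFFFFFFFC.
std::vector<bool> perWordInBounds(uint32_t bufferSize, uint32_t voffset,
                                  unsigned numWords) {
  assert(numWords > 0 && "an access covers at least one word");
  std::vector<bool> inBounds(numWords);
  for (unsigned i = 0; i < numWords; ++i) {
    uint32_t wordStart = saturatingAdd(voffset, 4u * i);
    // Compute the end in 64 bits so a saturated start is never in bounds.
    uint64_t wordEnd = uint64_t(wordStart) + 4;
    inBounds[i] = wordEnd <= bufferSize;
  }
  return inBounds;
}
```

With a 16-byte buffer, ``perWordInBounds(16, 8, 4)`` reports words 0 and 1 as
in bounds and words 2 and 3 as out of bounds, whereas
``perWordInBounds(16, uint32_t(-4), 4)`` reports every word as out of bounds:
the later word offsets saturate at ``0xFFFFFFFF`` instead of wrapping back into
the buffer.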

The behavior described above conflicts with the ABI requirements of certain graphics
APIs, which require out-of-bounds accesses to be handled strictly, so that accesses
that begin out of bounds but then reach in-bounds elements (such as loading a
``<4 x i32>`` beginning at offset ``-4``) still load the three in-bounds integers.

Suggested change (Contributor): "loading A" → "loading a"

Review comment (Contributor): Just out of curiosity, will it be `<undef, i32, i32, i32>` or `<i32, i32, i32, undef>`?

Reply (Contributor Author): `<0, i32, i32, i32>` if you're using whatever the strict Vulkan API is. Ordinarily, the left end gives you `<0, 0, 0, 0>`.
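For contrast, a sketch of the strict behavior applied to the ``<4 x i32>`` at
offset ``-4`` example. It assumes that out-of-bounds elements read as zero, as
the reply above describes for the strict Vulkan behavior; ``strictLoadV4I32`` is
a hypothetical name, and the byte vector stands in for memory behind a buffer
resource.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of strict, element-by-element semantics for a
// <4 x i32> load. Each i32 is bounds-checked on its own using a *signed*
// byte offset; out-of-bounds elements read as zero (an assumption here).
std::array<int32_t, 4> strictLoadV4I32(const std::vector<uint8_t> &buffer,
                                       int64_t byteOffset) {
  assert(buffer.size() <= UINT32_MAX && "sketch assumes a 32-bit buffer size");
  const int64_t size = static_cast<int64_t>(buffer.size());
  std::array<int32_t, 4> result{};  // zero-initialized
  for (int i = 0; i < 4; ++i) {
    int64_t start = byteOffset + 4 * i;
    // Only elements that lie entirely inside the buffer are loaded.
    if (start >= 0 && start + 4 <= size)
      std::memcpy(&result[i], buffer.data() + start, sizeof(int32_t));
  }
  return result;
}
```

For a 16-byte buffer holding the integers ``10, 20, 30, 40``, loading at
``byteOffset = -4`` yields ``{0, 10, 20, 30}``, whereas the per-word hardware
check treats the same access as entirely out of bounds.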


Similarly, buffer fat pointers permit operating on types such as ``<8 x i8>``, which
must be accessed (and bounds-checked) 4 bytes at a time. Non-word-aligned
accesses to such types from near the end of a buffer resource (such as starting
a load of an ``<8 x i8>`` from an offset of ``6`` on an 8-byte buffer) will treat
the first two bytes to be loaded or stored as out of bounds, even though, under
a strict interpretation of the bounds-checking semantics, they would be in bounds.

Suggested change (Contributor): "``<8xi8>``" → "``<8 x i8>``"
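The ``<8 x i8>`` example can be checked with a small C++ model that compares
byte-granular (strict) and word-granular (relaxed) bounds checks. This is a
hypothetical sketch, not how the backend or the hardware is implemented, and it
assumes that the 4-byte words are counted from the first byte of the access.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: which bytes of a `numBytes`-byte access starting at
// byte `offset` are treated as in bounds in a buffer of `bufferSize` bytes?
//   strict  (relaxed == false): every byte is checked individually.
//   relaxed (relaxed == true):  bytes are checked in 4-byte words, and a word
//                               that is even partly out of bounds is treated
//                               as entirely out of bounds.
// Offsets are plain unsigned byte offsets; saturation is not modeled.
std::vector<bool> bytesInBounds(uint32_t bufferSize, uint32_t offset,
                                uint32_t numBytes, bool relaxed) {
  assert((!relaxed || numBytes % 4 == 0) && "relaxed mode checks whole words");
  std::vector<bool> inBounds(numBytes);
  for (uint32_t i = 0; i < numBytes; ++i) {
    if (relaxed) {
      // Start of the 4-byte word containing byte i of the access.
      uint64_t wordStart = uint64_t(offset) + (i / 4) * 4;
      inBounds[i] = wordStart + 4 <= bufferSize;
    } else {
      inBounds[i] = uint64_t(offset) + i < bufferSize;
    }
  }
  return inBounds;
}
```

For the 8-byte buffer and offset ``6``, ``bytesInBounds(8, 6, 8, false)`` marks
the first two bytes as in bounds, while ``bytesInBounds(8, 6, 8, true)`` marks
all eight as out of bounds. With a word-aligned offset such as ``4`` (and this
8-byte buffer), both modes agree that only the first four bytes are in bounds.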

Avoiding these violations of strict bounds-checking semantics for buffer resources
requires emitting less-vectorized code. If this strict conformance
is not required, the target feature ``relaxed-buffer-oob-mode`` should be enabled
(using ``-mcpu``, ``-offload-arch``, or ``-mattr``).

Suggested change (Contributor): "Ifthis" → "If this"

Review comment (Contributor): How does the `--offload-arch` work? Like `--offload-arch=gfx1200:relaxed-oob-buffer-mode+`? I think we only allow specific subtarget features to be appended to the target ID.


``relaxed-buffer-oob-mode`` permits the out-of-bounds status of an unaligned memory
access through a buffer resource to propagate to nearby elements, causing them to be
treated as out of bounds as well.

``relaxed-buffer-oob-mode`` is **enabled** on HSA targets by default to preserve
compute performance and existing ABI expectations.

LLVM IR Intrinsics
------------------

5 changes: 5 additions & 0 deletions llvm/docs/ReleaseNotes.md
@@ -92,6 +92,11 @@ Changes to the AMDGPU Backend

* Bump the default `.amdhsa_code_object_version` to 6. ROCm 6.3 is required to run any program compiled with COV6.

* Turn on strict buffer OOB checking on non-AMDHSA OSs. This improves the correctness
of buffer accesses in some cases at the cost of performance for programs that do not
contain unaligned out-of-bounds accesses. The old behavior may be restored with the
`relaxed-buffer-oob-mode` feature.

Changes to the ARM Backend
--------------------------

3 changes: 2 additions & 1 deletion llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
@@ -71,7 +71,8 @@ GCNSubtarget &GCNSubtarget::initializeSubtargetDependencies(const Triple &TT,
// Turn on features that HSA ABI requires. Also turn on FlatForGlobal by
// default
if (isAmdHsaOS())
-FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,";
+FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,"
+          "+relaxed-buffer-oob-mode,";

FullFS += "+enable-prt-strict-null,"; // This is overridden by a disable in FS
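The trailing comment on `+enable-prt-strict-null` relies on later entries in the feature string overriding earlier ones, which is also how the test change below gets strict behavior on `amdgcn-amd-amdhsa` with `-mattr=-relaxed-buffer-oob-mode`. A simplified model of that last-entry-wins resolution follows, assuming user-supplied features are appended after these defaults; `isFeatureEnabled` is a hypothetical helper, not LLVM's `SubtargetFeatures` API.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: scans a comma-separated feature string such as
// "+trap-handler,+relaxed-buffer-oob-mode,-relaxed-buffer-oob-mode" and
// reports whether `feature` ends up enabled. Later entries override earlier
// ones; a feature that never appears is treated as disabled.
bool isFeatureEnabled(const std::string &featureString,
                      const std::string &feature) {
  assert(!feature.empty() && feature[0] != '+' && feature[0] != '-');
  bool enabled = false;
  std::istringstream in(featureString);
  std::string entry;
  while (std::getline(in, entry, ',')) {
    if (entry.size() < 2 || (entry[0] != '+' && entry[0] != '-'))
      continue;  // skip empty or malformed entries
    // Compare the entry's name (everything after the +/- sign).
    if (entry.compare(1, std::string::npos, feature) == 0)
      enabled = entry[0] == '+';
  }
  return enabled;
}
```

Under this model, the HSA defaults enable `relaxed-buffer-oob-mode`, and appending `-relaxed-buffer-oob-mode` after them turns it back off.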

@@ -1,5 +1,5 @@
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=+relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=-relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
