diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index d1535960a0257..9ca86aaa95b44 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1136,6 +1136,41 @@ is conservatively correct for OpenCL.
 other operations within the same address space.
 ======================= ===================================================
 
+Relaxed Buffer OOB (Out Of Bounds) Mode
+---------------------------------------
+
+Instructions that load from or store to buffer resources (and thus, by extension,
+buffer fat pointers and buffer strided pointers) generally implement handling for
+out of bounds (OOB) memory accesses, including those that are partially OOB,
+if the buffer resource has the required flags set.
+
+When operating on more than 32 bits of data, the ``voffset`` used for the access
+will be range-checked for each 32-bit word independently. This check uses saturating
+arithmetic and interprets the offset as an unsigned value.
+
+The behavior described above conflicts with the ABI requirements of certain graphics
+APIs that require out of bounds accesses to be handled strictly, so that accesses
+that begin out of bounds but then reach in-bounds elements (such as loading a
+``<4 x i32>`` beginning at offset ``-4``) still load the three in-bounds integers.
+
+Similarly, buffer fat pointers permit operating on types such as ``<8 x i8>``, which
+must be accessed (and bounds-checked) 4 bytes at a time. Non-word-aligned
+accesses to such types from near the end of a buffer resource (such as starting
+a load of an ``<8 x i8>`` from an offset of ``6`` on an 8-byte buffer) will treat
+the initial two bytes to be loaded/stored as out of bounds, even though, under
+a strict interpretation of the bounds-checking semantics, they would be in bounds.
+
+These violations of strict bounds-checking semantics for buffer resources require
+the use of less-vectorized code to ensure correctness. If this strict conformance
+is not required, the target feature ``relaxed-buffer-oob-mode`` should be enabled
+(using ``-mcpu``, ``-offload-arch`` or ``-mattr``).
+
+``relaxed-buffer-oob-mode`` permits the out of bounds status of an unaligned memory
+access through a buffer resource to propagate to nearby elements, so that they are treated as out of bounds as well.
+
+``relaxed-buffer-oob-mode`` is **enabled** on HSA targets by default to preserve
+compute performance and existing ABI expectations.
+
 LLVM IR Intrinsics
 ------------------
 
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index 58cf71b947083..411c469d32b09 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -92,6 +92,11 @@ Changes to the AMDGPU Backend
 * Bump the default `.amdhsa_code_object_version` to 6. ROCm 6.3 is required to run
   any program compiled with COV6.
 
+* Turn on strict buffer OOB checking on non-AMDHSA OSs. This improves the correctness
+  of buffer accesses in some cases at the cost of performance for programs that do not
+  contain unaligned out-of-bounds accesses. The old behavior may be restored with the
+  `relaxed-buffer-oob-mode` feature.
+
 Changes to the ARM Backend
 --------------------------
 
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp b/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
index 53f5c1efd14eb..1bd2230b626ee 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.cpp
@@ -71,7 +71,8 @@ GCNSubtarget &GCNSubtarget::initializeSubtargetDependencies(const Triple &TT,
   // Turn on features that HSA ABI requires. Also turn on FlatForGlobal by
   // default
   if (isAmdHsaOS())
-    FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,";
+    FullFS += "+flat-for-global,+unaligned-access-mode,+trap-handler,"
+              "+relaxed-buffer-oob-mode,";
 
   FullFS += "+enable-prt-strict-null,"; // This is overridden by a disable in FS
 
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
index ede2e4066c263..01239b9946e64 100644
--- a/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
+++ b/llvm/test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
@@ -1,5 +1,5 @@
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=+relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
-; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-RELAXED %s
+; RUN: opt -mtriple=amdgcn-amd-amdhsa -passes=load-store-vectorizer -mattr=-relaxed-buffer-oob-mode -S -o - %s | FileCheck --check-prefixes=CHECK,CHECK-OOB-STRICT %s
 
 target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-ni:7"
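
The two bounds-checking behaviors described in the AMDGPUUsage.rst hunk can be compared with a small standalone model. This is only a sketch under stated assumptions, not code from the patch, LLVM, or any driver: the helper names (`strictMask`, `perWordMask`, `inBoundsBytes`) are hypothetical, `numRecords` is assumed to be the buffer size in bytes, "strict" is assumed to mean a per-byte check with 32-bit wraparound, and the per-word check is assumed to require the whole 32-bit word to fit in the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative model only: these helpers are hypothetical and exist neither in
// LLVM nor in any driver. numRecords is assumed to be the buffer size in bytes.

// Strict semantics (assumed): every byte is checked on its own. The 32-bit add
// wraps, so an access starting at offset -4 reaches byte 0 of the buffer at i == 4.
inline std::vector<bool> strictMask(uint32_t offset, uint32_t sizeBytes,
                                    uint32_t numRecords) {
  std::vector<bool> mask(sizeBytes, false);
  for (uint32_t i = 0; i < sizeBytes; ++i) {
    uint32_t addr = offset + i; // unsigned wraparound is well defined in C++
    mask[i] = addr < numRecords;
  }
  return mask;
}

// Per-word semantics (assumed): the offset is unsigned and each 32-bit word is
// checked independently. Widening to 64 bits stands in for saturating
// arithmetic: an offset that would overflow 32 bits never compares as in bounds.
inline std::vector<bool> perWordMask(uint32_t offset, uint32_t sizeBytes,
                                     uint32_t numRecords) {
  assert(sizeBytes % 4 == 0 && "model only handles whole 32-bit words");
  std::vector<bool> mask(sizeBytes, false);
  for (uint32_t w = 0; w < sizeBytes / 4; ++w) {
    uint64_t start = static_cast<uint64_t>(offset) + 4ull * w;
    bool wordInBounds = start + 4 <= numRecords; // the whole word must fit
    for (uint32_t b = 0; b < 4; ++b)
      mask[4 * w + b] = wordInBounds;
  }
  return mask;
}

// Number of bytes the model treats as in bounds.
inline long inBoundsBytes(const std::vector<bool> &mask) {
  return static_cast<long>(std::count(mask.begin(), mask.end(), true));
}
```

Under these assumptions, `strictMask(6, 8, 8)` marks the first two bytes of the `<8 x i8>` example as in bounds while `perWordMask(6, 8, 8)` marks none. For the `<4 x i32>` example (offset `0xFFFFFFFC`, that is `-4`, with a 16-byte access on a 16-byte buffer), the strict model keeps 12 bytes in bounds and the per-word model keeps none.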