Why does cuda::atomic::store(memory_order_seq_cst) generate a relaxed store instead of a release store? #3827

admbbs · 2025-02-16T05:48:52Z

admbbs
Feb 16, 2025

I noticed that CCCL implements cuda::atomic::store(memory_order_seq_cst) as a fence.sc followed by a relaxed store. This can be verified with this code snippet.

But the ASPLOS_2019 PTX Memory Model paper states in section 4.2 that a release store is necessary:

One particular mapping required extra attention: .release annotations are not redundant with a leading fence.sc, even though they may seem to be.

Are there any new developments on this, or is it an implementation error?

Thanks a lot!

——————————
Update: I just found that a subsequent store release in the same thread is no longer treated as part of the release sequence.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0982r1.html

Does this have anything to do with this topic?

Answered by gonzalobg

gonzalobg · 2025-02-18T13:43:49Z

gonzalobg
Feb 18, 2025
Collaborator

Very good question!

First, libcu++ atomics currently rely on implementation details (EDIT: the CUDA 13.0 PTX Atomics ABI docs enable any SW to make use of this) which, on currently supported platforms, enable libcu++ to lower:

- sequentially-consistent stores to fence.sc; st.relaxed; instead of fence.sc; st.release;.
- sequentially-consistent rmws to fence.sc; atom.acquire; instead of fence.sc; atom.acq_rel;.

libcu++ is closely tied to the implementation (CUDA Toolkit, compiler, driver, hw), and if the above changes, we'll update it accordingly.

Second, you are totally right that the current expansion is not correct according to the model published in the ASPLOS ’19 paper, or the PTX Atomics ABI which is what we require external SW to follow. We’ve actually considered this a bug in the ASPLOS ’19 memory model and the ABI for a while, and although we haven’t gotten to it yet, we intend to update the model formalism to reflect the fact that the mapping with the relaxed store is sound in practice.
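For intuition only, the shape of this lowering has a host-side analogue in ISO C++: a seq_cst fence followed by a relaxed store still publishes earlier writes to a thread whose acquire load reads the stored value. The sketch below uses `std::atomic` rather than `cuda::atomic`, and the function name is hypothetical. It illustrates the C++ fence rule, not the PTX model, and it does not address the subtlety raised in the ASPLOS '19 paper.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Host-side analogue only (an assumption for illustration): std::atomic
// stands in for cuda::atomic, and "fence + relaxed store" mirrors the
// shape of `fence.sc; st.relaxed;`. It is not the libcu++ implementation.
int publish_with_fence_then_relaxed_store() {
    int payload = 0;            // plain, non-atomic data to publish
    std::atomic<int> flag{0};   // publication flag
    int observed = -1;

    std::thread writer([&] {
        payload = 42;                                         // earlier write
        std::atomic_thread_fence(std::memory_order_seq_cst);  // leading fence
        flag.store(1, std::memory_order_relaxed);             // relaxed store
    });

    std::thread reader([&] {
        // Spin until an acquire load observes the relaxed store.
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        // The fence synchronizes with the acquire load that read the store,
        // so the write to payload happens-before this read.
        observed = payload;
    });

    writer.join();
    reader.join();
    return observed;
}
```

Calling `publish_with_fence_then_relaxed_store()` returns 42 under the C++ memory model, because the ordering of the earlier write comes from the fence rather than from a release annotation on the store itself.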

5 replies

admbbs Feb 18, 2025
Author

Thanks for this marvelous answer. It looks like the GPU has ended up with an atomic-operations mapping that quite closely resembles the one POWER has.

https://www.cl.cam.ac.uk/%7Epes20/cpp/cpp0xmappings.html

gonzalobg Aug 11, 2025
Collaborator

@admbbs as of CUDA 13.0 the PTX Atomics ABI [0] has been updated to enable any SW (e.g. LLVM's NVPTX backend) to leverage libcu++'s more optimal lowering.

[0] https://docs.nvidia.com/cuda/ptx-writers-guide-to-interoperability/atomic-abi.html

admbbs Aug 12, 2025
Author

thanks for the update!

admbbs Aug 12, 2025
Author

I notice an interesting detour in the history of this doc as to the mapping of seq-cst atomic load and store:

| CUDA version | sc load mapping | sc store mapping |
| --- | --- | --- |
| 12.8.1 | fence.sc.; ld.acquire.; | fence.sc.; st.release.; |
| 12.9.0 | fence.sc.; ld.relaxed.; | fence.sc.; st.relaxed.; |
| 13.0.0 | fence.sc.; ld.acquire.; | fence.sc.; st.relaxed.; |

and two things I find quite confusing:

- Why does the sc load mapping go back to fence.sc + ld.acq in 13.0.0 rather than stay at fence.sc + ld.rlx as in 12.9.0?
- Why does the sc store mapping stay at fence.sc + st.rlx in 13.0.0 rather than go back to fence.sc + st.rel, as the sc load mapping did?

gonzalobg Aug 12, 2025
Collaborator

The mappings in 12.9.0 are incorrect; they should have matched the 13.0.0 mappings. A patch is in progress.
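As a compact summary of the version history discussed in the last two comments, the documented mappings can be written down as data. This is only a sketch: the struct, array, and function names are hypothetical, the PTX strings are copied verbatim from the table in this thread, and the `flagged_incorrect` field records only what the reply above says about the 12.9.0 row.

```cpp
#include <cassert>
#include <cstring>

// One row of the seq_cst mapping table quoted in this thread.
struct SeqCstMapping {
    const char* cuda_version;  // docs version the row was taken from
    const char* load;          // documented lowering of a seq_cst load
    const char* store;         // documented lowering of a seq_cst store
    bool flagged_incorrect;    // true only for the row called incorrect above
};

// Strings are copied as they appear in the thread, not from the docs.
static const SeqCstMapping kDocumentedMappings[] = {
    {"12.8.1", "fence.sc.; ld.acquire.;", "fence.sc.; st.release.;", false},
    {"12.9.0", "fence.sc.; ld.relaxed.;", "fence.sc.; st.relaxed.;", true},
    {"13.0.0", "fence.sc.; ld.acquire.;", "fence.sc.; st.relaxed.;", false},
};

// Returns the row for a docs version, or nullptr if the version is not listed.
const SeqCstMapping* find_mapping(const char* version) {
    for (const SeqCstMapping& row : kDocumentedMappings) {
        if (std::strcmp(row.cuda_version, version) == 0) {
            return &row;
        }
    }
    return nullptr;
}
```

With this table, `find_mapping("13.0.0")` and `find_mapping("12.8.1")` agree on the load mapping and differ only in the store mapping, which is the asymmetry the question asks about.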
