I suspect that the incorrect usage of cuda::memcpy_async has led to uncoalesced global/shared accesses. #5155
Unanswered
Time-Limit
asked this question in
libcu++
Replies: 1 comment
-
cuda::memcpy_async(B_sm_dst, B_global_src, cuda::aligned_size_t<16>(1), pipeline);
In this line of code, must |
-
I'm a beginner in CUDA and am practicing SGEMM. When I try to optimize data transfer using cuda::pipeline and cuda::memcpy_async, I run into the Uncoalesced Global/Shared Accesses issue. I suspect that my usage is incorrect, and I hope someone can help me identify the problem. Thanks very much!

This image is a message from NVIDIA Nsight Compute.

The following code does not use cuda::memcpy_async or cuda::pipeline. All memory accesses are coalesced: each thread loads a float4 (16 bytes), so a warp of 32 threads requests 32 × 16 = 512 bytes, which is 16 sectors of 32 bytes per request.
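The original listing is not reproduced here. As a rough illustration of the access pattern described (one float4 per thread at consecutive indices), a minimal sketch might look like the following. The kernel name, the block size kThreads, and the host helper run_sync_copy are all hypothetical and are not taken from the poster's SGEMM code.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // hypothetical block size, not from the post

// Each thread loads one float4 (16 bytes) at a consecutive index, so a warp
// reads 32 * 16 = 512 contiguous bytes, i.e. 16 sectors of 32 bytes.
__global__ void copy_float4_sync(const float4* __restrict__ src,
                                 float4* __restrict__ dst) {
    __shared__ float4 tile[kThreads];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = src[i];  // synchronous global -> shared copy
    __syncthreads();
    dst[i] = tile[threadIdx.x];  // write back so the host can check the data
}

// Hypothetical host helper: launches the kernel and verifies the round trip.
bool run_sync_copy() {
    const int n = 4 * kThreads;
    const size_t bytes = n * sizeof(float4);
    std::vector<float4> in(n), out(n);
    for (int i = 0; i < n; ++i) {
        const float f = static_cast<float>(i);
        in[i] = make_float4(f, f + 0.25f, f + 0.5f, f + 0.75f);
    }

    float4 *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) { cudaFree(d_src); return false; }
    bool ok = cudaMemcpy(d_src, in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    copy_float4_sync<<<n / kThreads, kThreads>>>(d_src, d_dst);
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(out.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int i = 0; ok && i < n; ++i) {
        ok = out[i].x == in[i].x && out[i].y == in[i].y &&
             out[i].z == in[i].z && out[i].w == in[i].w;
    }
    cudaFree(d_src);
    cudaFree(d_dst);
    return ok;
}
```

Calling run_sync_copy() from main and profiling the kernel in Nsight Compute is one way to check the sectors-per-request figure for this pattern on a given GPU.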

The following code uses cuda::memcpy_async and cuda::pipeline. The index used by each thread does not change, but there are 24 sectors per request.

This is the complete code.
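The poster's complete code is not reproduced here. The sketch below is not that code and does not diagnose the 24-sectors-per-request result. It only illustrates, under stated assumptions, one way a per-thread float4 copy is commonly written with a thread-scope cuda::pipeline. The kernel name, kThreads, and the helper run_async_copy are hypothetical. The byte count passed to cuda::aligned_size_t<16> is sizeof(float4), since to my understanding of the libcu++ documentation that size should be a multiple of the alignment.

```cuda
#include <cuda/pipeline>
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // hypothetical block size, not from the post

__global__ void copy_float4_async(const float4* __restrict__ src,
                                  float4* __restrict__ dst) {
    __shared__ float4 tile[kThreads];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread-scope pipeline: each thread tracks its own asynchronous copy.
    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

    pipe.producer_acquire();
    // Copy 16 bytes; both pointers are 16-byte aligned and the size is a
    // multiple of 16, which is what aligned_size_t<16> promises.
    cuda::memcpy_async(&tile[threadIdx.x], &src[i],
                       cuda::aligned_size_t<16>(sizeof(float4)), pipe);
    pipe.producer_commit();

    pipe.consumer_wait();  // wait for this thread's copy to land
    __syncthreads();       // make the tile visible to the whole block
    dst[i] = tile[threadIdx.x];
    pipe.consumer_release();
}

// Hypothetical host helper: launches the kernel and verifies the round trip.
bool run_async_copy() {
    const int n = 4 * kThreads;
    const size_t bytes = n * sizeof(float4);
    std::vector<float4> in(n), out(n);
    for (int i = 0; i < n; ++i) {
        const float f = static_cast<float>(i);
        in[i] = make_float4(f, f + 0.25f, f + 0.5f, f + 0.75f);
    }

    float4 *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) { cudaFree(d_src); return false; }
    bool ok = cudaMemcpy(d_src, in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    copy_float4_async<<<n / kThreads, kThreads>>>(d_src, d_dst);
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(out.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int i = 0; ok && i < n; ++i) {
        ok = out[i].x == in[i].x && out[i].y == in[i].y &&
             out[i].z == in[i].z && out[i].w == in[i].w;
    }
    cudaFree(d_src);
    cudaFree(d_dst);
    return ok;
}
```

On GPUs older than compute capability 8.0 the copy is expected to fall back to an ordinary synchronous path, so profiling results for this sketch may differ between architectures.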