Implement CUDA backend for parallel cuda::std::for_each #5610
Force-pushed from 9237589 to 2dad112

/ok to test
🟨 CI finished in 2h 19m: Pass: 74%/210 | Total: 2d 21h | Avg: 19m 45s | Max: 2h 02m | Hits: 65%/210361
Modifications in project or dependencies?

| +/- | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | CCCL Packaging |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 210)
| # | Runner |
|---|---|
| 128 | linux-amd64-cpu16 |
| 23 | windows-amd64-cpu16 |
| 17 | linux-amd64-gpu-l4-latest-1 |
| 12 | linux-arm64-cpu16 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 10 | linux-amd64-gpu-h100-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
Force-pushed from 2dad112 to 04640ea

/ok to test 04640ea
🟩 CI finished in 2h 22m: Pass: 100%/210 | Total: 4d 21h | Avg: 33m 34s | Max: 2h 02m | Hits: 60%/324676
Force-pushed from 04640ea to 501d5d7
for_each implementation in libcu++
Force-pushed from cdf48f8 to 2b9f630
Force-pushed from 6dece7c to bba027b
Force-pushed from bba027b to 98f8d25
Force-pushed from 98f8d25 to c1863aa
Force-pushed from c1863aa to fb8d162
```cpp
static_assert(__always_false_v<_Policy>,
              "__pstl_dispatch: CUDA backend of cuda::std::for_each_n requires at least random access iterators");
return ::cuda::std::for_each_n(::cuda::std::move(__first), __orig_n, ::cuda::std::move(__func));
```
Q: Do we want to remove the serial fallback here?
That would not compile; the compiler would complain about a missing return. The static_assert fires in case of a bad instantiation, but we want the rest of the function, including the return, to go through.
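For reference, a minimal sketch of the dependent-false idiom in play here, assuming `__always_false_v` is a variable template along these lines; the surrounding function is illustrative, not the actual code:

```cpp
#include <cstdint>

// Dependent on _Policy, so the assertion is only evaluated at instantiation
// time; a plain `static_assert(false, ...)` would fire unconditionally.
template <std::uint32_t _Policy>
inline constexpr bool __always_false_v = false;

template <std::uint32_t _Policy, class _Iter, class _Size, class _Func>
_Iter __unsupported_for_each_n(_Iter __first, _Size __n, _Func __func)
{
  static_assert(__always_false_v<_Policy>,
                "CUDA backend of cuda::std::for_each_n requires at least random access iterators");
  // The serial fallback doubles as the return statement: without it the
  // compiler would complain about a missing return, as noted above.
  for (; __n > 0; --__n, ++__first)
  {
    __func(*__first);
  }
  return __first;
}
```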
gevtushenko left a comment:
Have a few suggestions below, but otherwise looks good! What would help is having some documentation on the policy and dispatch system. If you could write a developer guide as a follow-up PR, that'd be great.
```cpp
};

//! @brief Extracts the execution backend from the stored _Policy
template <uint32_t _Policy>
```
suggestion: it feels like we are missing some other concept here. We have policy: how code should run (sequenced, parallel, unsequenced, etc.) and backend: where code should run (cuda, omp, tbb, etc.). In this template parameter, we have "something" that's policy + backend, which is also called policy. This makes it hard to understand what exactly we are working with. The only information that disambiguates the two is uint32 vs uint8, but it's hard to spot. If you could come up with some different term for policy + backend, that'd be great.
I thought about this a ton and did not find a truly better name. What I would say is that there is a difference between the execution policy and what I would consider the whole policy: the former is the standard's classification of how code should run, while the latter is how we implement that for the different backends. So yes, the terms are close, but I did not find a better name.
```cpp
# include <cuda/std/__cccl/prologue.h>

_CCCL_BEGIN_NAMESPACE_CUDA_STD_EXECUTION
```
question (@srinivasyadav18): this is an algorithm implementation, should this implementation detail be part of cuda::std:: instead of cuda::std::execution?
I believe this is the right place, given that the execution policies are part of cuda::std::execution. This is also more the backend-specific implementation than the algorithm itself.
Force-pushed from fb8d162 to ffb96c1
Force-pushed from ffb96c1 to dd66c04
bernhardmgruber left a comment:
I am fine with leaving some questions on the execution policy design open and waiting until we gather more experience. Those are internal types. The public APIs LGTM and the implementation seems reasonable.
Force-pushed from dd66c04 to b4226f8
We currently tag our execution policies with just an enumeration that represents the standard execution policies. However, this is not sufficient for our use cases, because we also want to pass along the execution backend and the memory_direction of the algorithm. This changes our policies so that they take an unsigned integer instead of an enumeration, and adds facilities to set and get the respective properties.
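To illustrate the idea, here is a hypothetical packing scheme in the spirit of this change; the enum names, field widths, and offsets are invented for this sketch and do not reflect the actual libcu++ encoding:

```cpp
#include <cstdint>

// How code may run (mirrors the standard execution policies).
enum class __execution_policy : std::uint8_t { __seq, __par, __par_unseq, __unseq };
// Where code runs.
enum class __backend : std::uint8_t { __serial, __cuda };
// Whether the algorithm touches host or device memory.
enum class __memory_direction : std::uint8_t { __host, __device };

// Pack all three properties into the single uint32_t the policies now carry.
constexpr std::uint32_t __make_policy(__execution_policy __p, __backend __b, __memory_direction __m)
{
  return static_cast<std::uint32_t>(__p) | (static_cast<std::uint32_t>(__b) << 8)
       | (static_cast<std::uint32_t>(__m) << 16);
}

// Facilities to get the respective properties back out.
constexpr __execution_policy __get_execution_policy(std::uint32_t __pol)
{
  return static_cast<__execution_policy>(__pol & 0xFF);
}
constexpr __backend __get_backend(std::uint32_t __pol)
{
  return static_cast<__backend>((__pol >> 8) & 0xFF);
}
constexpr __memory_direction __get_memory_direction(std::uint32_t __pol)
{
  return static_cast<__memory_direction>((__pol >> 16) & 0xFF);
}
```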
Force-pushed from b4226f8 to 7802f25
🥳 CI Workflow Results: 🟩 Finished in 9h 35m: Pass: 100%/166 | Total: 8d 21h | Max: 5h 09m | Hits: 22%/281928
This provides a design for a dispatch layer for parallel algorithms in libcu++. We leverage the enumeration stored within the execution policies to determine the user-chosen backend and, if possible, dispatch to it. If no backend has a suitable implementation of the parallel algorithm, we fall back to the serial implementation.
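A minimal sketch of what such a dispatch could look like, assuming the packed-policy getters from the sketch above and a hypothetical __cuda_backend_for_each_n entry point; both names are illustrative, not the actual implementation:

```cpp
#include <cstdint>
#include <iterator>
#include <type_traits>

// Hypothetical CUDA backend entry point, assumed to be implemented elsewhere.
template <class _Iter, class _Size, class _Func>
_Iter __cuda_backend_for_each_n(_Iter __first, _Size __n, _Func __func);

template <std::uint32_t _Policy, class _Iter, class _Size, class _Func>
_Iter __pstl_dispatch_for_each_n(_Iter __first, _Size __n, _Func __func)
{
  using _Category = typename std::iterator_traits<_Iter>::iterator_category;
  if constexpr (__get_backend(_Policy) == __backend::__cuda
                && std::is_base_of_v<std::random_access_iterator_tag, _Category>)
  {
    // The user-chosen backend has a suitable implementation: dispatch to it.
    return __cuda_backend_for_each_n(__first, __n, __func);
  }
  else
  {
    // Otherwise fall back to the serial implementation.
    for (; __n > 0; --__n, ++__first)
    {
      __func(*__first);
    }
    return __first;
  }
}
```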