-
Notifications
You must be signed in to change notification settings - Fork 295
Implement the new tuning API for DeviceRadixSort
#6767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
DeviceRadixSort
d16f5b0 to
3263d2c
Compare
miscco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love how much this cleans everything up
| #include <cuda/__ptx/instructions/get_sreg.h> | ||
| #include <cuda/std/__algorithm/max.h> | ||
| #include <cuda/std/__algorithm/min.h> | ||
| #include <cuda/std/__functional/operations.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #include <cuda/std/__functional/operations.h> | |
| #include <cuda/std/__type_traits/is_void.h> |
| typename DecomposerT = identity_decomposer_t> | ||
| __launch_bounds__(int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) | ||
| __launch_bounds__(ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).single_tile_policy.block_threads, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this popping up more often, do we want to have
| typename DecomposerT = identity_decomposer_t> | |
| __launch_bounds__(int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1) | |
| __launch_bounds__(ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).single_tile_policy.block_threads, 1) | |
| typename DecomposerT = identity_decomposer_t, | |
| typename RadixSortPolicy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10})> | |
| __launch_bounds__(RadixSortPolicy.single_tile_policy.block_threads, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice try. @griwes suggested the same recently. The problem is that now we leak the current arch we compile for into the symbol name of the kernel and we get launch failures :)
We can mitigate this if we replace __launch_bounds__ by inline PTX to emit a pragma. But that's for another day.
| _CCCL_API constexpr friend bool | ||
| operator!=(const radix_sort_histogram_policy& lhs, const radix_sort_histogram_policy& rhs) | ||
| { | ||
| return !(lhs == rhs); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: We could guard all operator!= on #if _CCCL_STD_VER <= 2017
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we just wait until C++20 and refactor.
| // TODO(bgruber): implement | ||
| return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^
66dfd43 to
b0fe51f
Compare
0d56d8d to
56c437d
Compare
|
/ok to test 56c437d |
😬 CI Workflow Results🟥 Finished in 1h 29m: Pass: 25%/98 | Total: 1d 02h | Max: 59m 36s | Hits: 80%/39461See results here. |
c583277 to
172c115
Compare
WIP
cub.test.device.radix_sort_keys.lid_0.key_bits_16passesif constexpron the onesweep algorithm in the dispatcherarch_policies_from_hubFixes: #6676