@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Nov 24, 2025

WIP

  • cub.test.device.radix_sort_keys.lid_0.key_bits_16 passes
  • CCCL.C tests pass
  • Retain the if constexpr on the onesweep algorithm in the dispatcher
  • tests for supplying a custom policy hub
  • implement arch_policies_from_hub

Fixes: #6676

@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bernhardmgruber bernhardmgruber changed the title Implement the new tuning API for DeviceRadixSort Implement the new tuning API for DeviceRadixSort Nov 24, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 24, 2025
Contributor

@miscco miscco left a comment

I love how much this cleans everything up

#include <cuda/__ptx/instructions/get_sreg.h>
#include <cuda/std/__algorithm/max.h>
#include <cuda/std/__algorithm/min.h>
#include <cuda/std/__functional/operations.h>
Contributor

Suggested change
#include <cuda/std/__functional/operations.h>
#include <cuda/std/__type_traits/is_void.h>

Comment on lines 337 to +338
  typename DecomposerT = identity_decomposer_t>
- __launch_bounds__(int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1)
+ __launch_bounds__(ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).single_tile_policy.block_threads, 1)
Contributor

@miscco miscco Nov 25, 2025

I see this popping up more often, do we want to have

Suggested change
typename DecomposerT = identity_decomposer_t>
__launch_bounds__(int(ChainedPolicyT::ActivePolicy::SingleTilePolicy::BLOCK_THREADS), 1)
__launch_bounds__(ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).single_tile_policy.block_threads, 1)
typename DecomposerT = identity_decomposer_t,
typename RadixSortPolicy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10})>
__launch_bounds__(RadixSortPolicy.single_tile_policy.block_threads, 1)

Contributor Author

Nice try. @griwes suggested the same recently. The problem is that now we leak the current arch we compile for into the symbol name of the kernel and we get launch failures :)

We can mitigate this if we replace __launch_bounds__ by inline PTX to emit a pragma. But that's for another day.
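The symbol-name problem described in this reply can be sketched in plain C++. The sketch assumes that the host pass and each device pass evaluate the arch-dependent expression to different values; all type and function names below are hypothetical stand-ins, not CUB's actual API. It contrasts querying the policy inside the entity (one identity for every arch) with passing the arch-dependent value as a template argument (a different identity, and hence a different mangled name, per arch).

```cpp
#include <cassert>
#include <type_traits>

// All names below are hypothetical stand-ins, not CUB's actual types.
struct single_tile_policy { int block_threads; };
struct radix_sort_policy { single_tile_policy single_tile; };

// Stateless selector that can be called in a constant expression,
// mirroring the shape of ArchPolicies{}(arch) in the diff.
struct arch_policies
{
  constexpr radix_sort_policy operator()(int arch) const
  {
    return arch >= 90 ? radix_sort_policy{{512}} : radix_sort_policy{{256}};
  }
};

// Variant A: only the selector type is a template argument. The policy is
// looked up inside, so the entity has the same identity for every arch.
template <typename ArchPolicies>
struct kernel_query_inside
{
  static constexpr int block_threads(int arch)
  {
    return ArchPolicies{}(arch).single_tile.block_threads;
  }
};

// Variant B: the arch-dependent value is itself a template argument, so it
// becomes part of the entity's identity (and of its mangled name).
template <typename ArchPolicies, int BlockThreads>
struct kernel_policy_param
{};

// Arch stands in for the value a given compilation pass would see
// (assumed here: 0 for the host pass, 90 for one device pass).
template <int Arch>
using variant_a = kernel_query_inside<arch_policies>;
template <int Arch>
using variant_b =
  kernel_policy_param<arch_policies, arch_policies{}(Arch).single_tile.block_threads>;

constexpr bool same_identity_a()
{
  return std::is_same<variant_a<0>, variant_a<90>>::value;
}
constexpr bool same_identity_b()
{
  return std::is_same<variant_b<0>, variant_b<90>>::value;
}

static_assert(same_identity_a(), "variant A names one entity for both passes");
static_assert(!same_identity_b(), "variant B names two different entities");
```

In variant B the two passes would refer to differently named entities, which is consistent with the launch failures mentioned above; variant A keeps the name stable at the cost of repeating the policy query at each use.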

Comment on lines +134 to +138
_CCCL_API constexpr friend bool
operator!=(const radix_sort_histogram_policy& lhs, const radix_sort_histogram_policy& rhs)
{
return !(lhs == rhs);
}
Contributor

Nitpick: We could guard all operator!= on #if _CCCL_STD_VER <= 2017

Contributor Author

Or we just wait until C++20 and refactor.
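For reference, a minimal sketch of what the guarded variant could look like. The struct and its members are hypothetical, and `__cplusplus` is used as a stand-in for the `_CCCL_STD_VER <= 2017` check. It relies on the C++20 rule that `a != b` can be rewritten to `!(a == b)` when only `operator==` is declared, so the explicit `operator!=` is only needed up to C++17.

```cpp
#include <cassert>

// Hypothetical policy type, loosely modeled on radix_sort_histogram_policy.
struct histogram_policy
{
  int block_threads;
  int items_per_thread;

  friend constexpr bool operator==(const histogram_policy& lhs, const histogram_policy& rhs)
  {
    return lhs.block_threads == rhs.block_threads && lhs.items_per_thread == rhs.items_per_thread;
  }

// Stand-in for `#if _CCCL_STD_VER <= 2017`: from C++20 on, the compiler
// rewrites `a != b` to `!(a == b)`, so this overload becomes redundant.
#if __cplusplus <= 201703L
  friend constexpr bool operator!=(const histogram_policy& lhs, const histogram_policy& rhs)
  {
    return !(lhs == rhs);
  }
#endif
};

// Compiles in C++17 (explicit operator!=) and in C++20 (rewritten operator==).
constexpr bool policies_differ(const histogram_policy& a, const histogram_policy& b)
{
  return a != b;
}

static_assert(policies_differ({256, 8}, {512, 8}), "different block sizes compare unequal");
static_assert(!policies_differ({256, 8}, {256, 8}), "identical policies compare equal");
```

The guard saves a few lines per policy type in C++20 builds but adds a preprocessor branch to each of them, which is the trade-off behind waiting for a C++20 baseline instead.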

Comment on lines +105 to +106
// TODO(bgruber): implement
return {};
Contributor

^^

@bernhardmgruber
Contributor Author

/ok to test 56c437d

@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 29m: Pass: 25%/98 | Total: 1d 02h | Max: 59m 36s | Hits: 80%/39461

See results here.

Development

Successfully merging this pull request may close these issues.

Implement a MVP for cub::DeviceRadixSort using the new tuning API