Releases: NVIDIA/cub
CUB 2.1.0
Breaking Changes
- #553: Deprecate the
CUB_USE_COOPERATIVE_GROUPSmacro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.
New Features
- #359: Add new
DeviceBatchMemcpyalgorithm. - #565: Add
DeviceMergeSort::StableSortKeysCopyAPI. Thanks to David Wendt (@davidwendt) for this contribution. - #585: Add SM90 tuning policy for
DeviceRadixSort. Thanks to Andy Adinets (@canonizer) for this contribution. - #586: Introduce a new mechanism to opt-out of compiling CDP support in CUB algorithms by defining
CUB_DISABLE_CDP. - #589: Support 64-bit indexing in
DeviceReduce. - #607: Support 128-bit integers in radix sort.
Bug Fixes
- #547: Resolve several long-running issues resulting from using multiple versions of CUB within the same process. Adds an inline namespace that encodes CUB version and targeted PTX architectures.
- #562: Fix bug in
BlockShuffleresulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution. - #564: Fix bug in
BlockRadixRankwhen used with blocks that are not a multiple of 32 threads. - #579: Ensure that all threads in the logical warp participate in the index-shuffle for
BlockRadixRank. Thanks to Andy Adinets (@canonizer) for this contribution. - #582: Fix reordering in CUB member initializer lists.
- #589: Fix
DeviceSegmentedSortwhen used withboolkeys. - #590: Fix CUB’s CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this contribution.
- #592: Fix overflow in
DeviceReduce. - #598: Fix
DeviceRunLengthEncodewhen the first item is aNaN. - #611: Fix
WarpScanExclusivefor vector types.
Other Enhancements
- #537: Add detailed and expanded version of a [CUB developer overview](https://github.com/NVIDIA/cub/blob/main/docs/developer_overview.md).
- #549: Fix
BlockReduceRakingdocs for non-commutative operations. Thanks to Tobias Ribizel (@upsj) for this contribution. - #606: Optimize CUB’s decoupled-lookback implementation.
CUB 2.0.1
Other Enhancements
- Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and not required for correctness, and a warning will be emitted if this happens.
CUB 1.17.2
Summary
CUB 1.17.2 is a minor bugfix release.
- #547: Introduce an annotated inline namespace to prevent issues with
collisions and mismatched kernel configurations across libraries. The new
namespace encodes the CUB version and target SM architectures.
CUB 2.0.0
Summary
The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.
This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.
Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
Breaking Changes
- #448 Add libcu++ dependency (v1.8.0+).
- #448: The following macros are no longer defined by default. They can be re-enabled by defining
CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.CUB_IS_HOST_CODE: Replace withNV_IF_TARGET.CUB_IS_DEVICE_CODE: Replace withNV_IF_TARGET.CUB_INCLUDE_HOST_CODE: Replace withNV_IF_TARGET.CUB_INCLUDE_DEVICE_CODE: Replace withNV_IF_TARGET.
- #486: CUB’s CUDA Runtime support macros have been updated to support
NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.- Old behavior:
- RDC enabled: Defined to
__host__ __device__ - RDC not enabled:
- NVCC host pass: Defined to
__host__ __device__ - NVCC device pass: Defined to
__host__
- NVCC host pass: Defined to
- RDC enabled: Defined to
- New behavior:
- RDC enabled: Defined to
__host__ __device__ - RDC not enabled: Defined to
__host__
- RDC enabled: Defined to
- Old behavior:
CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:- RDC enabled: Macro is defined.
- RDC not enabled:
- NVCC host pass: Macro is defined.
- NVCC device pass: Macro is not defined.
CUB_RDC_ENABLED: New macro, may be combined withNV_IF_TARGETto replace most usages ofCUB_RUNTIME_ENABLED. Behavior:- RDC enabled: Macro is defined.
- RDC not enabled: Macro is not defined.
- #509: A compile-time error is now emitted when a
__device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).- Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
- When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
- Use a named function object with a
__device__-only implementation ofoperator(). - Use a
__host__ __device__lambda. - Use
cuda::proclaim_return_type(Added in libcu++ 1.9.0)
- Use a named function object with a
- #509: Use the result type of the binary reduction operator for accumulating intermediate results in the
DeviceReducealgorithm, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatch*Reducelayer have changed:DispatchReduce:- Now accepts accumulator type as last parameter.
- Now accepts initializer type instead of output iterator value type.
- Constructor now accepts
initas initial type instead of output iterator value type.
DispatchSegmentedReduce:- Accepts accumulator type as last parameter.
- Accepts initializer type instead of output iterator value type.
- Thread operators now accept parameters using different types:
Equality,Inequality,InequalityWrapper,Sum,Difference,Division,Max,ArgMax,Min,ArgMin. ThreadReducenow accepts accumulator type and uses a different type forprefix.
- #511: Use the result type of the binary operator for accumulating intermediate results in the
DeviceScan,DeviceScanByKey, andDeviceReduceByKeyalgorithms, following guidance from http://wg21.link/P2322R6.- This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
- In addition to the behavioral changes, the interfaces for the
Dispatchlayer have changed:DispatchScannow accepts accumulator type as a template parameter.DispatchScanByKeynow accepts accumulator type as a template parameter.DispatchReduceByKeynow accepts accumulator type as the last template parameter.
- #527: Deprecate the
debug_synchronousflags on device algorithms.- This flag no longer has any effect. Define
CUB_DEBUG_SYNCduring compilation to enable these checks. - Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.
- This flag no longer has any effect. Define
New Features
- #514: Support
__halfinDeviceHistogram. - #516: Add support for single-threaded invocations of
WarpReduce. - #516: Use
__reduce_add_synchardware acceleration forWarpReduceon supported architectures.
Bug Fixes
- #481: Fix the device-wide radix sort implementations to simply copy the input to the output when
begin_bit == end_bit. - #487: Fix
DeviceHistogram::Evenfor a variety of edge cases:- Bin ids are now correctly computed when mixing different types for
SampleTandLevelT. - Bin ids are now correctly computed when
LevelTis an integral type and the number of levels does not evenly divide the level range.
- Bin ids are now correctly computed when mixing different types for
- #508: Ensure that
temp_storage_bytesis properly set in theAdjacentDifferenceCopydevice algorithms. - #508: Remove excessive calls to the binary operator given to the
AdjacentDifferenceCopydevice algorithms. - #533: Fix debugging utilities when RDC is disabled.
Other Enhancements
- #448: Removed special case code for unsupported CUDA architectures.
- #448: Replace several usages of
__CUDA_ARCH__with<nv/target>to handle host/device code divergence. - #448: Mark unused PTX arch parameters as legacy.
- #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
- #480: Add
CUB_DISABLE_BF16_SUPPORTto avoid including thecuda_bf16.hheader or using the__nv_bfloat16type. - #486: Add debug log messages for post-kernel debug synchronizations.
- #490: Clarify documentation for in-place usage of
DeviceScanalgorithms. - #494: Clarify documentation for in-place usage of
DeviceHistogramalgorithms. - #495: Clarify documentation for in-place usage of
DevicePartitionalgorithms. - #499: Clarify documentation for in-place usage of
Device*Sortalgorithms. - #500: Clarify documentation for in-place usage of
DeviceReducealgorithms. - #501: Clarify documentation for in-place usage of
DeviceRunLengthEncodealgorithms. - #503: Clarify documentation for in-place usage of
DeviceSelectalgorithms. - #518: Fix typo in
WarpMergeSortdocumentation. - #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.
CUB 1.17.1
Summary
CUB 1.17.1 is a minor bugfix release.
CUB 1.17.0
CUB 1.17.0
Summary
CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.
Known Issues
“Run-to-run” Determinism Broken
Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.
Bug Fixes
- #444: Fixed
DeviceSelectto work with discard iterators and mixed input/output types. - #452: Fixed install issue when
CMAKE_INSTALL_LIBDIRcontained nested directories. Thanks to @robertmaynard for this contribution. - #462: Fixed bug that produced incorrect results from
DeviceSegmentedSorton sm_61 and sm_70. - #464: Fixed
DeviceSelect::Flaggedso that flags are normalized to 0 or 1. - #468: Fixed overflow issues in
DeviceRadixSortgivennum_itemsclose to 2^32. Thanks to @canonizer for this contribution. - #498: Fixed compiler regression in
BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.
Other Enhancements
- #445: Remove device-sync in
DeviceSegmentedSortwhen launched via CDP. - #449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
- #450:
BlockDiscontinuity: Replaced recursive-template loop unrolling with#pragma unroll. Thanks to @kshitij12345 for this contribution. - #451: Replaced the deprecated
TexRefInputIteratorimplementation with an alias toTexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB. - #456:
BlockAdjacentDifference: Replaced recursive-template loop unrolling with#pragma unroll. Thanks to @kshitij12345 for this contribution. - #466:
cub::DeviceAdjacentDifferenceAPI has been updated to use the newOffsetTdeduction approach described in #212. - #470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.
CUB 1.16.0
Summary
CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort now supports large (64-bit indexed) input data. A new UniqueByKey algorithm has been added to DeviceSelect. DeviceAdjacentDifference provides new SubtractLeft and SubtractRight functionality.
This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference algorithms. Many bugfixes and documentation updates are also included.
64-bit Offsets in DeviceRadixSort Public APIs
Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in #212.
The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort family of algorithms. Thanks to @canonizer for contributing this functionality.
New DeviceSelect::UniqueByKey Algorithm
cub::DeviceSelect now provides a UniqueByKey algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.
New DeviceAdjacentDifference Algorithms
The new cub::DeviceAdjacentDifference interface, also ported from Thrust, provides SubtractLeft and SubtractRight algorithms as CUB kernels.
Deprecation Notices
Synchronous CUDA Dynamic Parallelism Support
A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.
Deprecated Traits
CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:
| Deprecated CUB Trait | Replacement STL Trait |
|---|---|
| cub::If | std::conditional |
| cub::Equals | std::is_same |
| cub::IsPointer | std::is_pointer |
| cub::IsVolatile | std::is_volatile |
| cub::RemoveQualifiers | std::remove_cv |
| cub::EnableIf | std::enable_if |
CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.
Misnamed cub::BlockAdjacentDifference APIs
The algorithms in cub::BlockAdjacentDifference have been deprecated, as their names did not clearly describe their intent. The FlagHeads method is now SubtractLeft, and FlagTails has been replaced by SubtractRight.
Breaking Changes
- #331: Deprecate the misnamed
BlockAdjacentDifference::FlagHeadsandFlagTailsmethods. Use the newSubtractLeftandSubtractRightmethods instead. - #364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in
<type_traits>as described above.
New Features
- #331: Port the
thrust::adjacent_differencekernel and expose it ascub::DeviceAdjacentDifference. - #405: Port the
thrust::unique_by_keykernel and expose it ascub::DeviceSelect::UniqueByKey. Thanks to @zasdfgbmn for this contribution.
Enhancements
- #340: Allow 64-bit offsets in
DeviceRadixSortpublic APIs. Thanks to @canonizer for this contribution. - #400: Implement a significant reduction in
DeviceMergeSortcompilation time. - #415: Support user-defined
CMAKE_INSTALL_INCLUDEDIRvalues in Thrust’s CMake install rules. Thanks for @robertmaynard for this contribution.
Bug Fixes
- #381: Fix shared memory alignment in
dyn_smemexample. - #393: Fix some collisions with the
min/maxmacros defined inwindows.h. - #404: Fix bad cast in
util_device. - #410: Fix CDP issues in
DeviceSegmentedSort. - #411: Ensure that the
nv_exec_check_disablepragma is only used on nvcc. - #418: Fix
-Wsizeof-array-divwarning on gcc 11. Thanks to @robertmaynard for this contribution. - #420: Fix new uninitialized variable warning in
DiscardIteratoron gcc 10. - #423: Fix some collisions with the
smallmacro defined inwindows.h. - #426: Fix some issues with version handling in CUB’s CMake packages.
- #430: Remove documentation for
DeviceSpmvparameters that are absent from public APIs. - #432: Remove incorrect documentation for
DeviceScanalgorithms that guaranteed run-to-run deterministic results for floating-point addition.
CUB 1.15.0
Summary
CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.
The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.
cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:
int *d_intermediate_result = ...;
intermediate_kernel<<<blocks, threads>>>(d_intermediate_result, // output
arg1, // input
arg2); // input
// Wrap the intermediate pointer in a FutureValue -- no need to explicitly
// sync when both kernels are stream-ordered. The pointer is read after
// the ExclusiveScan kernel starts executing.
cub::FutureValue<int> init_value(d_intermediate_result);
cub::DeviceScan::ExclusiveScan(d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
cub::Sum(),
init_value,
num_items);Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.
Deprecation Notices
A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).
This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.
Breaking Changes
- #305: The template parameters of
cub::DispatchScanhave changed to support the newcub::FutureValuehelper. More details under "New Features". - #377: Remove broken
operator->()fromcub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
New Features
- #305: Add overloads to
cub::DeviceScanalgorithms that allow the output of a previous kernel to be used asinitial_valuewithout explicit synchronization. See the newcub::FutureValuehelper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution. - #354: Add
cub::BlockRunLengthDecodealgorithm. Thanks to Elias Stehle (@elstehle) for this contribution. - #357: Add
cub::DeviceSegmentedSort, an optimized version ofcub::DeviceSegmentedSortwith improved load balancing and small array performance. - #376: Add "by key" overloads to
cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
Bug Fixes
- #349: Doxygen and unused variable fixes.
- #363: Maintenance updates for the new
cub::DeviceMergeSortalgorithms. - #382: Fix several
-Wconversionwarnings. Thanks to Matt Stack (@matt-stack) for this contribution. - #388: Fix debug assertion on MSVC when using
cub::CachingDeviceAllocator. - #395: Support building with
__CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
CUB 1.14.0 (NVIDIA HPC SDK 21.9)
CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.
This release provides the often-requested merge sort algorithm, ported from the thrust::sort implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort and cub::BlockMergeSort algorithms.
The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX/CUB_NS_POSTFIX) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE macro will embed both thrust:: and cub:: symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub). See cub/util_namespace.cuh for details.
Breaking Changes
- #350: When the
CUB_NS_[PRE|POST]FIXmacros are set,CUB_NS_QUALIFIERmust also be defined to the fully qualified CUB namespace (e.g.#define CUB_NS_QUALIFIER ::foo::cub). Note that this is handled automatically when using the new[THRUST_]CUB_WRAPPED_NAMESPACEmechanism.
New Features
- #322: Ported the merge sort algorithm from Thrust;
cub::BlockMergeSortandcub::DeviceMergeSortare now available. - #326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.
Bug Fixes
- #160, #163, #352: Fixed several bugs in
cub::DeviceSpmvand added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions. - #328: Fixed error handling bug and incorrect debugging output in
cub::CachingDeviceAllocator. Thanks to Felix Kallenborn for this contribution. - #335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
- #351: Fixed some errors in the
cub::DeviceHistogramdocumentation.
Enhancements
- #348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.
CUB 1.13.1 (CUDA Toolkit 11.5)
CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.
This release provides a new hook for embedding the cub:: namespace inside
a custom namespace. This is intended to work around various issues related to
linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX and
CUB_NS_POSTFIX macros already provided this capability; this update provides a
simpler mechanism that is extended to and integrated with Thrust. Simply define
THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and
cub:: will be placed inside the new namespace. Using different wrapped
namespaces for each shared library will prevent issues like those reported in
NVIDIA/thrust#1401.
New Features
- #326: Add
THRUST_CUB_WRAPPED_NAMESPACEhooks.