+# CUB 2.1.0
+
+## Breaking Changes
+
+- NVIDIA/cub#553: Deprecate the `CUB_USE_COOPERATIVE_GROUPS` macro, as all supported CTK
+  distributions provide CG. This macro will be removed in a future version of CUB.
+
+## New Features
+
+- NVIDIA/cub#359: Add new `DeviceBatchMemcpy` algorithm.
+- NVIDIA/cub#565: Add `DeviceMergeSort::StableSortKeysCopy` API. Thanks to David Wendt (@davidwendt)
+  for this contribution.
+- NVIDIA/cub#585: Add SM90 tuning policy for `DeviceRadixSort`. Thanks to Andy Adinets (@canonizer)
+  for this contribution.
+- NVIDIA/cub#586: Introduce a new mechanism to opt out of compiling CDP support in CUB algorithms by
+  defining `CUB_DISABLE_CDP`.
+- NVIDIA/cub#589: Support 64-bit indexing in `DeviceReduce`.
+- NVIDIA/cub#607: Support 128-bit integers in radix sort.
+
+## Bug Fixes
+
+- NVIDIA/cub#547: Resolve several long-running issues resulting from using multiple versions of CUB
+  within the same process. Adds an inline namespace that encodes CUB version and targeted PTX
+  architectures.
+- NVIDIA/cub#562: Fix bug in `BlockShuffle` resulting from an invalid thread offset. Thanks to
+  @sjfeng1999 for this contribution.
+- NVIDIA/cub#564: Fix bug in `BlockRadixRank` when used with blocks that are not a multiple of 32
+  threads.
+- NVIDIA/cub#579: Ensure that all threads in the logical warp participate in the index-shuffle
+  for `BlockRadixRank`. Thanks to Andy Adinets (@canonizer) for this contribution.
+- NVIDIA/cub#582: Fix reordering in CUB member initializer lists.
+- NVIDIA/cub#589: Fix `DeviceSegmentedSort` when used with `bool` keys.
+- NVIDIA/cub#590: Fix CUB's CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this
+  contribution.
+- NVIDIA/cub#592: Fix overflow in `DeviceReduce`.
+- NVIDIA/cub#598: Fix `DeviceRunLengthEncode` when the first item is a `NaN`.
+- NVIDIA/cub#611: Fix `WarpScanExclusive` for vector types.
+
+## Other Enhancements
+
+- NVIDIA/cub#537: Add detailed and expanded version of
+  a [CUB developer overview](https://github.com/NVIDIA/cub/blob/main/docs/developer_overview.md).
+- NVIDIA/cub#549: Fix `BlockReduceRaking` docs for non-commutative operations. Thanks to Tobias
+  Ribizel (@upsj) for this contribution.
+- NVIDIA/cub#606: Optimize CUB's decoupled-lookback implementation.
+
+# CUB 2.0.1
+
+## Other Enhancements
+
+- Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and are not
+  required for correctness; a warning is emitted when a sync is skipped.
+
 # CUB 2.0.0
 
 ## Summary
@@ -28,7 +81,7 @@ clarifying which operations can and cannot be performed in-place.
   - `CUB_IS_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
   - `CUB_INCLUDE_HOST_CODE`: Replace with `NV_IF_TARGET`.
   - `CUB_INCLUDE_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
-- NVIDIA/cub#486: CUB’s CUDA Runtime support macros have been updated to
+- NVIDIA/cub#486: CUB's CUDA Runtime support macros have been updated to
   support `NV_IF_TARGET`. They are now defined consistently across all
   host/device compilation passes. This should not affect most usages of these
   macros, but may require changes for some edge cases.
@@ -53,7 +106,7 @@ clarifying which operations can and cannot be performed in-place.
   - RDC enabled: Macro is defined.
   - RDC not enabled: Macro is not defined.
 - NVIDIA/cub#509: A compile-time error is now emitted when a `__device__`-only
-  lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
+  lambda's return type is queried from host code (requires libcu++ ≥ 1.9.0).
   - Due to limitations in the CUDA programming model, the result of this query
     is unreliable, and will silently return an incorrect result. This leads to
     difficult to debug errors.
@@ -66,7 +119,7 @@ clarifying which operations can and cannot be performed in-place.
 - NVIDIA/cub#509: Use the result type of the binary reduction operator for
   accumulating intermediate results in the `DeviceReduce` algorithm, following
   guidance from http://wg21.link/P2322R6.
-  - This change requires host-side introspection of the binary operator’s
+  - This change requires host-side introspection of the binary operator's
     signature, and device-only extended lambda functions can no longer be used.
   - In addition to the behavioral changes, the interfaces for
     the `Dispatch*Reduce` layer have changed:
@@ -87,7 +140,7 @@ clarifying which operations can and cannot be performed in-place.
   intermediate results in the `DeviceScan`, `DeviceScanByKey`,
   and `DeviceReduceByKey` algorithms, following guidance
   from http://wg21.link/P2322R6.
-  - This change requires host-side introspection of the binary operator’s
+  - This change requires host-side introspection of the binary operator's
     signature, and device-only extended lambda functions can no longer be used.
   - In addition to the behavioral changes, the interfaces for the `Dispatch`
     layer have changed:
@@ -190,7 +243,7 @@ Several CUB device algorithms are documented to provide deterministic results
 addition). Unfortunately, the implementations of these algorithms contain
 performance optimizations that violate this guarantee.
 The `DeviceReduce::ReduceByKey` and `DeviceScan` algorithms are known to be
-affected. We’re currently evaluating the scope and impact of correcting this in
+affected. We're currently evaluating the scope and impact of correcting this in
 a future CUB release. See NVIDIA/cub#471 for details.
 
 ## Bug Fixes
@@ -244,7 +297,7 @@ updates are also included.
 
 ### 64-bit Offsets in `DeviceRadixSort` Public APIs
 
-Users frequently want to process large datasets using CUB’s device-scope
+Users frequently want to process large datasets using CUB's device-scope
 algorithms, but the current public APIs limit input data sizes to those that can
 be indexed by a 32-bit integer. Beginning with this release, CUB is updating
 these APIs to support 64-bit offsets, as discussed in NVIDIA/cub#212.
@@ -322,7 +375,7 @@ now `SubtractLeft`, and `FlagTails` has been replaced by `SubtractRight`.
 - NVIDIA/cub#400: Implement a significant reduction in `DeviceMergeSort`
   compilation time.
 - NVIDIA/cub#415: Support user-defined `CMAKE_INSTALL_INCLUDEDIR` values in
-  Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.
+  Thrust's CMake install rules. Thanks to @robertmaynard for this contribution.
 
 ## Bug Fixes
 
@@ -339,7 +392,7 @@ now `SubtractLeft`, and `FlagTails` has been replaced by `SubtractRight`.
   gcc 10.
 - NVIDIA/cub#423: Fix some collisions with the `small` macro defined
   in `windows.h`.
-- NVIDIA/cub#426: Fix some issues with version handling in CUB’s CMake packages.
+- NVIDIA/cub#426: Fix some issues with version handling in CUB's CMake packages.
 - NVIDIA/cub#430: Remove documentation for `DeviceSpmv` parameters that are
   absent from public APIs.
 - NVIDIA/cub#432: Remove incorrect documentation for `DeviceScan` algorithms