Releases: NVIDIA/cub
CUB 1.2.3
Summary
CUB 1.2.3 is a minor release.
Bug Fixes
- Fixed access violation bug in cub::DeviceReduce::ReduceByKey for non-primitive value types.
- Fixed code-snippet bug in ArgIndexInputIteratorT documentation.
CUB 1.2.2
Summary
CUB 1.2.2 adds a new variant of cub::BlockReduce and MSVC project solutions for examples.
New Features
- MSVC project solutions for device-wide and block-wide examples
- New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).
Bug Fixes
- Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.
CUB 1.2.0
Summary
CUB 1.2.0 adds cub::DeviceReduce::ReduceByKey, cub::DeviceReduce::RunLengthEncode, and support for CUDA 6.0.
New Features
- cub::DeviceReduce::ReduceByKey.
- cub::DeviceReduce::RunLengthEncode.
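As a rough illustration of what these two primitives compute, here is a host-only C++ reference sketch. It is not CUB code and does not use CUB's API; the helper names `reduce_by_key_reference` and `run_length_encode_reference` are hypothetical. It assumes the usual semantics: values are combined within each run of consecutive equal keys, and run-length encoding is the special case where every value is 1.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-only reference for the semantics; hypothetical helper, not CUB's API.
// Combines the values of each run of consecutive equal keys with `op`
// and returns one (key, aggregate) pair per run.
template <typename Key, typename Value, typename ReduceOp>
std::vector<std::pair<Key, Value>> reduce_by_key_reference(
    const std::vector<Key>& keys,
    const std::vector<Value>& values,
    ReduceOp op)
{
    std::vector<std::pair<Key, Value>> out;
    for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
        if (!out.empty() && out.back().first == keys[i]) {
            // Same key as the previous item: fold the value into the open run.
            out.back().second = op(out.back().second, values[i]);
        } else {
            // Key changed: start a new run.
            out.emplace_back(keys[i], values[i]);
        }
    }
    return out;
}

// Run-length encoding expressed as a reduce-by-key over a vector of ones.
template <typename Key>
std::vector<std::pair<Key, std::size_t>> run_length_encode_reference(
    const std::vector<Key>& keys)
{
    std::vector<std::size_t> ones(keys.size(), 1);
    return reduce_by_key_reference(
        keys, ones, [](std::size_t a, std::size_t b) { return a + b; });
}
```

Note that only adjacent equal keys are merged in this sketch: the key sequence 1, 2, 1 yields three runs, not two.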
Other Enhancements
- Improved cub::DeviceScan, cub::DeviceSelect, and cub::DevicePartition performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explained iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for cub::DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences).
Bug Fixes
- Fix cub::DeviceScan bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status.
- Fix cub::BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block.
- Added workaround to make cub::TexRefInputIteratorT work with CUDA 6.0.
CUB 1.1.1
Summary
CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, cub::DeviceSelect, cub::DevicePartition, cub::Shuffle*, and cub::MaxSmOccupancy. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.
Breaking Changes
- Refactored block-wide I/O (cub::BlockLoad and cub::BlockStore), removing cache-modifiers from their interfaces. cub::CacheModifiedInputIterator and cub::CacheModifiedOutputIterator should now be used with cub::BlockLoad and cub::BlockStore to effect that behavior.
New Features
- cub::TexObjInputIterator, cub::TexRefInputIterator, cub::CacheModifiedInputIterator, and cub::CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
- Descending sorting for cub::DeviceRadixSort and cub::BlockRadixSort.
- Min, max, arg-min, and arg-max operators for cub::DeviceReduce.
- cub::DeviceSelect (select-unique, select-if, and select-flagged).
- cub::DevicePartition (partition-if, partition-flagged).
- Generic cub::ShuffleUp, cub::ShuffleDown, and cub::ShuffleIndex for warp-wide communication of arbitrary data types (SM3x and up).
- cub::MaxSmOccupancy for accurately determining SM occupancy for any given kernel function pointer.
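The selection and partitioning variants named above can be summarized with a host-only C++ reference sketch. This is not CUB code: the `*_reference` helper names are hypothetical, and the sketch only illustrates the intended results under the assumption that selected items keep their input order. The partition helper returns the two groups separately and makes no claim about how a device implementation lays out the rejected items.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-only reference sketches; hypothetical helper names, not CUB's API.

// select-if: keep the items for which pred(item) is true, in input order.
template <typename T, typename Pred>
std::vector<T> select_if_reference(const std::vector<T>& in, Pred pred)
{
    std::vector<T> out;
    for (const T& x : in)
        if (pred(x)) out.push_back(x);
    return out;
}

// select-flagged: keep in[i] wherever flags[i] is nonzero.
template <typename T>
std::vector<T> select_flagged_reference(const std::vector<T>& in,
                                        const std::vector<char>& flags)
{
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size() && i < flags.size(); ++i)
        if (flags[i]) out.push_back(in[i]);
    return out;
}

// select-unique: keep the first item of each run of consecutive equal items.
template <typename T>
std::vector<T> select_unique_reference(const std::vector<T>& in)
{
    std::vector<T> out;
    for (const T& x : in)
        if (out.empty() || !(out.back() == x)) out.push_back(x);
    return out;
}

// partition-if: split into (selected, rejected), each in input order.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>> partition_if_reference(
    const std::vector<T>& in, Pred pred)
{
    std::pair<std::vector<T>, std::vector<T>> out;
    for (const T& x : in)
        (pred(x) ? out.first : out.second).push_back(x);
    return out;
}
```

The difference between the two families is what happens to rejected items: selection discards them, while partitioning keeps every input item and only regroups it.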
Other Enhancements
- Improved cub::DeviceScan and cub::DeviceRadixSort performance for older GPUs (SM1x to SM3x).
- Renamed device-wide stream_synchronous param to debug_synchronous to avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).
Bug Fixes
- Fix misc cub::DeviceScan, cub::BlockScan, cub::DeviceReduce, and cub::BlockReduce bugs when operating on non-primitive types for older architectures (SM1x).
- SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
- For cub::WarpScan-based scans, not all threads in the first warp were entering the prefix callback functor.
- cub::DeviceRadixSort had a race condition with key-value pairs for pre-SM35 architectures.
- cub::DeviceRadixSort bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
- cub::BlockDiscontinuity failed to compile for types other than int32_t/uint32_t.
- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.
CUB 1.0.2
Summary
CUB 1.0.2 is a minor release.
Bug Fixes
- Corrections to code snippet examples for cub::BlockLoad, cub::BlockStore, and cub::BlockDiscontinuity.
- Cleaned up unnecessary/missing header includes. You can now safely include a specific .cuh (instead of cub.cuh).
- Bug/compilation fixes for cub::BlockHistogram.
CUB 1.0.1
Summary
CUB 1.0.1 adds cub::DeviceRadixSort and cub::DeviceScan. Numerous other performance and correctness fixes are included.
Breaking Changes
- New collective interface idiom (specialize/construct/invoke).
New Features
- cub::DeviceRadixSort. Implements short-circuiting for homogeneous digit passes.
- cub::DeviceScan. Implements single-pass "adaptive-lookback" strategy.
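To make the short-circuiting idea concrete, here is a host-only C++ sketch of a least-significant-digit radix sort that skips any digit pass in which every key has the same digit. This is an illustration under stated assumptions, not CUB's implementation: the function name `radix_sort_short_circuit`, the 8-bit digit width, and the way homogeneity is detected (one histogram bin holding all keys) are choices made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only sketch; hypothetical helper, not CUB's implementation.
// Sorts 32-bit keys in ascending order with an LSD radix sort over 8-bit
// digits and returns how many of the four digit passes were skipped.
inline int radix_sort_short_circuit(std::vector<std::uint32_t>& keys)
{
    constexpr int kRadixBits = 8;
    constexpr std::size_t kBins = 256;  // 2^kRadixBits
    int skipped = 0;
    std::vector<std::uint32_t> tmp(keys.size());

    for (int shift = 0; shift < 32; shift += kRadixBits) {
        // Histogram of the current digit.
        std::array<std::size_t, kBins> counts{};
        for (std::uint32_t k : keys) ++counts[(k >> shift) & (kBins - 1)];

        // Short-circuit: if a single bin holds every key, a stable scatter
        // on this digit would leave the order unchanged, so skip the pass.
        bool homogeneous = false;
        for (std::size_t c : counts) {
            if (c == keys.size()) { homogeneous = true; break; }
        }
        if (homogeneous) { ++skipped; continue; }

        // Exclusive prefix sum turns counts into starting offsets per bin.
        std::size_t running = 0;
        for (std::size_t& c : counts) {
            const std::size_t n = c;
            c = running;
            running += n;
        }

        // Stable scatter into the temporary buffer, then swap buffers.
        for (std::uint32_t k : keys) tmp[counts[(k >> shift) & (kBins - 1)]++] = k;
        keys.swap(tmp);
    }
    return skipped;
}
```

With keys that only occupy the low 16 bits, the two upper digit passes are homogeneous (all zero) and are skipped, which is where this optimization saves work for small-range keys.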
Other Enhancements
- Significantly improved documentation (with example code snippets).
- More extensive regression test suite for aggressively testing collective variants.
- Allow non-trivially-constructed types (previously unions had prevented aliasing temporary storage of those types).
- Improved support for SM3x SHFL (collective ops now use SHFL for types larger than 32 bits).
- Better code generation for 64-bit addressing within cub::BlockLoad/cub::BlockStore.
- cub::DeviceHistogram now supports histograms of arbitrary bins.
- Updates to accommodate CUDA 5.5 dynamic parallelism.
Bug Fixes
- Workarounds for SM10 codegen issues in uncommonly-used cub::WarpScan/cub::WarpReduce specializations.
CUB 0.9.4
Summary
CUB 0.9.4 is a minor release.
Enhancements
- Various documentation updates and corrections.
Bug Fixes
- Fixed compilation errors for SM1x.
- Fixed compilation errors for some WarpScan entrypoints on SM3x and up.
CUB 0.9.3
Summary
CUB 0.9.3 adds histogram algorithms and work management utility descriptors.
New Features
- cub::DeviceHistogram256.
- cub::BlockHistogram256.
- cub::BlockScan algorithm variant BLOCK_SCAN_RAKING_MEMOIZE, which trades more register consumption for less shared memory I/O.
- cub::GridQueue, cub::GridEvenShare, work management utility descriptors.
Other Enhancements
- Updates to cub::BlockRadixRank to use cub::BlockScan, which improves performance on SM3x by using SHFL.
- Allow types other than builtin types to be used in cub::WarpScan::*Sum methods if they only have operator+ overloaded. Previously they were also required to support assignment from int(0).
- Update cub::BlockReduce's BLOCK_REDUCE_WARP_REDUCTIONS algorithm to work even when block size is not an even multiple of warp size.
- Refactoring of cub::DeviceAllocator interface and cub::CachingDeviceAllocator implementation.
CUB 0.9.2
Summary
CUB 0.9.2 adds cub::WarpReduce.
New Features
- cub::WarpReduce, which uses the SHFL instruction when applicable. cub::BlockReduce now uses this cub::WarpReduce instead of implementing its own.
Enhancements
- Documentation updates and corrections.
Bug Fixes
- Fixes for 64-bit Linux compilation warnings and errors.
CUB 0.9.1
Summary
CUB 0.9.1 is a minor release.
Bug Fixes
- Fix for ambiguity in cub::BlockScan::Reduce between generic reduction and summation. Summation entrypoints are now called ::Sum(), similar to the convention in cub::BlockScan.
- Small edits to documentation and download tracking.