 * \tparam BLOCK_THREADS     The threadblock size in threads
 * \tparam ITEMS_PER_THREAD  The number of items per thread
-* \tparam ALGORITHM         <b>[optional]</b> cub::BlockHisto256Algorithm enumerator specifying the underlying algorithm to use (default = cub::BLOCK_BYTE_HISTO_SORT)
+* \tparam ALGORITHM         <b>[optional]</b> cub::BlockHisto256Algorithm enumerator specifying the underlying algorithm to use (default = cub::BLOCK_HISTO_256_SORT)
 *
 * \par Algorithm
 * BlockHisto256 can be (optionally) configured to use different algorithms:
-* -# <b>cub::BLOCK_BYTE_HISTO_SORT</b>.  Sorting followed by differentiation.  [More...](\ref cub::BlockHisto256Algorithm)
-* -# <b>cub::BLOCK_BYTE_HISTO_ATOMIC</b>.  Use atomic addition to update byte counts directly.  [More...](\ref cub::BlockHisto256Algorithm)
+* -# <b>cub::BLOCK_HISTO_256_SORT</b>.  Sorting followed by differentiation.  [More...](\ref cub::BlockHisto256Algorithm)
+* -# <b>cub::BLOCK_HISTO_256_ATOMIC</b>.  Use atomic addition to update byte counts directly.  [More...](\ref cub::BlockHisto256Algorithm)
 *
 * \par Usage Considerations
 * - The histogram output can be constructed in shared or global memory
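The sort-based or atomic-based variant is chosen through the ALGORITHM template parameter when the primitive is specialized. The sketch below illustrates that pattern using the present-day cub::BlockHistogram class and its cub::BLOCK_HISTO_ATOMIC / cub::BLOCK_HISTO_SORT enumerators rather than the BlockHisto256 / BLOCK_HISTO_256_* names of this archived revision, so treat the exact identifiers as assumptions about the current library, not this revision's API.

    #include <cub/block/block_histogram.cuh>

    // One 256-bin histogram per threadblock: 128 threads, 4 samples per thread.
    // Swapping cub::BLOCK_HISTO_ATOMIC for cub::BLOCK_HISTO_SORT selects the
    // sort-then-differentiate variant instead of atomic additions.
    __global__ void BlockHistogramKernel(const unsigned char *d_samples,
                                         unsigned int        *d_histograms)
    {
        typedef cub::BlockHistogram<unsigned char, 128, 4, 256, cub::BLOCK_HISTO_ATOMIC> BlockHistogramT;

        __shared__ typename BlockHistogramT::TempStorage temp_storage;
        __shared__ unsigned int smem_histogram[256];

        // Each thread gathers its 4 consecutive samples from this block's tile
        unsigned char data[4];
        for (int i = 0; i < 4; ++i)
            data[i] = d_samples[(blockIdx.x * 128 + threadIdx.x) * 4 + i];

        // Collectively build the block-wide histogram in shared memory
        BlockHistogramT(temp_storage).Histogram(data, smem_histogram);
        __syncthreads();

        // Write the block's histogram back to global memory
        for (int bin = threadIdx.x; bin < 256; bin += 128)
            d_histograms[blockIdx.x * 256 + bin] = smem_histogram[bin];
    }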
        InputIteratorRA block_itr,                      ///< [in] The threadblock's base input iterator for loading from
        const int       &guarded_items,                 ///< [in] Number of valid items in the tile
        T               (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
        int             stride = blockDim.x)            ///< [in] <b>[optional]</b> Stripe stride.  Default is the width of the threadblock.  More efficient code can be generated if a compile-time-constant (e.g., BLOCK_THREADS) is supplied.

        InputIteratorRA block_itr,                      ///< [in] The threadblock's base input iterator for loading from
        const int       &guarded_items,                 ///< [in] Number of valid items in the tile
        T               oob_default,                    ///< [in] Default value to assign out-of-bound items
        T               (&items)[ITEMS_PER_THREAD],     ///< [out] Data to load
        int             stride = blockDim.x)            ///< [in] <b>[optional]</b> Stripe stride.  Default is the width of the threadblock.  More efficient code can be generated if a compile-time-constant (e.g., BLOCK_THREADS) is supplied.
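The two overloads differ only in whether out-of-bounds elements are left untouched or filled with oob_default. A hand-rolled equivalent of the guarded, filling variant is sketched below; it is an illustration of the behavior, not the library's implementation, and the GuardedStripedLoad name is made up here.

    // Sketch of a guarded striped load: thread t reads elements t, t + stride,
    // t + 2*stride, ..., and substitutes oob_default for anything at or beyond
    // guarded_items.  Assumes it is called uniformly by all threads in the block.
    template <typename T, int ITEMS_PER_THREAD, typename InputIteratorRA>
    __device__ __forceinline__ void GuardedStripedLoad(
        InputIteratorRA block_itr,                      // threadblock's base input iterator
        int             guarded_items,                  // number of valid items in the tile
        T               oob_default,                    // value assigned to out-of-bound items
        T               (&items)[ITEMS_PER_THREAD],     // per-thread output registers
        int             stride = blockDim.x)            // stripe stride (defaults to block width)
    {
        #pragma unroll
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        {
            int offset = threadIdx.x + (i * stride);
            items[i] = (offset < guarded_items) ? block_itr[offset] : oob_default;
        }
    }

Passing a compile-time constant such as BLOCK_THREADS for stride lets the compiler fold the offsets statically, which is the efficiency note in the parameter description above.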
 * BlockReduceAlgorithm enumerates alternative algorithms for parallel
 * reduction across a CUDA threadblock.
@@ -59,9 +64,13 @@ enum BlockReduceAlgorithm
    /**
     * \par Overview
-    * An efficient "raking" reduction algorithm.  Execution is comprised of three phases:
-    * -# Upsweep sequential reduction in registers (if threads contribute more than one input each).  Each thread then places the partial reduction of its item(s) into shared memory.
-    * -# Upsweep sequential reduction in shared memory.  Threads within a single warp rake across segments of shared partial reductions.
+    * An efficient "raking" reduction algorithm.  Execution is comprised of
+    * three phases:
+    * -# Upsweep sequential reduction in registers (if threads contribute more
+    *    than one input each).  Each thread then places the partial reduction
+    *    of its item(s) into shared memory.
+    * -# Upsweep sequential reduction in shared memory.  Threads within a
+    *    single warp rake across segments of shared partial reductions.
     * -# A warp-synchronous Kogge-Stone style reduction within the raking warp.
     *
     * \par
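For context, the algorithm is selected when BlockReduce is specialized. A minimal sketch using the present-day cub::BlockReduce interface (the archived revision's construction details may differ slightly):

    #include <cub/block/block_reduce.cuh>

    // Sum 512 ints per block (128 threads x 4 items) with the raking variant.
    // Substituting cub::BLOCK_REDUCE_WARP_REDUCTIONS selects the
    // warp-reductions variant described in the next hunk.
    __global__ void BlockSumKernel(const int *d_in, int *d_block_sums)
    {
        typedef cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING> BlockReduceT;
        __shared__ typename BlockReduceT::TempStorage temp_storage;

        // Each thread loads four consecutive items from its block's tile
        int items[4];
        for (int i = 0; i < 4; ++i)
            items[i] = d_in[(blockIdx.x * 128 + threadIdx.x) * 4 + i];

        // Collective reduction; the aggregate is only valid in thread 0
        int block_sum = BlockReduceT(temp_storage).Sum(items);
        if (threadIdx.x == 0)
            d_block_sums[blockIdx.x] = block_sum;
    }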
@@ -78,24 +87,34 @@ enum BlockReduceAlgorithm
    /**
     * \par Overview
-    * A quick "tiled warp-reductions" reduction algorithm.  Execution is comprised of four phases:
-    * -# Upsweep sequential reduction in registers (if threads contribute more than one input each).  Each thread then places the partial reduction of its item(s) into shared memory.
-    * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style reduction within each warp.
-    * -# A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.
+    * A quick "tiled warp-reductions" reduction algorithm.  Execution is
+    * comprised of four phases:
+    * -# Upsweep sequential reduction in registers (if threads contribute more
+    *    than one input each).  Each thread then places the partial reduction
+    *    of its item(s) into shared memory.
+    * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style
+    *    reduction within each warp.
+    * -# A propagation phase where the warp reduction outputs in each warp are
+    *    updated with the aggregate from each preceding warp.
     *
     * \par
     * \image html block_scan_warpscans.png
     * <div class="centercaption">\p BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.</div>
     *
     * \par Performance Considerations
     * - Although this variant may suffer lower overall throughput across the
-    *   GPU because of a heavy reliance on inefficient warp-reductions, it can
-    *   often provide lower turnaround latencies when the GPU is under-occupied.
+    *   GPU because of a heavy reliance on inefficient warp-reductions, it
+    *   can often provide lower turnaround latencies when the GPU is
+    *   under-occupied.
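A simplified hand-rolled sketch of the same data flow follows. It is not CUB's implementation: it combines the per-warp aggregates serially rather than propagating them across warps as described above, and it assumes the block size is a multiple of the 32-thread warp size.

    // Simplified sketch of the "tiled warp-reductions" data flow (not CUB's
    // implementation).  Assumes blockDim.x is a multiple of 32 and at most 1024.
    __device__ int BlockSumViaWarpReductions(int thread_value)
    {
        __shared__ int warp_aggregates[32];               // one slot per possible warp

        // Warp-synchronous shuffle reduction within each warp
        int value = thread_value;
        for (int offset = 16; offset > 0; offset >>= 1)
            value += __shfl_down_sync(0xffffffff, value, offset);

        // Lane 0 of each warp publishes its warp's aggregate
        if ((threadIdx.x & 31) == 0)
            warp_aggregates[threadIdx.x >> 5] = value;
        __syncthreads();

        // Combine the per-warp aggregates (serial here for clarity; CUB instead
        // propagates the running aggregate from each preceding warp)
        int block_aggregate = 0;
        if (threadIdx.x == 0)
        {
            int num_warps = blockDim.x >> 5;
            for (int w = 0; w < num_warps; ++w)
                block_aggregate += warp_aggregates[w];
        }
        return block_aggregate;                           // valid in thread 0 only
    }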