|
85 | 85 | * <br> |
86 | 86 | * <a href="https://github.com/NVlabs/cub"><img src="github-icon-747d8b799a48162434b2c0595ba1317e.png" style="position:relative; bottom:-10px; border:0px;"/></a> |
87 | 87 | * |
88 | | - * <a href="https://github.com/NVlabs/cub">Fork CUB at GitHub!</a> |
| 88 | + * <a href="https://github.com/NVlabs/cub">Browse or fork CUB at GitHub!</a> |
89 | 89 | * <br> |
90 | 90 | * <a href="http://groups.google.com/group/cub-users"><img src="groups-icon.png" style="position:relative; bottom:-10px; border:0px;"/></a> |
91 | 91 | * |
|
96 | 96 | * |
97 | 97 | * \par |
98 | 98 | * CUB is a library of high-performance parallel primitives and other utilities for |
99 | | - * building CUDA kernel software. CUB enhances productivity, performance, and portability |
| 99 | + * constructing CUDA kernel software. CUB enhances productivity, performance, and portability |
100 | 100 | * by providing an abstraction layer over complex |
101 | 101 | * [block-level] (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model), |
102 | 102 | * [warp-level] (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation), and |
|
105 | 105 | * \par |
106 | 106 | * CUB's primitives are not bound to any particular width of parallelism or to any particular |
107 | 107 | * data type. This allows them to be flexible and tunable to fit your kernels' needs. |
108 | | - * Thus CUB is [<b>C</b>UDA <b>U</b>n<b>b</b>ound](index.html). |
| 108 | + * Thus CUB is [<em>CUDA Unbound</em>](index.html). |
109 | 109 | * |
110 | 110 | * \image html cub_overview.png |
111 | 111 | * |
|
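For concreteness, here is a minimal sketch of what "CUDA Unbound" means in practice: a block-wide reduction whose width of parallelism and element type are both template parameters. The kernel name, indexing, and constants below are illustrative assumptions, not part of CUB; only cub::BlockReduce and its TempStorage/Sum interface come from the library.

```cpp
#include <cub/cub.cuh>

// Hypothetical kernel: the same primitive specializes to whatever block size
// and data type the caller picks at compile time.
template <typename T, int BLOCK_THREADS>
__global__ void BlockSumKernel(const T *d_in, T *d_out)
{
    // Specialize cub::BlockReduce for this block size and item type
    typedef cub::BlockReduce<T, BLOCK_THREADS> BlockReduce;

    // Opaque shared memory required by the primitive
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread contributes one item; thread0 receives the block-wide sum
    T item = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    T aggregate = BlockReduce(temp_storage).Sum(item);

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}
```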
201 | 201 | * library and software abstraction layer, CUB provides: |
202 | 202 | * -# <b>Simplicity of composition.</b> Parallel CUB primitives can be simply sequenced |
203 | 203 | * together in kernel code. (This convenience is analogous to programming with |
204 | | - * [<b>Thrust</b>](http://thrust.github.com/) primitives in the host program.) |
| 204 | + * [<b><em>Thrust</em></b>](http://thrust.github.com/) primitives in the host program.) |
205 | 205 | * -# <b>High performance.</b> CUB simplifies high performance kernel development by |
206 | 206 |  *   taking care to implement and make available the fastest known algorithms,
207 | 207 | * strategies, and techniques. |
|
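As a hedged illustration of the "simplicity of composition" point above, the following sketch sequences three CUB primitives (cub::BlockLoad, cub::BlockScan, cub::BlockStore) inside one kernel. The kernel name, tile geometry, and use of separate temp-storage instances are illustrative choices, not prescribed by CUB.

```cpp
#include <cub/cub.cuh>

// Hypothetical kernel: cooperatively load a tile, prefix-sum it, store it back.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void LoadScanStoreKernel(const int *d_in, int *d_out)
{
    // Specialize each primitive for this block's configuration
    typedef cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD>  BlockLoad;
    typedef cub::BlockScan<int, BLOCK_THREADS>                    BlockScan;
    typedef cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStore;

    // Each primitive exposes its own opaque shared memory requirement
    __shared__ typename BlockLoad::TempStorage  load_storage;
    __shared__ typename BlockScan::TempStorage  scan_storage;
    __shared__ typename BlockStore::TempStorage store_storage;

    int items[ITEMS_PER_THREAD];
    int tile_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    BlockLoad(load_storage).Load(d_in + tile_offset, items);      // cooperative load
    __syncthreads();
    BlockScan(scan_storage).ExclusiveSum(items, items);           // block-wide prefix sum
    __syncthreads();
    BlockStore(store_storage).Store(d_out + tile_offset, items);  // cooperative store
}
```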
223 | 223 | * engendering its own abstraction layer in the CUDA software stack (i.e., the "black boxes" |
224 | 224 | * below): |
225 | 225 | * |
226 | | - * <table border="0px" style="padding:0px; border:0px; margin:0px;"><tr> |
| 226 | + * <table border="0px" cellpadding="0px" cellspacing="0px"><tr> |
227 | 227 | * <td width="50%"> |
228 | 228 | * \par |
229 | | - * <b>CUDA kernel</b>. A CPU program invokes a CUDA kernel to perform |
230 | | - * some data-parallel function. Reuse of entire kernels (by incorporating them into |
231 | | - * libraries) is the most common form of code reuse for CUDA. Libraries of CUDA kernels include |
232 | | - * the following: |
233 | | - * - [<b>cuBLAS</b>](https://developer.nvidia.com/cublas) |
234 | | - * - [<b>cuFFT</b>](https://developer.nvidia.com/cufft) |
235 | | - * - [<b>cuSPARSE</b>](https://developer.nvidia.com/cusparse) |
236 | | - * - [<b>Thrust</b>](http://thrust.github.com/) |
| 229 | + * <b>CUDA kernel</b>. A single CPU thread invokes a CUDA kernel to perform |
| 230 | + * some data-parallel function. The incorporation of entire kernels (and their |
| 231 | + * corresponding invocation stubs) into libraries is the most common form of code reuse for |
| 232 | + * CUDA. Libraries of CUDA kernels include the following: |
| 233 | + * - [<b><em>cuBLAS</em></b>](https://developer.nvidia.com/cublas) |
  | 234 | + *   - [<b><em>cuFFT</em></b>](https://developer.nvidia.com/cufft)
| 235 | + * - [<b><em>cuSPARSE</em></b>](https://developer.nvidia.com/cusparse) |
| 236 | + * - [<b><em>Thrust</em></b>](http://thrust.github.com/) |
237 | 237 | * </td> |
238 | 238 | * <td width="50%"> |
239 | 239 | * \htmlonly |
240 | | - * <a href="kernel_abstraction.png"><center><img src="kernel_abstraction.png" width="100%"/></center></a> |
| 240 | + * <a href="kernel_abstraction.png"><img src="kernel_abstraction.png" width="100%"/></a> |
241 | 241 | * \endhtmlonly |
242 | 242 | * </td> |
243 | 243 | * </tr><tr> |
244 | 244 | * <td> |
245 | 245 | * \par |
246 | | - * <b>Thread blocks (SIMT)</b>. Each kernel invocation comprises some number of parallel threads. Threads |
247 | | - * are grouped into blocks, and the threads within a block can communicate and synchronize with each other |
248 | | - * to perform some cooperative function. There has historically been very little reuse of cooperative SIMT |
249 | | - * software within CUDA kernel. Libraries of thread-block primitives include the following: |
250 | | - * - [<b>CUB</b>](index.html) |
| 246 | + * <b>Thread blocks (SIMT)</b>. Each kernel invocation comprises some number of parallel |
| 247 | + * threads. Threads are grouped into blocks, and the entire block of threads invokes some cooperative |
| 248 | + * function in which they communicate and synchronize with each other. There has historically been very |
  | 249 | + * little reuse of cooperative SIMT software within CUDA kernels.  Libraries of thread-block primitives
| 250 | + * include the following: |
| 251 | + * - [<b><em>CUB</em></b>](index.html) |
251 | 252 | * </td> |
252 | 253 | * <td> |
253 | 254 | * \htmlonly |
254 | | - * <a href="simt_abstraction.png"><center><img src="simt_abstraction.png" width="100%"/></center></a> |
| 255 | + * <a href="simt_abstraction.png"><img src="simt_abstraction.png" width="100%"/></a> |
255 | 256 | * \endhtmlonly |
256 | 257 | * </td> |
257 | 258 | * </tr><tr> |
258 | 259 | * <td> |
259 | 260 | * \par |
260 | | - * <b>CUDA thread (scalar)</b>. A single CUDA thread invokes some scalar function. |
261 | | - * This is the lowest level of CUDA software abstraction, and is useful when there is no |
262 | | - * need to reason about the interaction of parallel threads. CUDA libraries of |
| 261 | + * <b>CUDA thread</b>. A single CUDA thread invokes some sequential function. |
| 262 | + * This is the finest-grained level of CUDA software abstraction and requires |
| 263 | + * no consideration for the scheduling or synchronization of parallel threads. CUDA libraries of |
263 | 264 | * purely data-parallel functions include the following: |
264 | | - * - [<b>CUDA Math Library</b>](https://developer.nvidia.com/cuda-math-library) (e.g., \p text1D(), \p atomicAdd(), \p popc(), etc.) |
265 | | - * - [<b>cuRAND</b>](https://developer.nvidia.com/curand)'s device-code interface |
266 | | - * - [<b>CUB</b>](index.html) |
  | 265 | + *   - [<b><em>CUDA Math</em></b>](http://docs.nvidia.com/cuda/cuda-math-api/index.html),
| 266 | + * [<b><em>Texture</em></b>](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-functions), and |
| 267 | + * [<b><em>Atomic</em></b>](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions) APIs |
| 268 | + * - [<b><em>cuRAND</em></b>](https://developer.nvidia.com/curand)'s device-code interface |
| 269 | + * - [<b><em>CUB</em></b>](index.html) |
267 | 270 | * </td> |
268 | 271 | * <td> |
269 | 272 | * \htmlonly |
270 | | - * <a href="devfun_abstraction.png"><center><img src="devfun_abstraction.png" width="100%"/></center></a> |
| 273 | + * <a href="devfun_abstraction.png"><img src="devfun_abstraction.png" width="100%"/></a> |
271 | 274 | * \endhtmlonly |
272 | 275 | * </td> |
273 | 276 | * </tr></table> |
|
334 | 337 | * To address this issue, we encapsulate cooperative procedures within |
335 | 338 |  * <em>reflective type structures</em> (C++ classes).  As illustrated in the
336 | 339 | * cub::BlockScan example above, these primitives are C++ classes with |
337 | | - * interfaces that expose both (1) procedural methods as well as (2) the opaque |
338 | | - * shared memory types needed for their operation. |
| 340 | + * interfaces that expose both: |
| 341 | + * - Procedural entrypoints for a block of threads to invoke |
| 342 | + * - An opaque shared memory type needed for the operation of those methods |
339 | 343 | * |
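A minimal sketch of that two-part interface (the block size, kernel name, and union member names are illustrative assumptions): the opaque TempStorage type is placed in shared memory by the caller, which also makes it possible to alias that storage with other per-block data.

```cpp
#include <cub/cub.cuh>

// Hypothetical kernel illustrating the (1) procedural entrypoint and
// (2) opaque shared memory type exposed by a CUB block primitive.
__global__ void ReflectiveInterfaceKernel(int *d_data)
{
    typedef cub::BlockScan<int, 128> BlockScan;

    // (2) The primitive's opaque shared memory type, unioned with other
    //     per-block storage so the space can be reused between phases
    __shared__ union
    {
        typename BlockScan::TempStorage scan;
        int                             staging[128];
    } smem;

    int thread_data = d_data[blockIdx.x * 128 + threadIdx.x];

    // (1) The procedural entrypoint, invoked collectively by the block
    BlockScan(smem.scan).ExclusiveSum(thread_data, thread_data);

    d_data[blockIdx.x * 128 + threadIdx.x] = thread_data;
}
```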
340 | 344 | * \subsection sec3sec3 6.3 Flexible data mapping |
341 | 345 | * |
342 | 346 | * \par |
343 | 347 | * We often design kernels such that each thread block is assigned a "tile" of data |
344 | | - * items for processing. When the tile size equals the thread block size, the |
| 348 | + * items for processing. |
| 349 | + * |
| 350 | + * \par |
| 351 | + * \image html tile.png |
| 352 | + * <div class="centercaption">Tile of eight ordered data items</div> |
  | 353 | + *
| 354 | + * \par |
| 355 | + * When the tile size equals the thread block size, the |
345 | 356 | * mapping of data onto threads is straightforward (one datum per thread). |
346 | | - * However, it is often desirable for performance reasons to process more |
347 | | - * than one datum per thread. When doing so, we must decide how |
348 | | - * to partition this "tile" of items across the thread block. |
  | 357 | + * However, there are often performance advantages to processing more
| 358 | + * than one datum per thread. For these scenarios, CUB primitives |
| 359 | + * support the following alternatives for partitioning data items across |
| 360 | + * the block of threads: |
| 361 | + * |
349 | 362 | * |
| 363 | + * <table border="0px" cellpadding="0px" cellspacing="0px"><tr> |
| 364 | + * <td> |
350 | 365 | * \par |
351 | | - * CUB primitives support the following data arrangements: |
352 | 366 | * - <b><em>Blocked arrangement</em></b>. The aggregate tile of items is partitioned |
353 | 367 | * evenly across threads in "blocked" fashion with thread<sub><em>i</em></sub> |
354 | 368 | * owning the <em>i</em><sup>th</sup> segment of consecutive elements. |
| 369 | + * </td> |
| 370 | + * <td> |
| 371 | + * \par |
| 372 | + * \image html blocked.png |
| 373 | + * <div class="centercaption"><em>Blocked</em> arrangement across four threads <br>(emphasis on items owned by <em>thread</em><sub>0</sub>)</div> |
| 374 | + * </td> |
| 375 | + * </tr><tr> |
| 376 | + * <td> |
| 377 | + * \par |
355 | 378 | * - <b><em>Striped arrangement</em></b>. The aggregate tile of items is partitioned across |
356 | 379 | * threads in "striped" fashion, i.e., the \p ITEMS_PER_THREAD items owned by |
357 | 380 | * each thread have logical stride \p BLOCK_THREADS between them. |
358 | | - * <br><br> |
359 | | - * \image html thread_data_1.png |
360 | | - * <div class="centercaption">Blocked vs. striped arrangements with \p BLOCK_THREADS = 4 and |
361 | | - * \p ITEMS_PER_THREAD = 2, emphasis on items owned by <em>thread</em><sub>0</sub></div> |
362 | | - * <br> |
| 381 | + * </td> |
| 382 | + * <td> |
| 383 | + * \par |
| 384 | + * \image html striped.png |
| 385 | + * <div class="centercaption"><em>Striped</em> arrangement across four threads <br>(emphasis on items owned by <em>thread</em><sub>0</sub>)</div> |
| 386 | + * </td> |
| 387 | + * </tr></table> |
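A minimal sketch of the index arithmetic behind the two arrangements (the helper name and element type are illustrative; CUB's cub::BlockLoad, cub::BlockStore, and cub::BlockExchange implement these mappings for you):

```cpp
// Hypothetical helper: gather one tile of BLOCK_THREADS * ITEMS_PER_THREAD
// items into per-thread registers under each arrangement.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__device__ void LoadTileBothWays(
    const int *tile,                       // tile in global or shared memory
    int (&blocked)[ITEMS_PER_THREAD],      // thread_i gets consecutive items
    int (&striped)[ITEMS_PER_THREAD])      // thread_i's items are BLOCK_THREADS apart
{
    for (int item = 0; item < ITEMS_PER_THREAD; ++item)
    {
        // Blocked: thread_i owns the i-th segment of consecutive elements
        blocked[item] = tile[threadIdx.x * ITEMS_PER_THREAD + item];

        // Striped: successive items owned by a thread are strided by BLOCK_THREADS
        striped[item] = tile[item * BLOCK_THREADS + threadIdx.x];
    }
}
```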
363 | 388 | * |
364 | 389 | * \par |
365 | 390 | * The benefits of processing multiple items per thread (a.k.a., <em>register blocking</em>, <em>granularity coarsening</em>, etc.) include: |
366 | | - * - <b>Algorithmic efficiency</b>. Sequential work over multiple items in |
| 391 | + * - Algorithmic efficiency. Sequential work over multiple items in |
367 | 392 | * thread-private registers is cheaper than synchronized, cooperative |
368 | 393 | * work through shared memory spaces. |
369 | | - * - <b>Data occupancy</b>. The number of items that can be resident on-chip in |
| 394 | + * - Data occupancy. The number of items that can be resident on-chip in |
370 | 395 | * thread-private register storage is often greater than the number of |
371 | 396 | * schedulable threads. |
372 | | - * - <b>Instruction-level parallelism</b>. Multiple items per thread also |
| 397 | + * - Instruction-level parallelism. Multiple items per thread also |
373 | 398 |  *   facilitate greater ILP for improved throughput and utilization.
374 | 399 | * |
375 | 400 | * \par |
376 | | - * The cub::BlockExchange primitive provides operations for converting between blocked |
| 401 | + * Furthermore, cub::BlockExchange provides operations for converting between blocked |
377 | 402 | * and striped arrangements. Blocked arrangements are often desirable for |
378 | 403 | * algorithmic benefits (where long sequences of items can be processed sequentially |
379 | 404 | * within each thread). Striped arrangements are often desirable for data movement |
|