
Commit 725528f

Update doc to clarify tiling arrangements

Former-commit-id: 3ebb45b
1 parent: b8895b6

File tree: 6 files changed, +76 −45 lines

cub/cub.cuh

Lines changed: 68 additions & 43 deletions
@@ -85,7 +85,7 @@
 * <br>
 * <a href="https://github.com/NVlabs/cub"><img src="github-icon-747d8b799a48162434b2c0595ba1317e.png" style="position:relative; bottom:-10px; border:0px;"/></a>
 * &nbsp;&nbsp;
- * <a href="https://github.com/NVlabs/cub">Fork CUB at GitHub!</a>
+ * <a href="https://github.com/NVlabs/cub">Browse or fork CUB at GitHub!</a>
 * <br>
 * <a href="http://groups.google.com/group/cub-users"><img src="groups-icon.png" style="position:relative; bottom:-10px; border:0px;"/></a>
 * &nbsp;&nbsp;
@@ -96,7 +96,7 @@
 *
 * \par
 * CUB is a library of high-performance parallel primitives and other utilities for
- * building CUDA kernel software. CUB enhances productivity, performance, and portability
+ * constructing CUDA kernel software. CUB enhances productivity, performance, and portability
 * by providing an abstraction layer over complex
 * [block-level] (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model),
 * [warp-level] (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation), and
@@ -105,7 +105,7 @@
 * \par
 * CUB's primitives are not bound to any particular width of parallelism or to any particular
 * data type. This allows them to be flexible and tunable to fit your kernels' needs.
- * Thus CUB is [<b>C</b>UDA <b>U</b>n<b>b</b>ound](index.html).
+ * Thus CUB is [<em>CUDA Unbound</em>](index.html).
 *
 * \image html cub_overview.png
 *
@@ -201,7 +201,7 @@
 * library and software abstraction layer, CUB provides:
 * -# <b>Simplicity of composition.</b> Parallel CUB primitives can be simply sequenced
 * together in kernel code. (This convenience is analogous to programming with
- * [<b>Thrust</b>](http://thrust.github.com/) primitives in the host program.)
+ * [<b><em>Thrust</em></b>](http://thrust.github.com/) primitives in the host program.)
 * -# <b>High performance.</b> CUB simplifies high performance kernel development by
 * taking care to implement and make available the fastest available algorithms,
 * strategies, and techniques.
@@ -223,51 +223,54 @@
 * engendering its own abstraction layer in the CUDA software stack (i.e., the "black boxes"
 * below):
 *
- * <table border="0px" style="padding:0px; border:0px; margin:0px;"><tr>
+ * <table border="0px" cellpadding="0px" cellspacing="0px"><tr>
 * <td width="50%">
 * \par
- * <b>CUDA kernel</b>. A CPU program invokes a CUDA kernel to perform
- * some data-parallel function. Reuse of entire kernels (by incorporating them into
- * libraries) is the most common form of code reuse for CUDA. Libraries of CUDA kernels include
- * the following:
- * - [<b>cuBLAS</b>](https://developer.nvidia.com/cublas)
- * - [<b>cuFFT</b>](https://developer.nvidia.com/cufft)
- * - [<b>cuSPARSE</b>](https://developer.nvidia.com/cusparse)
- * - [<b>Thrust</b>](http://thrust.github.com/)
+ * <b>CUDA kernel</b>. A single CPU thread invokes a CUDA kernel to perform
+ * some data-parallel function. The incorporation of entire kernels (and their
+ * corresponding invocation stubs) into libraries is the most common form of code reuse for
+ * CUDA. Libraries of CUDA kernels include the following:
+ * - [<b><em>cuBLAS</em></b>](https://developer.nvidia.com/cublas)
+ * - [<b><em>cuFFT</em></b>](https://developer.nvidia.com/cufft)
+ * - [<b><em>cuSPARSE</em></b>](https://developer.nvidia.com/cusparse)
+ * - [<b><em>Thrust</em></b>](http://thrust.github.com/)
 * </td>
 * <td width="50%">
 * \htmlonly
- * <a href="kernel_abstraction.png"><center><img src="kernel_abstraction.png" width="100%"/></center></a>
+ * <a href="kernel_abstraction.png"><img src="kernel_abstraction.png" width="100%"/></a>
 * \endhtmlonly
 * </td>
 * </tr><tr>
 * <td>
 * \par
- * <b>Thread blocks (SIMT)</b>. Each kernel invocation comprises some number of parallel threads. Threads
- * are grouped into blocks, and the threads within a block can communicate and synchronize with each other
- * to perform some cooperative function. There has historically been very little reuse of cooperative SIMT
- * software within CUDA kernel. Libraries of thread-block primitives include the following:
- * - [<b>CUB</b>](index.html)
+ * <b>Thread blocks (SIMT)</b>. Each kernel invocation comprises some number of parallel
+ * threads. Threads are grouped into blocks, and the entire block of threads invokes some cooperative
+ * function in which they communicate and synchronize with each other. There has historically been very
+ * little reuse of cooperative SIMT software within CUDA kernels. Libraries of thread-block primitives
+ * include the following:
+ * - [<b><em>CUB</em></b>](index.html)
 * </td>
 * <td>
 * \htmlonly
- * <a href="simt_abstraction.png"><center><img src="simt_abstraction.png" width="100%"/></center></a>
+ * <a href="simt_abstraction.png"><img src="simt_abstraction.png" width="100%"/></a>
 * \endhtmlonly
 * </td>
 * </tr><tr>
 * <td>
 * \par
- * <b>CUDA thread (scalar)</b>. A single CUDA thread invokes some scalar function.
- * This is the lowest level of CUDA software abstraction, and is useful when there is no
- * need to reason about the interaction of parallel threads. CUDA libraries of
+ * <b>CUDA thread</b>. A single CUDA thread invokes some sequential function.
+ * This is the finest-grained level of CUDA software abstraction and requires
+ * no consideration for the scheduling or synchronization of parallel threads. CUDA libraries of
 * purely data-parallel functions include the following:
- * - [<b>CUDA Math Library</b>](https://developer.nvidia.com/cuda-math-library) (e.g., \p text1D(), \p atomicAdd(), \p popc(), etc.)
- * - [<b>cuRAND</b>](https://developer.nvidia.com/curand)'s device-code interface
- * - [<b>CUB</b>](index.html)
+ * - [<b><em>CUDA Math</em></b>](http://docs.nvidia.com/cuda/cuda-math-api/index.html),
+ * [<b><em>Texture</em></b>](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-functions), and
+ * [<b><em>Atomic</em></b>](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions) APIs
+ * - [<b><em>cuRAND</em></b>](https://developer.nvidia.com/curand)'s device-code interface
+ * - [<b><em>CUB</em></b>](index.html)
 * </td>
 * <td>
 * \htmlonly
- * <a href="devfun_abstraction.png"><center><img src="devfun_abstraction.png" width="100%"/></center></a>
+ * <a href="devfun_abstraction.png"><img src="devfun_abstraction.png" width="100%"/></a>
 * \endhtmlonly
 * </td>
 * </tr></table>
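The three layers above can coexist in a single program. As a rough sketch (the 128-thread block size and the d_data/d_sum parameters are illustrative assumptions), a kernel might combine per-thread scalar calls with a block-wide CUB primitive, while kernel-level libraries such as cuBLAS or Thrust are invoked from the host:

    #include <cub/cub.cuh>

    __global__ void LayersKernel(int *d_data, int *d_sum)
    {
        // CUDA thread layer: each thread calls sequential/scalar functions
        // from the math and atomic APIs
        int x = __popc(d_data[threadIdx.x]);

        // Thread-block (SIMT) layer: the block cooperates through a CUB
        // primitive, communicating via its opaque shared memory type
        typedef cub::BlockReduce<int, 128> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;
        int block_total = BlockReduce(temp_storage).Sum(x);

        // The block-wide aggregate is valid in thread0
        if (threadIdx.x == 0) atomicAdd(d_sum, block_total);
    }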
@@ -334,46 +337,68 @@
 * To address this issue, we encapsulate cooperative procedures within
 * <em>reflective type structure</em> (C++ classes). As illustrated in the
 * cub::BlockScan example above, these primitives are C++ classes with
- * interfaces that expose both (1) procedural methods as well as (2) the opaque
- * shared memory types needed for their operation.
+ * interfaces that expose both:
+ * - Procedural entrypoints for a block of threads to invoke
+ * - An opaque shared memory type needed for the operation of those methods
 *
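A minimal sketch of this reflective pattern using cub::BlockScan's documented interface (the 128-thread specialization, the exclusive-sum operation, and the d_out parameter are illustrative choices):

    #include <cub/cub.cuh>

    __global__ void ScanKernel(int *d_out)
    {
        // The primitive is reflected as a type, specialized for this block
        typedef cub::BlockScan<int, 128> BlockScan;

        // (2) The opaque shared memory type needed by its methods
        __shared__ typename BlockScan::TempStorage temp_storage;

        int thread_data = 1;

        // (1) A procedural entrypoint invoked collectively by the block
        BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);

        d_out[threadIdx.x] = thread_data;   // thread i receives the value i
    }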
 * \subsection sec3sec3 6.3 &nbsp;&nbsp; Flexible data mapping
 *
 * \par
 * We often design kernels such that each thread block is assigned a "tile" of data
- * items for processing. When the tile size equals the thread block size, the
+ * items for processing.
+ *
+ * \par
+ * \image html tile.png
+ * <div class="centercaption">Tile of eight ordered data items</div>
+ *
+ * \par
+ * When the tile size equals the thread block size, the
 * mapping of data onto threads is straightforward (one datum per thread).
- * However, it is often desirable for performance reasons to process more
- * than one datum per thread. When doing so, we must decide how
- * to partition this "tile" of items across the thread block.
+ * However, there are often performance advantages to processing more
+ * than one datum per thread. For these scenarios, CUB primitives
+ * support the following alternatives for partitioning data items across
+ * the block of threads:
+ *
 *
+ * <table border="0px" cellpadding="0px" cellspacing="0px"><tr>
+ * <td>
 * \par
- * CUB primitives support the following data arrangements:
 * - <b><em>Blocked arrangement</em></b>. The aggregate tile of items is partitioned
 * evenly across threads in "blocked" fashion with thread<sub><em>i</em></sub>
 * owning the <em>i</em><sup>th</sup> segment of consecutive elements.
+ * </td>
+ * <td>
+ * \par
+ * \image html blocked.png
+ * <div class="centercaption"><em>Blocked</em> arrangement across four threads <br>(emphasis on items owned by <em>thread</em><sub>0</sub>)</div>
+ * </td>
+ * </tr><tr>
+ * <td>
+ * \par
 * - <b><em>Striped arrangement</em></b>. The aggregate tile of items is partitioned across
 * threads in "striped" fashion, i.e., the \p ITEMS_PER_THREAD items owned by
 * each thread have logical stride \p BLOCK_THREADS between them.
- * <br><br>
- * \image html thread_data_1.png
- * <div class="centercaption">Blocked vs. striped arrangements with \p BLOCK_THREADS = 4 and
- * \p ITEMS_PER_THREAD = 2, emphasis on items owned by <em>thread</em><sub>0</sub></div>
- * <br>
+ * </td>
+ * <td>
+ * \par
+ * \image html striped.png
+ * <div class="centercaption"><em>Striped</em> arrangement across four threads <br>(emphasis on items owned by <em>thread</em><sub>0</sub>)</div>
+ * </td>
+ * </tr></table>
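The two arrangements amount to simple index arithmetic. A sketch of the addressing, assuming illustrative constants BLOCK_THREADS = 128 and ITEMS_PER_THREAD = 4 (the hand-written loops are for exposition; cub::BlockLoad and cub::BlockStore provide these access patterns as primitives):

    enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 4 };

    __global__ void ArrangementKernel(const int *d_in)
    {
        int items[ITEMS_PER_THREAD];

        // Blocked: thread_i owns the i-th segment of consecutive items
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            items[i] = d_in[(threadIdx.x * ITEMS_PER_THREAD) + i];

        // Striped: each thread's items are BLOCK_THREADS apart
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            items[i] = d_in[threadIdx.x + (i * BLOCK_THREADS)];
    }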
 *
 * \par
 * The benefits of processing multiple items per thread (a.k.a., <em>register blocking</em>, <em>granularity coarsening</em>, etc.) include:
- * - <b>Algorithmic efficiency</b>. Sequential work over multiple items in
+ * - Algorithmic efficiency. Sequential work over multiple items in
 * thread-private registers is cheaper than synchronized, cooperative
 * work through shared memory spaces.
- * - <b>Data occupancy</b>. The number of items that can be resident on-chip in
+ * - Data occupancy. The number of items that can be resident on-chip in
 * thread-private register storage is often greater than the number of
 * schedulable threads.
- * - <b>Instruction-level parallelism</b>. Multiple items per thread also
+ * - Instruction-level parallelism. Multiple items per thread also
 * facilitates greater ILP for improved throughput and utilization.
 *
 * \par
- * The cub::BlockExchange primitive provides operations for converting between blocked
+ * Furthermore, cub::BlockExchange provides operations for converting between blocked
 * and striped arrangements. Blocked arrangements are often desirable for
 * algorithmic benefits (where long sequences of items can be processed sequentially
 * within each thread). Striped arrangements are often desirable for data movement
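A sketch of that conversion: load the tile in striped order for coalesced data movement, then exchange into a blocked arrangement for sequential per-thread work. The two-array form of StripedToBlocked shown here follows recent CUB releases (older releases expose an in-place single-array variant), and the constants are illustrative:

    #include <cub/cub.cuh>

    __global__ void ExchangeKernel(const int *d_in)
    {
        enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 4 };

        typedef cub::BlockExchange<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchange;
        __shared__ typename BlockExchange::TempStorage temp_storage;

        // Striped load: consecutive threads read consecutive words (coalesced)
        int items[ITEMS_PER_THREAD];
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            items[i] = d_in[threadIdx.x + (i * BLOCK_THREADS)];

        // Convert to a blocked arrangement for sequential per-thread work
        BlockExchange(temp_storage).StripedToBlocked(items, items);

        // ... process items[0..ITEMS_PER_THREAD) sequentially, then store
    }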

cub/docs/extra_stylesheet.css

Lines changed: 8 additions & 2 deletions
@@ -154,7 +154,13 @@ hr.footer {
  * Text content
  *****************************************************************************/
 
+.contents a {
+  color: #5675B2;
+}
 
+.contents a:visited {
+  color: #5675B2;
+}
 
 h2 {
   margin-top: 30px;
@@ -257,11 +263,11 @@ dl.section.user b {
 }
 
 dl.section.user a b {
-  color: #3D578C;
+  color: #5675B2;
 }
 
 dl.section.user a:visited b {
-  color: #4665A2;
+  color: #5675B2;
 }
 
 dl.section.user dd {

cub/docs/images/blocked.png (binary, 2.36 KB)

cub/docs/images/cub.png (binary, 2.17 KB)

cub/docs/images/striped.png (binary, 2.51 KB)

cub/docs/images/tile.png (binary, 454 Bytes)
