camenduru's picture
thanks to nvidia ❤
8ae5fc5
CUB
==================================================
.. toctree::
:hidden:
:maxdepth: 2
${repo_docs_api_path}/CUB_api
developer_overview
test_overview
.. the line below can be used to use the README.md file as the index page
.. .. mdinclude:: ../README.md
What is CUB?
==================================================
CUB provides state-of-the-art, reusable software components for every layer
of the CUDA programming model:
* **Parallel primitives**
* **Warp-wide "collective" primitives**
* Cooperative warp-wide prefix scan, reduction, etc.
* Safely specialized for each underlying CUDA architecture
* **Block-wide "collective" primitives**
* Cooperative I/O, sort, scan, reduction, histogram, etc.
* Compatible with arbitrary thread block sizes and types
* **Device-wide primitives**
* Parallel sort, prefix scan, reduction, histogram, etc.
* Compatible with CUDA dynamic parallelism
* **Utilities**
* **Fancy iterators**
* **Thread and thread block I/O**
* **PTX intrinsics**
* **Device, kernel, and storage management**
.. _collective-primitives:
CUB's collective primitives
==================================================
Collective software primitives are essential for constructing high-performance,
maintainable CUDA kernel code. Collectives allow complex parallel code to be
re-used rather than re-implemented, and to be re-compiled rather than
hand-ported.
.. figure:: img/cub_overview.png
:align: center
:alt: Orientation of collective primitives within the CUDA software stack
:name: fig_cub_overview
Orientation of collective primitives within the CUDA software stack
As a SIMT programming model, CUDA engenders both **scalar** and
**collective** software interfaces. Traditional software
interfaces are *scalar* : a single thread invokes a library routine to perform some
operation (which may include spawning parallel subtasks). Alternatively, a *collective*
interface is entered simultaneously by a group of parallel threads to perform
some cooperative operation.
CUB's collective primitives are not bound to any particular width of parallelism
or data type. This flexibility makes them:
* **Adaptable** to fit the needs of the enclosing kernel computation
* **Trivially tunable** to different grain sizes (threads per block, items per thread, etc.)
Thus CUB is *CUDA Unbound*.
An example (block-wide sorting)
==================================================
The following code snippet presents a CUDA kernel in which each block of ``BLOCK_THREADS`` threads
will collectively load, sort, and store its own segment of (``BLOCK_THREADS * ITEMS_PER_THREAD``)
integer keys:
.. code-block:: c++
#include <cub/cub.cuh>
//
// Block-sorting CUDA kernel
//
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
// Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
typedef cub::BlockLoad<
int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE> BlockLoadT;
typedef cub::BlockStore<
int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_TRANSPOSE> BlockStoreT;
typedef cub::BlockRadixSort<
int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
// Allocate type-safe, repurposable shared memory for collectives
__shared__ union {
typename BlockLoadT::TempStorage load;
typename BlockStoreT::TempStorage store;
typename BlockRadixSortT::TempStorage sort;
} temp_storage;
// Obtain this block's segment of consecutive keys (blocked across threads)
int thread_keys[ITEMS_PER_THREAD];
int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
BlockLoadT(temp_storage.load).Load(d_in + block_offset, thread_keys);
__syncthreads(); // Barrier for smem reuse
// Collectively sort the keys
BlockRadixSortT(temp_storage.sort).Sort(thread_keys);
__syncthreads(); // Barrier for smem reuse
// Store the sorted segment
BlockStoreT(temp_storage.store).Store(d_out + block_offset, thread_keys);
}
.. code-block:: c++
// Elsewhere in the host program: parameterize and launch a block-sorting
// kernel in which blocks of 128 threads each sort segments of 2048 keys
int *d_in = ...;
int *d_out = ...;
int num_blocks = ...;
BlockSortKernel<128, 16><<<num_blocks, 128>>>(d_in, d_out);
In this example, threads use ``cub::BlockLoad``, ``cub::BlockRadixSort``, and ``cub::BlockStore``
to collectively load, sort and store the block's segment of input items. Because these operations
are cooperative, each primitive requires an allocation of shared memory for threads to communicate
through. The typical usage pattern for a CUB collective is:
#. Statically specialize the primitive for the specific problem setting at hand, e.g.,
the data type being sorted, the number of threads per block, the number of keys per
thread, optional algorithmic alternatives, etc. (CUB primitives are also implicitly
specialized by the targeted compilation architecture.)
#. Allocate (or alias) an instance of the specialized primitive's nested ``TempStorage``
type within a shared memory space.
#. Specify communication details (e.g., the ``TempStorage`` allocation) to
construct an instance of the primitive.
#. Invoke methods on the primitive instance.
In particular, ``cub::BlockRadixSort`` is used to collectively sort the segment of data items
that have been partitioned across the thread block. To provide coalesced accesses
to device memory, we configure the cub::BlockLoad and cub::BlockStore primitives
to access memory using a striped access pattern (where consecutive threads
simultaneously access consecutive items) and then *transpose* the keys into
a :ref:`blocked arrangement <flexible-data-arrangement>` of elements across threads.
To reuse shared memory across all three primitives, the thread block statically
allocates a union of their ``TempStorage`` types.
Why do you need CUB?
==================================================
Writing, tuning, and maintaining kernel code is perhaps the most challenging,
time-consuming aspect of CUDA programming. Kernel software is where
the complexity of parallelism is expressed. Programmers must reason about
deadlock, livelock, synchronization, race conditions, shared memory layout,
plurality of state, granularity, throughput, latency, memory bottlenecks, etc.
With the exception of CUB, however, there are few (if any) software libraries of
*reusable* kernel primitives. In the CUDA ecosystem, CUB is unique in this regard.
As a `SIMT <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation>`_
library and software abstraction layer, CUB provides:
#. **Simplicity of composition**. CUB enhances programmer productivity by
allowing complex parallel operations to be easily sequenced and nested.
For example, ``cub::BlockRadixSort`` is constructed from cub::BlockExchange and
``cub::BlockRadixRank``. The latter is composed of cub::BlockScan
which incorporates cub::WarpScan.
.. figure:: img/nested_composition.png
:align: center
#. **High performance**. CUB simplifies high-performance program and kernel
development by taking care to implement the state-of-the-art in parallel algorithms.
#. **Performance portability**.
CUB primitives are specialized to match the diversity of NVIDIA hardware, continuously
evolving to accommodate new architecture-specific features and instructions. And
because CUB's device-wide primitives are implemented using flexible block-wide and
warp-wide collectives, we are able to performance-tune them to match the processor
resources provided by each CUDA processor architecture.
#. **Simplicity of performance tuning**:
* **Resource utilization**. CUB primitives allow developers to quickly
change grain sizes (threads per block, items per thread, etc.) to best match
the processor resources of their target architecture
* **Variant tuning**. Most CUB primitives support alternative algorithmic
strategies. For example, cub::BlockHistogram is parameterized to implement either
an atomic-based approach or a sorting-based approach. (The latter provides uniform
performance regardless of input distribution.)
* **Co-optimization**. When the enclosing kernel
is similarly parameterizable, a tuning configuration can be found that optimally
accommodates their combined register and shared memory pressure.
#. **Robustness and durability**. CUB just works. CUB primitives
are designed to function properly for arbitrary data types and widths of
parallelism (not just for the built-in C++ types or for powers-of-two threads
per block).
#. **Reduced maintenance burden**. CUB provides a SIMT software abstraction layer
over the diversity of CUDA hardware. With CUB, applications can enjoy
performance-portability without intensive and costly rewriting or porting efforts.
#. **A path for language evolution**. CUB primitives are designed
to easily accommodate new features in the CUDA programming model, e.g., thread
subgroups and named barriers, dynamic shared memory allocators, etc.
How do CUB collectives work?
==================================================
Four programming idioms are central to the design of CUB:
#. :ref:`Generic programming <generic-programming>`. C++ templates provide the flexibility
and adaptive code generation needed for CUB primitives to be useful, reusable, and
fast in arbitrary kernel settings.
#. :ref:`Reflective class interfaces <reflective-class-interfaces>`.
CUB collectives statically export their their resource requirements
(e.g., shared memory size and layout) for a given specialization, which allows compile-time
tuning decisions and resource allocation.
#. :ref:`Flexible data arrangement across threads <flexible-data-arrangement>`.
CUB collectives operate on data that is logically partitioned across a group of threads.
For most collective operations, efficiency is increased with increased granularity
(i.e., items per thread).
#. :ref:`Static tuning and co-tuning <static-tuning-and-co-tuning>`. Simple constants and static
types dictate the granularities and algorithmic alternatives to be employed by CUB collectives.
When the enclosing kernel is similarly parameterized, an optimal configuration can be determined
that best accommodates the combined behavior and resource consumption of all primitives within
the kernel.
.. _generic-programming:
Generic programming
--------------------------------------------------
We use template parameters to specialize CUB primitives for the particular
problem setting at hand. Until compile time, CUB primitives are not bound
to any particular:
* Data type (int, float, double, etc.)
* Width of parallelism (threads per thread block)
* Grain size (data items per thread)
* Underlying processor (special instructions, warp size, rules for bank conflicts, etc.)
* Tuning configuration (e.g., latency vs. throughput, algorithm selection, etc.)
.. _reflective-class-interfaces:
Reflective class interfaces
--------------------------------------------------
Unlike traditional function-oriented interfaces, CUB exposes its collective
primitives as templated C++ classes. The resource requirements for a specific
parameterization are reflectively advertised as members of the class. The
resources can then be statically or dynamically allocated, aliased
to global or shared memory, etc. The following illustrates a CUDA kernel
fragment performing a collective prefix sum across the threads of a thread block:
.. code-block:: c++
#include <cub/cub.cuh>
__global__ void SomeKernelFoo(...)
{
// Specialize BlockScan for 128 threads on integer types
typedef cub::BlockScan<int, 128> BlockScan;
// Allocate shared memory for BlockScan
__shared__ typename BlockScan::TempStorage scan_storage;
...
// Obtain a segment of consecutive items that are blocked across threads
int thread_data_in[4];
int thread_data_out[4];
...
// Perform an exclusive block-wide prefix sum
BlockScan(scan_storage).ExclusiveSum(thread_data_in, thread_data_out);
Furthermore, the CUB interface is designed to separate parameter
fields by concerns. CUB primitives have three distinct parameter fields:
#. *Static template parameters*. These are constants that will
dictate the storage layout and the unrolling of algorithmic steps (e.g.,
the input data type and the number of block threads), and are used to specialize the class.
#. *Constructor parameters*. These are optional parameters regarding
inter-thread communication (e.g., storage allocation, thread-identifier mapping,
named barriers, etc.), and are orthogonal to the functions exposed by the class.
#. *Formal method parameters*. These are the operational inputs/outputs
for the various functions exposed by the class.
This allows CUB types to easily accommodate new
programming model features (e.g., named barriers, memory allocators, etc.)
without incurring a combinatorial growth of interface methods.
.. _flexible-data-arrangement:
Flexible data arrangement across threads
--------------------------------------------------
CUDA kernels are often designed such that each thread block is assigned a
segment of data items for processing.
.. figure:: img/tile.png
:align: center
:alt: Segment of eight ordered data items
:name: fig_tile
Segment of eight ordered data items
When the tile size equals the thread block size, the
mapping of data onto threads is straightforward (one datum per thread).
However, there are often performance advantages for processing more
than one datum per thread. Increased granularity corresponds to
decreased communication overhead. For these scenarios, CUB primitives
will specify which of the following partitioning alternatives they
accommodate:
.. list-table::
:class: table-no-stripes
:widths: 70 30
* - **Blocked arrangement**. The aggregate tile of items is partitioned
evenly across threads in "blocked" fashion with *thread*\ :sub:`i`
owning the *i*\ :sup:`th` segment of consecutive elements.
Blocked arrangements are often desirable for algorithmic benefits (where
long sequences of items can be processed sequentially within each thread).
- .. figure:: img/blocked.png
:align: center
:alt: *Blocked* arrangement across four threads
:name: fig_blocked
*Blocked* arrangement across four threads
(emphasis on items owned by *thread*\ :sub:`0`)
* - **Striped arrangement**. The aggregate tile of items is partitioned across threads in "striped"
fashion, i.e., the ``ITEMS_PER_THREAD`` items owned by each thread have logical stride
``BLOCK_THREADS`` between them. Striped arrangements are often desirable for data movement through
global memory (where
`read/write coalescing <https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory>`_
is an important performance consideration).
- .. figure:: img/striped.png
:align: center
:alt: *Striped* arrangement across four threads
:name: fig_striped
*Striped* arrangement across four threads
(emphasis on items owned by *thread*\ :sub:`0`)
The benefits of processing multiple items per thread (a.k.a., *register blocking*,
*granularity coarsening*, etc.) include:
* Algorithmic efficiency. Sequential work over multiple items in
thread-private registers is cheaper than synchronized, cooperative
work through shared memory spaces.
* Data occupancy. The number of items that can be resident on-chip in
thread-private register storage is often greater than the number of
schedulable threads.
* Instruction-level parallelism. Multiple items per thread also
facilitates greater ILP for improved throughput and utilization.
Finally, cub::BlockExchange provides operations for converting between blocked
and striped arrangements.
.. _static-tuning-and-co-tuning:
Static tuning and co-tuning
--------------------------------------------------
This style of flexible interface simplifies performance tuning. Most CUB
primitives support alternative algorithmic strategies that can be
statically targeted by a compiler-based or JIT-based autotuner. (For
example, cub::BlockHistogram is parameterized to implement either an
atomic-based approach or a sorting-based approach.) Algorithms are also
tunable over parameters such as thread count and grain size as well.
Taken together, each of the CUB algorithms provides a fairly rich tuning
space.
Whereas conventional libraries are optimized offline and in isolation, CUB
provides interesting opportunities for whole-program optimization. For
example, each CUB primitive is typically parameterized by threads-per-block
and items-per-thread, both of which affect the underlying algorithm's
efficiency and resource requirements. When the enclosing kernel is similarly
parameterized, the coupled CUB primitives adjust accordingly. This enables
autotuners to search for a single configuration that maximizes the performance
of the entire kernel for a given set of hardware resources.
How do I get started using CUB?
==================================================
CUB is implemented as a C++ header library. There is no need to build CUB
separately. To use CUB primitives in your code, simply:
#. Download and unzip the latest CUB distribution
#. ``#include`` the "umbrella" ``<cub/cub.cuh>`` header file in
your CUDA C++ sources. (Or ``#include`` the particular
header files that define the CUB primitives you wish to use.)
#. Compile your program with NVIDIA's ``nvcc`` CUDA compiler,
specifying a ``-I<path-to-CUB>`` include-path flag to reference
the location of the CUB header library.
We also have collection of simple CUB example programs.
How is CUB different than Thrust and Modern GPU?
==================================================
CUB and Thrust
--------------------------------------------------
CUB and `Thrust <http://thrust.github.io/>`_ share some
similarities in that they both provide similar device-wide primitives for CUDA.
However, they target different abstraction layers for parallel computing.
Thrust abstractions are agnostic of any particular parallel framework (e.g.,
CUDA, TBB, OpenMP, sequential CPU, etc.). While Thrust has a "backend"
for CUDA devices, Thrust interfaces themselves are not CUDA-specific and
do not explicitly expose CUDA-specific details (e.g., ``cudaStream_t`` parameters).
CUB, on the other hand, is slightly lower-level than Thrust. CUB is specific
to CUDA C++ and its interfaces explicitly accommodate CUDA-specific features.
Furthermore, CUB is also a library of SIMT collective primitives for block-wide
and warp-wide kernel programming.
CUB and Thrust are complementary and can be used together. In fact, the CUB
project arose out of a maintenance need to achieve better performance-portability
within Thrust by using reusable block-wide primitives to reduce maintenance and
tuning effort.
CUB and Modern GPU
--------------------------------------------------
CUB and `Modern GPU <https://github.com/moderngpu/moderngpu>`_ also
share some similarities in that they both implement similar device-wide primitives for CUDA.
However, they serve different purposes for the CUDA programming community. MGPU
is a pedagogical tool for high-performance GPU computing, providing clear and concise
exemplary code and accompanying commentary. It serves as an excellent source of
educational, tutorial, CUDA-by-example material. The MGPU source code is intended
to be read and studied, and often favors simplicity at the expense of portability and
flexibility.
CUB, on the other hand, is a production-quality library whose sources are complicated
by support for every version of CUDA architecture, and is validated by an extensive
suite of regression tests. Although well-documented, the CUB source text is verbose
and relies heavily on C++ template metaprogramming for situational specialization.
CUB and MGPU are complementary in that MGPU serves as an excellent descriptive source
for many of the algorithmic techniques used by CUB.
Stable releases
==================================================
CUB releases are labeled using version identifiers having three fields:
``<epoch>.<feature>.<update>``. The *epoch* field
corresponds to support for a major change or update to the CUDA programming model.
The *feature* field corresponds to a stable set of features,
functionality, and interface. The *update* field corresponds to a
bug-fix or performance update for that feature set. At the moment, we do
not publicly provide non-stable releases such as development snapshots,
beta releases or rolling releases. (Feel free to contact us if you would
like access to such things.)
Contributors
==================================================
CUB is developed as an open-source project by NVIDIA.
The primary contributor is the CCCL team.
Open Source License
==================================================
CUB is available under the `BSD 3-Clause "New" or "Revised" License <https://github.com/NVIDIA/cub/blob/main/LICENSE.TXT>`_