thanks to nvidia ❤

8ae5fc5 over 2 years ago

21.8 kB

	CUB
	==================================================

	.. toctree::
	:hidden:
	:maxdepth: 2

	${repo_docs_api_path}/CUB_api
	developer_overview
	test_overview

	.. the line below can be used to use the README.md file as the index page
	.. .. mdinclude:: ../README.md

	What is CUB?
	==================================================

	CUB provides state-of-the-art, reusable software components for every layer
	of the CUDA programming model:

	* Parallel primitives

	* Warp-wide "collective" primitives

	* Cooperative warp-wide prefix scan, reduction, etc.
	* Safely specialized for each underlying CUDA architecture

	* Block-wide "collective" primitives

	* Cooperative I/O, sort, scan, reduction, histogram, etc.
	* Compatible with arbitrary thread block sizes and types

	* Device-wide primitives

	* Parallel sort, prefix scan, reduction, histogram, etc.
	* Compatible with CUDA dynamic parallelism

	* Utilities

	* Fancy iterators
	* Thread and thread block I/O
	* PTX intrinsics
	* Device, kernel, and storage management

	.. _collective-primitives:

	CUB's collective primitives
	==================================================

	Collective software primitives are essential for constructing high-performance,
	maintainable CUDA kernel code. Collectives allow complex parallel code to be
	re-used rather than re-implemented, and to be re-compiled rather than
	hand-ported.

	.. figure:: img/cub_overview.png
	:align: center
	:alt: Orientation of collective primitives within the CUDA software stack
	:name: fig_cub_overview

	Orientation of collective primitives within the CUDA software stack

	As a SIMT programming model, CUDA engenders both scalar and
	collective software interfaces. Traditional software
	interfaces are scalar : a single thread invokes a library routine to perform some
	operation (which may include spawning parallel subtasks). Alternatively, a collective
	interface is entered simultaneously by a group of parallel threads to perform
	some cooperative operation.

	CUB's collective primitives are not bound to any particular width of parallelism
	or data type. This flexibility makes them:

	* Adaptable to fit the needs of the enclosing kernel computation
	* Trivially tunable to different grain sizes (threads per block, items per thread, etc.)

	Thus CUB is CUDA Unbound.

	An example (block-wide sorting)
	==================================================

	The following code snippet presents a CUDA kernel in which each block of ``BLOCK_THREADS`` threads
	will collectively load, sort, and store its own segment of (``BLOCK_THREADS * ITEMS_PER_THREAD``)
	integer keys:

	.. code-block:: c++

	#include <cub/cub.cuh>

	//
	// Block-sorting CUDA kernel
	//
	template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
	__global__ void BlockSortKernel(int d_in, int d_out)
	{
	// Specialize BlockLoad, BlockStore, and BlockRadixSort collective types
	typedef cub::BlockLoad<
	int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_LOAD_TRANSPOSE> BlockLoadT;
	typedef cub::BlockStore<
	int*, BLOCK_THREADS, ITEMS_PER_THREAD, BLOCK_STORE_TRANSPOSE> BlockStoreT;
	typedef cub::BlockRadixSort<
	int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;

	// Allocate type-safe, repurposable shared memory for collectives
	__shared__ union {
	typename BlockLoadT::TempStorage load;
	typename BlockStoreT::TempStorage store;
	typename BlockRadixSortT::TempStorage sort;
	} temp_storage;

	// Obtain this block's segment of consecutive keys (blocked across threads)
	int thread_keys[ITEMS_PER_THREAD];
	int block_offset = blockIdx.x * (BLOCK_THREADS * ITEMS_PER_THREAD);
	BlockLoadT(temp_storage.load).Load(d_in + block_offset, thread_keys);

	__syncthreads(); // Barrier for smem reuse

	// Collectively sort the keys
	BlockRadixSortT(temp_storage.sort).Sort(thread_keys);

	__syncthreads(); // Barrier for smem reuse

	// Store the sorted segment
	BlockStoreT(temp_storage.store).Store(d_out + block_offset, thread_keys);
	}


	.. code-block:: c++

	// Elsewhere in the host program: parameterize and launch a block-sorting
	// kernel in which blocks of 128 threads each sort segments of 2048 keys
	int *d_in = ...;
	int *d_out = ...;
	int num_blocks = ...;
	BlockSortKernel<128, 16><<<num_blocks, 128>>>(d_in, d_out);


	In this example, threads use ``cub::BlockLoad``, ``cub::BlockRadixSort``, and ``cub::BlockStore``
	to collectively load, sort and store the block's segment of input items. Because these operations
	are cooperative, each primitive requires an allocation of shared memory for threads to communicate
	through. The typical usage pattern for a CUB collective is:

	#. Statically specialize the primitive for the specific problem setting at hand, e.g.,
	the data type being sorted, the number of threads per block, the number of keys per
	thread, optional algorithmic alternatives, etc. (CUB primitives are also implicitly
	specialized by the targeted compilation architecture.)
	#. Allocate (or alias) an instance of the specialized primitive's nested ``TempStorage``
	type within a shared memory space.
	#. Specify communication details (e.g., the ``TempStorage`` allocation) to
	construct an instance of the primitive.
	#. Invoke methods on the primitive instance.

	In particular, ``cub::BlockRadixSort`` is used to collectively sort the segment of data items
	that have been partitioned across the thread block. To provide coalesced accesses
	to device memory, we configure the cub::BlockLoad and cub::BlockStore primitives
	to access memory using a striped access pattern (where consecutive threads
	simultaneously access consecutive items) and then transpose the keys into
	a :ref:`blocked arrangement <flexible-data-arrangement>` of elements across threads.
	To reuse shared memory across all three primitives, the thread block statically
	allocates a union of their ``TempStorage`` types.


	Why do you need CUB?
	==================================================

	Writing, tuning, and maintaining kernel code is perhaps the most challenging,
	time-consuming aspect of CUDA programming. Kernel software is where
	the complexity of parallelism is expressed. Programmers must reason about
	deadlock, livelock, synchronization, race conditions, shared memory layout,
	plurality of state, granularity, throughput, latency, memory bottlenecks, etc.

	With the exception of CUB, however, there are few (if any) software libraries of
	reusable kernel primitives. In the CUDA ecosystem, CUB is unique in this regard.
	As a `SIMT <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#hardware-implementation>`_
	library and software abstraction layer, CUB provides:

	#. Simplicity of composition. CUB enhances programmer productivity by
	allowing complex parallel operations to be easily sequenced and nested.
	For example, ``cub::BlockRadixSort`` is constructed from cub::BlockExchange and
	``cub::BlockRadixRank``. The latter is composed of cub::BlockScan
	which incorporates cub::WarpScan.

	.. figure:: img/nested_composition.png
	:align: center

	#. High performance. CUB simplifies high-performance program and kernel
	development by taking care to implement the state-of-the-art in parallel algorithms.

	#. Performance portability.
	CUB primitives are specialized to match the diversity of NVIDIA hardware, continuously
	evolving to accommodate new architecture-specific features and instructions. And
	because CUB's device-wide primitives are implemented using flexible block-wide and
	warp-wide collectives, we are able to performance-tune them to match the processor
	resources provided by each CUDA processor architecture.

	#. Simplicity of performance tuning:

	* Resource utilization. CUB primitives allow developers to quickly
	change grain sizes (threads per block, items per thread, etc.) to best match
	the processor resources of their target architecture
	* Variant tuning. Most CUB primitives support alternative algorithmic
	strategies. For example, cub::BlockHistogram is parameterized to implement either
	an atomic-based approach or a sorting-based approach. (The latter provides uniform
	performance regardless of input distribution.)
	* Co-optimization. When the enclosing kernel
	is similarly parameterizable, a tuning configuration can be found that optimally
	accommodates their combined register and shared memory pressure.

	#. Robustness and durability. CUB just works. CUB primitives
	are designed to function properly for arbitrary data types and widths of
	parallelism (not just for the built-in C++ types or for powers-of-two threads
	per block).

	#. Reduced maintenance burden. CUB provides a SIMT software abstraction layer
	over the diversity of CUDA hardware. With CUB, applications can enjoy
	performance-portability without intensive and costly rewriting or porting efforts.

	#. A path for language evolution. CUB primitives are designed
	to easily accommodate new features in the CUDA programming model, e.g., thread
	subgroups and named barriers, dynamic shared memory allocators, etc.


	How do CUB collectives work?
	==================================================

	Four programming idioms are central to the design of CUB:

	#. :ref:`Generic programming <generic-programming>`. C++ templates provide the flexibility
	and adaptive code generation needed for CUB primitives to be useful, reusable, and
	fast in arbitrary kernel settings.
	#. :ref:`Reflective class interfaces <reflective-class-interfaces>`.
	CUB collectives statically export their their resource requirements
	(e.g., shared memory size and layout) for a given specialization, which allows compile-time
	tuning decisions and resource allocation.
	#. :ref:`Flexible data arrangement across threads <flexible-data-arrangement>`.
	CUB collectives operate on data that is logically partitioned across a group of threads.
	For most collective operations, efficiency is increased with increased granularity
	(i.e., items per thread).
	#. :ref:`Static tuning and co-tuning <static-tuning-and-co-tuning>`. Simple constants and static
	types dictate the granularities and algorithmic alternatives to be employed by CUB collectives.
	When the enclosing kernel is similarly parameterized, an optimal configuration can be determined
	that best accommodates the combined behavior and resource consumption of all primitives within
	the kernel.


	.. _generic-programming:

	Generic programming
	--------------------------------------------------

	We use template parameters to specialize CUB primitives for the particular
	problem setting at hand. Until compile time, CUB primitives are not bound
	to any particular:

	* Data type (int, float, double, etc.)
	* Width of parallelism (threads per thread block)
	* Grain size (data items per thread)
	* Underlying processor (special instructions, warp size, rules for bank conflicts, etc.)
	* Tuning configuration (e.g., latency vs. throughput, algorithm selection, etc.)


	.. _reflective-class-interfaces:

	Reflective class interfaces
	--------------------------------------------------

	Unlike traditional function-oriented interfaces, CUB exposes its collective
	primitives as templated C++ classes. The resource requirements for a specific
	parameterization are reflectively advertised as members of the class. The
	resources can then be statically or dynamically allocated, aliased
	to global or shared memory, etc. The following illustrates a CUDA kernel
	fragment performing a collective prefix sum across the threads of a thread block:

	.. code-block:: c++

	#include <cub/cub.cuh>

	__global__ void SomeKernelFoo(...)
	{
	// Specialize BlockScan for 128 threads on integer types
	typedef cub::BlockScan<int, 128> BlockScan;

	// Allocate shared memory for BlockScan
	__shared__ typename BlockScan::TempStorage scan_storage;

	...

	// Obtain a segment of consecutive items that are blocked across threads
	int thread_data_in[4];
	int thread_data_out[4];
	...

	// Perform an exclusive block-wide prefix sum
	BlockScan(scan_storage).ExclusiveSum(thread_data_in, thread_data_out);

	Furthermore, the CUB interface is designed to separate parameter
	fields by concerns. CUB primitives have three distinct parameter fields:

	#. Static template parameters. These are constants that will
	dictate the storage layout and the unrolling of algorithmic steps (e.g.,
	the input data type and the number of block threads), and are used to specialize the class.
	#. Constructor parameters. These are optional parameters regarding
	inter-thread communication (e.g., storage allocation, thread-identifier mapping,
	named barriers, etc.), and are orthogonal to the functions exposed by the class.
	#. Formal method parameters. These are the operational inputs/outputs
	for the various functions exposed by the class.

	This allows CUB types to easily accommodate new
	programming model features (e.g., named barriers, memory allocators, etc.)
	without incurring a combinatorial growth of interface methods.


	.. _flexible-data-arrangement:

	Flexible data arrangement across threads
	--------------------------------------------------

	CUDA kernels are often designed such that each thread block is assigned a
	segment of data items for processing.

	.. figure:: img/tile.png
	:align: center
	:alt: Segment of eight ordered data items
	:name: fig_tile

	Segment of eight ordered data items

	When the tile size equals the thread block size, the
	mapping of data onto threads is straightforward (one datum per thread).
	However, there are often performance advantages for processing more
	than one datum per thread. Increased granularity corresponds to
	decreased communication overhead. For these scenarios, CUB primitives
	will specify which of the following partitioning alternatives they
	accommodate:


	.. list-table::
	:class: table-no-stripes
	:widths: 70 30

	* - Blocked arrangement. The aggregate tile of items is partitioned
	evenly across threads in "blocked" fashion with thread\ :sub:`i`
	owning the i\ :sup:`th` segment of consecutive elements.
	Blocked arrangements are often desirable for algorithmic benefits (where
	long sequences of items can be processed sequentially within each thread).

	- .. figure:: img/blocked.png
	:align: center
	:alt: Blocked arrangement across four threads
	:name: fig_blocked

	Blocked arrangement across four threads

	(emphasis on items owned by thread\ :sub:`0`)

	* - Striped arrangement. The aggregate tile of items is partitioned across threads in "striped"
	fashion, i.e., the ``ITEMS_PER_THREAD`` items owned by each thread have logical stride
	``BLOCK_THREADS`` between them. Striped arrangements are often desirable for data movement through
	global memory (where
	`read/write coalescing <https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-to-global-memory>`_
	is an important performance consideration).
	- .. figure:: img/striped.png
	:align: center
	:alt: Striped arrangement across four threads
	:name: fig_striped

	Striped arrangement across four threads

	(emphasis on items owned by thread\ :sub:`0`)

	The benefits of processing multiple items per thread (a.k.a., register blocking,
	granularity coarsening, etc.) include:

	* Algorithmic efficiency. Sequential work over multiple items in
	thread-private registers is cheaper than synchronized, cooperative
	work through shared memory spaces.
	* Data occupancy. The number of items that can be resident on-chip in
	thread-private register storage is often greater than the number of
	schedulable threads.
	* Instruction-level parallelism. Multiple items per thread also
	facilitates greater ILP for improved throughput and utilization.

	Finally, cub::BlockExchange provides operations for converting between blocked
	and striped arrangements.


	.. _static-tuning-and-co-tuning:

	Static tuning and co-tuning
	--------------------------------------------------

	This style of flexible interface simplifies performance tuning. Most CUB
	primitives support alternative algorithmic strategies that can be
	statically targeted by a compiler-based or JIT-based autotuner. (For
	example, cub::BlockHistogram is parameterized to implement either an
	atomic-based approach or a sorting-based approach.) Algorithms are also
	tunable over parameters such as thread count and grain size as well.
	Taken together, each of the CUB algorithms provides a fairly rich tuning
	space.

	Whereas conventional libraries are optimized offline and in isolation, CUB
	provides interesting opportunities for whole-program optimization. For
	example, each CUB primitive is typically parameterized by threads-per-block
	and items-per-thread, both of which affect the underlying algorithm's
	efficiency and resource requirements. When the enclosing kernel is similarly
	parameterized, the coupled CUB primitives adjust accordingly. This enables
	autotuners to search for a single configuration that maximizes the performance
	of the entire kernel for a given set of hardware resources.


	How do I get started using CUB?
	==================================================

	CUB is implemented as a C++ header library. There is no need to build CUB
	separately. To use CUB primitives in your code, simply:

	#. Download and unzip the latest CUB distribution
	#. ``#include`` the "umbrella" ``<cub/cub.cuh>`` header file in
	your CUDA C++ sources. (Or ``#include`` the particular
	header files that define the CUB primitives you wish to use.)
	#. Compile your program with NVIDIA's ``nvcc`` CUDA compiler,
	specifying a ``-I<path-to-CUB>`` include-path flag to reference
	the location of the CUB header library.

	We also have collection of simple CUB example programs.


	How is CUB different than Thrust and Modern GPU?
	==================================================


	CUB and Thrust
	--------------------------------------------------

	CUB and `Thrust <http://thrust.github.io/>`_ share some
	similarities in that they both provide similar device-wide primitives for CUDA.
	However, they target different abstraction layers for parallel computing.
	Thrust abstractions are agnostic of any particular parallel framework (e.g.,
	CUDA, TBB, OpenMP, sequential CPU, etc.). While Thrust has a "backend"
	for CUDA devices, Thrust interfaces themselves are not CUDA-specific and
	do not explicitly expose CUDA-specific details (e.g., ``cudaStream_t`` parameters).

	CUB, on the other hand, is slightly lower-level than Thrust. CUB is specific
	to CUDA C++ and its interfaces explicitly accommodate CUDA-specific features.
	Furthermore, CUB is also a library of SIMT collective primitives for block-wide
	and warp-wide kernel programming.

	CUB and Thrust are complementary and can be used together. In fact, the CUB
	project arose out of a maintenance need to achieve better performance-portability
	within Thrust by using reusable block-wide primitives to reduce maintenance and
	tuning effort.


	CUB and Modern GPU
	--------------------------------------------------

	CUB and `Modern GPU <https://github.com/moderngpu/moderngpu>`_ also
	share some similarities in that they both implement similar device-wide primitives for CUDA.
	However, they serve different purposes for the CUDA programming community. MGPU
	is a pedagogical tool for high-performance GPU computing, providing clear and concise
	exemplary code and accompanying commentary. It serves as an excellent source of
	educational, tutorial, CUDA-by-example material. The MGPU source code is intended
	to be read and studied, and often favors simplicity at the expense of portability and
	flexibility.

	CUB, on the other hand, is a production-quality library whose sources are complicated
	by support for every version of CUDA architecture, and is validated by an extensive
	suite of regression tests. Although well-documented, the CUB source text is verbose
	and relies heavily on C++ template metaprogramming for situational specialization.

	CUB and MGPU are complementary in that MGPU serves as an excellent descriptive source
	for many of the algorithmic techniques used by CUB.


	Stable releases
	==================================================

	CUB releases are labeled using version identifiers having three fields:
	``<epoch>.<feature>.<update>``. The epoch field
	corresponds to support for a major change or update to the CUDA programming model.
	The feature field corresponds to a stable set of features,
	functionality, and interface. The update field corresponds to a
	bug-fix or performance update for that feature set. At the moment, we do
	not publicly provide non-stable releases such as development snapshots,
	beta releases or rolling releases. (Feel free to contact us if you would
	like access to such things.)


	Contributors
	==================================================

	CUB is developed as an open-source project by NVIDIA.
	The primary contributor is the CCCL team.


	Open Source License
	==================================================

	CUB is available under the `BSD 3-Clause "New" or "Revised" License <https://github.com/NVIDIA/cub/blob/main/LICENSE.TXT>`_