thanks to nvidia ❤

8ae5fc5 over 2 years ago

6.19 kB

	---
	grand_parent: Extended API
	parent: Synchronization Primitives
	---

	# `cuda::pipeline`

	Defined in header `<cuda/pipeline>`:

	```cuda
	template <cuda::thread_scope Scope>
	class cuda::pipeline {
	public:
	pipeline() = delete;

	__host__ __device__ ~pipeline();

	pipeline& operator=(pipeline const&) = delete;

	__host__ __device__ void producer_acquire();

	__host__ __device__ void producer_commit();

	__host__ __device__ void consumer_wait();

	template <typename Rep, typename Period>
	__host__ __device__ bool consumer_wait_for(
	cuda::std::chrono::duration<Rep, Period> const& duration);

	template <typename Clock, typename Duration>
	__host__ __device__
	bool consumer_wait_until(
	cuda::std::chrono::time_point<Clock, Duration> const& time_point);

	__host__ __device__ void consumer_release();

	__host__ __device__ bool quit();
	};
	```

	The class template `cuda::pipeline` provides a coordination mechanism which
	can sequence [asynchronous operations], such as [`cuda::memcpy_async`], into
	stages.

	A thread interacts with a _pipeline stage_ using the following pattern:
	1. Acquire the pipeline stage.
	2. Commit some operations to the stage.
	3. Wait for the previously committed operations to complete.
	4. Release the pipeline stage.

	For [`cuda::thread_scope`]s other than `cuda::thread_scope_thread`, a
	[`cuda::pipeline_shared_state`] is required to coordinate the participating
	threads.

	_Pipelines_ can be either _unified_ or _partitioned_.
	In a _unified pipeline_, all the participating threads are both producers and
	consumers.
	In a _partitioned pipeline_, each participating thread is either a producer or
	a consumer.

	## Template Parameters

	\| `Scope` \| The scope of threads participating in the _pipeline_. \|

	## Member Functions

	\| (constructor) [deleted] \| `cuda::pipeline` is not constructible. \|
	\| [(destructor)] \| Destroys the `cuda::pipeline`. \|
	\| `operator=` [deleted] \| `cuda::pipeline` is not assignable. \|
	\| [`producer_acquire`] \| Blocks the current thread until the next _pipeline stage_ is available. \|
	\| [`producer_commit`] \| Commits operations previously issued by the current thread to the current _pipeline stage_. \|
	\| [`consumer_wait`] \| Blocks the current thread until all operations committed to the current _pipeline stage_ complete. \|
	\| [`consumer_wait_for`] \| Blocks the current thread until all operations committed to the current _pipeline stage_ complete or after the specified timeout duration. \|
	\| [`consumer_wait_until`] \| Blocks the current thread until all operations committed to the current _pipeline stage_ complete or until specified time point has been reached. \|
	\| [`consumer_release`] \| Release the current _pipeline stage_. \|
	\| [`quit`] \| Quits current thread's participation in the _pipeline_. \|

	## Notes

	A thread role cannot change during the lifetime of the pipeline object.

	## Example

	```cuda
	#include <cuda/pipeline>
	#include <cooperative_groups.h>

	// Disables `pipeline_shared_state` initialization warning.
	#pragma diag_suppress static_var_with_dynamic_init

	template <typename T>
	__device__ void compute(T* ptr);

	template <typename T>
	__global__ void example_kernel(T* global0, T* global1, cuda::std::size_t subset_count) {
	extern __shared__ T s[];
	auto group = cooperative_groups::this_thread_block();
	T* shared[2] = { s, s + 2 * group.size() };

	// Create a pipeline.
	constexpr auto scope = cuda::thread_scope_block;
	constexpr auto stages_count = 2;
	__shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
	auto pipeline = cuda::make_pipeline(group, &shared_state);

	// Prime the pipeline.
	pipeline.producer_acquire();
	cuda::memcpy_async(group, shared[0],
	&global0[0], sizeof(T) * group.size(), pipeline);
	cuda::memcpy_async(group, shared[0] + group.size(),
	&global1[0], sizeof(T) * group.size(), pipeline);
	pipeline.producer_commit();

	// Pipelined copy/compute.
	for (cuda::std::size_t subset = 1; subset < subset_count; ++subset) {
	pipeline.producer_acquire();
	cuda::memcpy_async(group, shared[subset % 2],
	&global0[subset * group.size()],
	sizeof(T) * group.size(), pipeline);
	cuda::memcpy_async(group, shared[subset % 2] + group.size(),
	&global1[subset * group.size()],
	sizeof(T) * group.size(), pipeline);
	pipeline.producer_commit();
	pipeline.consumer_wait();
	compute(shared[(subset - 1) % 2]);
	pipeline.consumer_release();
	}

	// Drain the pipeline.
	pipeline.consumer_wait();
	compute(shared[(subset_count - 1) % 2]);
	pipeline.consumer_release();
	}

	template void __global__ example_kernel<int>(int, int, cuda::std::size_t);
	```

	[See it on Godbolt](https://godbolt.org/z/zc41bWvja){: .btn }


	[asynchronous operations]: ../asynchronous_operations.md
	[`cuda::memcpy_async`]: ../asynchronous_operations/memcpy_async.md

	[`cuda::thread_scope`]: ../thread_scopes.md
	[`cuda::pipeline_shared_state`]: ./pipeline_shared_state.md

	[(destructor)]: ./pipeline/destructor.md
	[`producer_acquire`]: ./pipeline/producer_acquire.md
	[`producer_commit`]: ./pipeline/producer_commit.md
	[`consumer_wait`]: ./pipeline/consumer_wait.md
	[`consumer_wait_for`]: ./pipeline/consumer_wait.md
	[`consumer_wait_until`]: ./pipeline/consumer_wait.md
	[`consumer_release`]: ./pipeline/consumer_release.md
	[`quit`]: ./pipeline/quit.md