---
grand_parent: Extended API
parent: Synchronization Primitives
---
# `cuda::pipeline`
Defined in header `<cuda/pipeline>`:
```cuda
template <cuda::thread_scope Scope>
class cuda::pipeline {
public:
  pipeline() = delete;

  __host__ __device__ ~pipeline();

  pipeline& operator=(pipeline const&) = delete;

  __host__ __device__ void producer_acquire();

  __host__ __device__ void producer_commit();

  __host__ __device__ void consumer_wait();

  template <typename Rep, typename Period>
  __host__ __device__
  bool consumer_wait_for(cuda::std::chrono::duration<Rep, Period> const& duration);

  template <typename Clock, typename Duration>
  __host__ __device__
  bool consumer_wait_until(cuda::std::chrono::time_point<Clock, Duration> const& time_point);

  __host__ __device__ void consumer_release();

  __host__ __device__ bool quit();
};
```
The class template `cuda::pipeline` provides a coordination mechanism which
can sequence [asynchronous operations], such as [`cuda::memcpy_async`], into
stages.
A thread interacts with a _pipeline stage_ using the following pattern:
1. Acquire the pipeline stage.
2. Commit some operations to the stage.
3. Wait for the previously committed operations to complete.
4. Release the pipeline stage.
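For a single thread, this four-step pattern can be sketched with a thread-scope pipeline, which requires no shared state (a minimal sketch; `process` is a hypothetical device function, not part of the library):

```cuda
#include <cuda/pipeline>

// Hypothetical per-element computation.
__device__ void process(int* ptr);

__global__ void single_thread_example(int* global, int count) {
  __shared__ int smem[256];
  // A thread-scope pipeline needs no cuda::pipeline_shared_state.
  cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();
  for (int i = 0; i < count; ++i) {
    pipeline.producer_acquire();                         // 1. Acquire the stage.
    cuda::memcpy_async(&smem[threadIdx.x],
                       &global[i * blockDim.x + threadIdx.x],
                       sizeof(int), pipeline);           // 2. Commit operations
    pipeline.producer_commit();                          //    to the stage.
    pipeline.consumer_wait();                            // 3. Wait for completion.
    process(&smem[threadIdx.x]);
    pipeline.consumer_release();                         // 4. Release the stage.
  }
}
```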
For [`cuda::thread_scope`]s other than `cuda::thread_scope_thread`, a
[`cuda::pipeline_shared_state`] is required to coordinate the participating
threads.
_Pipelines_ can be either _unified_ or _partitioned_.
In a _unified pipeline_, all the participating threads are both producers and
consumers.
In a _partitioned pipeline_, each participating thread is either a producer or
a consumer.
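A partitioned pipeline is created by passing a `cuda::pipeline_role` for the calling thread to `cuda::make_pipeline`. The following sketch splits a thread block into a producing half and a consuming half (the partitioning scheme and the `compute_batch` device function are illustrative assumptions, not part of the library):

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Disables `pipeline_shared_state` initialization warning.
#pragma diag_suppress static_var_with_dynamic_init

// Hypothetical per-batch computation.
__device__ void compute_batch(int* ptr, unsigned count);

__global__ void partitioned_example(int* global, cuda::std::size_t batches) {
  extern __shared__ int smem[];
  auto group = cooperative_groups::this_thread_block();
  constexpr auto scope = cuda::thread_scope_block;
  __shared__ cuda::pipeline_shared_state<scope, 2> shared_state;

  // First half of the block produces (copies), second half consumes (computes).
  unsigned const half = group.size() / 2;
  auto const role = group.thread_rank() < half
                        ? cuda::pipeline_role::producer
                        : cuda::pipeline_role::consumer;
  auto pipeline = cuda::make_pipeline(group, &shared_state, role);

  for (cuda::std::size_t b = 0; b < batches; ++b) {
    if (role == cuda::pipeline_role::producer) {
      pipeline.producer_acquire();
      cuda::memcpy_async(&smem[group.thread_rank()],
                         &global[b * half + group.thread_rank()],
                         sizeof(int), pipeline);
      pipeline.producer_commit();
    } else {
      pipeline.consumer_wait();
      compute_batch(smem, half);
      pipeline.consumer_release();
    }
  }
}
```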
## Template Parameters
| Parameter | Description |
|-----------|-------------|
| `Scope` | The scope of threads participating in the _pipeline_. |
## Member Functions
| Member function | Description |
|-----------------|-------------|
| (constructor) [deleted] | `cuda::pipeline` is not constructible. |
| [(destructor)] | Destroys the `cuda::pipeline`. |
| `operator=` [deleted] | `cuda::pipeline` is not assignable. |
| [`producer_acquire`] | Blocks the current thread until the next _pipeline stage_ is available. |
| [`producer_commit`] | Commits operations previously issued by the current thread to the current _pipeline stage_. |
| [`consumer_wait`] | Blocks the current thread until all operations committed to the current _pipeline stage_ complete. |
| [`consumer_wait_for`] | Blocks the current thread until all operations committed to the current _pipeline stage_ complete or the specified timeout duration elapses. |
| [`consumer_wait_until`] | Blocks the current thread until all operations committed to the current _pipeline stage_ complete or the specified time point is reached. |
| [`consumer_release`] | Releases the current _pipeline stage_. |
| [`quit`] | Quits the current thread's participation in the _pipeline_. |
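The timed waits let a consumer bound how long it blocks on a stage. A sketch of polling with `consumer_wait_for`, which returns `false` on timeout (`do_other_work` is a hypothetical device function standing in for unrelated work):

```cuda
#include <cuda/pipeline>
#include <cuda/std/chrono>

// Hypothetical work to overlap with the wait.
__device__ void do_other_work();

__device__ void wait_with_timeout(
    cuda::pipeline<cuda::thread_scope_thread>& pipeline) {
  // Poll the current stage, doing unrelated work while it is incomplete.
  while (!pipeline.consumer_wait_for(cuda::std::chrono::microseconds(10))) {
    do_other_work();
  }
  // All operations committed to the stage are now complete.
  pipeline.consumer_release();
}
```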
## Notes
A thread's role cannot change during the lifetime of the pipeline object.
## Example
```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Disables `pipeline_shared_state` initialization warning.
#pragma diag_suppress static_var_with_dynamic_init

template <typename T>
__device__ void compute(T* ptr);

template <typename T>
__global__ void example_kernel(T* global0, T* global1, cuda::std::size_t subset_count) {
  extern __shared__ T s[];
  auto group = cooperative_groups::this_thread_block();
  T* shared[2] = { s, s + 2 * group.size() };

  // Create a pipeline.
  constexpr auto scope = cuda::thread_scope_block;
  constexpr auto stages_count = 2;
  __shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
  auto pipeline = cuda::make_pipeline(group, &shared_state);

  // Prime the pipeline.
  pipeline.producer_acquire();
  cuda::memcpy_async(group, shared[0],
                     &global0[0], sizeof(T) * group.size(), pipeline);
  cuda::memcpy_async(group, shared[0] + group.size(),
                     &global1[0], sizeof(T) * group.size(), pipeline);
  pipeline.producer_commit();

  // Pipelined copy/compute.
  for (cuda::std::size_t subset = 1; subset < subset_count; ++subset) {
    pipeline.producer_acquire();
    cuda::memcpy_async(group, shared[subset % 2],
                       &global0[subset * group.size()],
                       sizeof(T) * group.size(), pipeline);
    cuda::memcpy_async(group, shared[subset % 2] + group.size(),
                       &global1[subset * group.size()],
                       sizeof(T) * group.size(), pipeline);
    pipeline.producer_commit();
    pipeline.consumer_wait();
    compute(shared[(subset - 1) % 2]);
    pipeline.consumer_release();
  }

  // Drain the pipeline.
  pipeline.consumer_wait();
  compute(shared[(subset_count - 1) % 2]);
  pipeline.consumer_release();
}

template __global__ void example_kernel<int>(int*, int*, cuda::std::size_t);
```
[See it on Godbolt](https://godbolt.org/z/zc41bWvja){: .btn }
[asynchronous operations]: ../asynchronous_operations.md
[`cuda::memcpy_async`]: ../asynchronous_operations/memcpy_async.md
[`cuda::thread_scope`]: ../thread_scopes.md
[`cuda::pipeline_shared_state`]: ./pipeline_shared_state.md
[(destructor)]: ./pipeline/destructor.md
[`producer_acquire`]: ./pipeline/producer_acquire.md
[`producer_commit`]: ./pipeline/producer_commit.md
[`consumer_wait`]: ./pipeline/consumer_wait.md
[`consumer_wait_for`]: ./pipeline/consumer_wait.md
[`consumer_wait_until`]: ./pipeline/consumer_wait.md
[`consumer_release`]: ./pipeline/consumer_release.md
[`quit`]: ./pipeline/quit.md