---
grand_parent: Extended API
parent: Synchronization Primitives
---

# `cuda::pipeline`

Defined in header `<cuda/pipeline>`:

```cuda
template <cuda::thread_scope Scope>
class cuda::pipeline {
public:
  pipeline() = delete;

  __host__ __device__ ~pipeline();

  pipeline& operator=(pipeline const&) = delete;

  __host__ __device__ void producer_acquire();

  __host__ __device__ void producer_commit();

  __host__ __device__ void consumer_wait();

  template <typename Rep, typename Period>
  __host__ __device__ bool consumer_wait_for(
    cuda::std::chrono::duration<Rep, Period> const& duration);

  template <typename Clock, typename Duration>
  __host__ __device__
  bool consumer_wait_until(
    cuda::std::chrono::time_point<Clock, Duration> const& time_point);

  __host__ __device__ void consumer_release();

  __host__ __device__ bool quit();
};
```

The class template `cuda::pipeline` provides a coordination mechanism which
  can sequence [asynchronous operations], such as [`cuda::memcpy_async`], into
  stages.

A thread interacts with a _pipeline stage_ using the following pattern:
1. Acquire the pipeline stage.
2. Commit some operations to the stage.
3. Wait for the previously committed operations to complete.
4. Release the pipeline stage.
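
For a `cuda::thread_scope_thread` pipeline, this pattern can be sketched without any shared state (a minimal illustration, not part of the synopsis above; the kernel name is hypothetical):

```cuda
#include <cuda/pipeline>

// Minimal single-thread sketch of the acquire/commit/wait/release pattern.
__global__ void copy_one(int* dst, int const* src) {
  // A thread-scope pipeline requires no cuda::pipeline_shared_state.
  auto pipe = cuda::make_pipeline();

  pipe.producer_acquire();                          // 1. acquire the stage
  cuda::memcpy_async(dst, src, sizeof(int), pipe);  // 2. commit an operation
  pipe.producer_commit();
  pipe.consumer_wait();                             // 3. wait for completion
  pipe.consumer_release();                          // 4. release the stage
}
```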

For [`cuda::thread_scope`]s other than `cuda::thread_scope_thread`, a
  [`cuda::pipeline_shared_state`] is required to coordinate the participating
  threads.

_Pipelines_ can be either _unified_ or _partitioned_.
In a _unified pipeline_, all the participating threads are both producers and
  consumers.
In a _partitioned pipeline_, each participating thread is either a producer or
  a consumer.
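
A partitioned pipeline is created by passing a `cuda::pipeline_role` to `cuda::make_pipeline`. The sketch below splits threads by even/odd rank, which is an arbitrary, illustrative choice:

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Disables `pipeline_shared_state` initialization warning.
#pragma diag_suppress static_var_with_dynamic_init

__global__ void partitioned_example() {
  auto group = cooperative_groups::this_thread_block();
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;

  // Each thread picks a fixed role for the pipeline's entire lifetime;
  // here, even-ranked threads produce and odd-ranked threads consume.
  auto role = (group.thread_rank() % 2 == 0) ? cuda::pipeline_role::producer
                                             : cuda::pipeline_role::consumer;
  auto pipe = cuda::make_pipeline(group, &state, role);
  // ... producers call producer_acquire/producer_commit,
  //     consumers call consumer_wait/consumer_release ...
}
```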

## Template Parameters

| Parameter | Description                                            |
|-----------|--------------------------------------------------------|
| `Scope`   | The scope of threads participating in the _pipeline_.  |

## Member Functions

| Member                  | Description                                                                                                                                       |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| (constructor) [deleted] | `cuda::pipeline` is not constructible.                                                                                                            |
| [(destructor)]          | Destroys the `cuda::pipeline`.                                                                                                                    |
| `operator=` [deleted]   | `cuda::pipeline` is not assignable.                                                                                                               |
| [`producer_acquire`]    | Blocks the current thread until the next _pipeline stage_ is available.                                                                           |
| [`producer_commit`]     | Commits operations previously issued by the current thread to the current _pipeline stage_.                                                       |
| [`consumer_wait`]       | Blocks the current thread until all operations committed to the current _pipeline stage_ complete.                                                |
| [`consumer_wait_for`]   | Blocks the current thread until all operations committed to the current _pipeline stage_ complete or after the specified timeout duration.        |
| [`consumer_wait_until`] | Blocks the current thread until all operations committed to the current _pipeline stage_ complete or until specified time point has been reached. |
| [`consumer_release`]    | Releases the current _pipeline stage_.                                                                                                            |
| [`quit`]                | Quits the current thread's participation in the _pipeline_.                                                                                       |
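
The timed waits return `false` on timeout, so a consumer can interleave other work and retry rather than block indefinitely. A hedged sketch (the function name is hypothetical):

```cuda
#include <cuda/pipeline>
#include <cuda/std/chrono>

// Sketch: poll a stage with a timeout instead of blocking indefinitely.
__device__ void consume_with_timeout(
    cuda::pipeline<cuda::thread_scope_thread>& pipe) {
  while (!pipe.consumer_wait_for(cuda::std::chrono::microseconds{10})) {
    // Timed out: the stage's operations have not completed yet;
    // other per-thread work could be done here before retrying.
  }
  pipe.consumer_release();
}
```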

## Notes

A thread's role cannot change during the lifetime of the pipeline object.

## Example

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Disables `pipeline_shared_state` initialization warning.
#pragma diag_suppress static_var_with_dynamic_init

template <typename T>
__device__ void compute(T* ptr);

template <typename T>
__global__ void example_kernel(T* global0, T* global1, cuda::std::size_t subset_count) {
  extern __shared__ T s[];
  auto group = cooperative_groups::this_thread_block();
  T* shared[2] = { s, s + 2 * group.size() };

  // Create a pipeline.
  constexpr auto scope = cuda::thread_scope_block;
  constexpr auto stages_count = 2;
  __shared__ cuda::pipeline_shared_state<scope, stages_count> shared_state;
  auto pipeline = cuda::make_pipeline(group, &shared_state);

  // Prime the pipeline.
  pipeline.producer_acquire();
  cuda::memcpy_async(group, shared[0],
                     &global0[0], sizeof(T) * group.size(), pipeline);
  cuda::memcpy_async(group, shared[0] + group.size(),
                     &global1[0], sizeof(T) * group.size(), pipeline);
  pipeline.producer_commit();

  // Pipelined copy/compute.
  for (cuda::std::size_t subset = 1; subset < subset_count; ++subset) {
    pipeline.producer_acquire();
    cuda::memcpy_async(group, shared[subset % 2],
                       &global0[subset * group.size()],
                       sizeof(T) * group.size(), pipeline);
    cuda::memcpy_async(group, shared[subset % 2] + group.size(),
                       &global1[subset * group.size()],
                       sizeof(T) * group.size(), pipeline);
    pipeline.producer_commit();
    pipeline.consumer_wait();
    compute(shared[(subset - 1) % 2]);
    pipeline.consumer_release();
  }

  // Drain the pipeline.
  pipeline.consumer_wait();
  compute(shared[(subset_count - 1) % 2]);
  pipeline.consumer_release();
}

template void __global__ example_kernel<int>(int*, int*, cuda::std::size_t);
```

[See it on Godbolt](https://godbolt.org/z/zc41bWvja){: .btn }


[asynchronous operations]: ../asynchronous_operations.md
[`cuda::memcpy_async`]: ../asynchronous_operations/memcpy_async.md

[`cuda::thread_scope`]: ../thread_scopes.md
[`cuda::pipeline_shared_state`]: ./pipeline_shared_state.md

[(destructor)]: ./pipeline/destructor.md
[`producer_acquire`]: ./pipeline/producer_acquire.md
[`producer_commit`]: ./pipeline/producer_commit.md
[`consumer_wait`]: ./pipeline/consumer_wait.md
[`consumer_wait_for`]: ./pipeline/consumer_wait.md
[`consumer_wait_until`]: ./pipeline/consumer_wait.md
[`consumer_release`]: ./pipeline/consumer_release.md
[`quit`]: ./pipeline/quit.md