![CUTLASS Code Organization](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Code Organization")

# CUTLASS Code Organization

This document describes the layout of the CUTLASS repository. The main components are:

* **CUTLASS Template Library** - CUDA Templates for Linear Algebra Subroutines and Solvers (header only)
* **CUTLASS Utilities** - additional templates and a companion library supporting tests, examples, and client applications
* **CUTLASS Instance Library** - instantiations of CUTLASS templates covering the design space
* **CUTLASS Profiler** - command-line application for evaluating the functionality and performance of CUTLASS operations
* **Examples** - SDK examples of CUTLASS Template Library and components
* **Media** - supporting documentation and media content
* **Tests** - test components for CUTLASS Template Library and tools

## CUTLASS Template Library

CUDA Templates for Linear Algebra Subroutines and Solvers is a library of CUDA C++ template classes for
performing efficient matrix computations on NVIDIA GPUs.

Like NVIDIA CUB, the components of CUTLASS are organized hierarchically based on the scope of cooperative
elements. For example, warp-level GEMM components perform a matrix multiply collectively across the
set of threads within a warp. The figure at the top of this document illustrates each layer.

Components are designed to be usable by client applications accessing functionality at each scope.

CUTLASS Templates are implemented by header files in the following directory structure:

```
include/                     # Top-level include directory. Client applications should target this path.

  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only

    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
      *
    gemm/                    # code specialized for general matrix product computations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions for GEMM

    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
      *

    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" models
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions

    transform/               # code specialized for layout, type, and domain transformations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions

    util/                    # miscellaneous CUTLASS components
      *
    *                        # core vocabulary types and fundamental arithmetic operators
```

See [Programming Guidelines](/media/docs/programming_guidelines.md) for further details about
conventions and design patterns used throughout CUTLASS.
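
To make the scope hierarchy concrete, the following Python sketch maps an operation family and a scope to the corresponding header directory from the listing above. The helper `header_dir` and the two tuples are hypothetical illustrations and are not part of CUTLASS; only the directory names are taken from the listing.

```python
from pathlib import PurePosixPath

# Scopes from the directory listing, ordered from finest to coarsest.
SCOPES = ("thread", "warp", "threadblock", "kernel", "device")

# Operation families that the listing shows with per-scope subdirectories.
FAMILIES = ("gemm", "reduction", "transform")


def header_dir(family: str, scope: str) -> PurePosixPath:
    """Return the include directory for a family at a scope (hypothetical helper)."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family!r}")
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    return PurePosixPath("include/cutlass") / family / scope


if __name__ == "__main__":
    # Print the GEMM header directories from thread level up to device level.
    for scope in SCOPES:
        print(header_dir("gemm", scope))
```

A client that only needs warp-level GEMM building blocks would, under this layout, look in `include/cutlass/gemm/warp/`, while one that wants a complete launchable operation would start from `include/cutlass/gemm/device/`.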

## Tools

The `tools/` directory contains clients of the CUTLASS Template Library and includes the following components.

### CUTLASS Instance Library

The CUTLASS Instance Library contains instantiations of the above CUTLASS templates covering supported configurations,
data types, block structure, and tile sizes. These instantiations are procedurally generated using a set of 
scripts to span the design space.

```
tools/
  library/                   # static/dynamic library containing all kernel instantiations of interest
                             # (with some build-level filter switches to compile specific subsets)

    include/
      cutlass/
        library/             # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)

          handle.h           # implements a host-side API for launching kernels, similar to cuBLAS
          library.h          # defines enums and structs to describe the tiled structure of operator instances          
          manifest.h         # collection of all instances

    scripts/                 # scripts to procedurally generate CUTLASS template instances

      gemm_operations.py
      library.py
      generator.py           # entry point of procedural generation scripts - invoked by cmake
      manifest.py

    src/
```

When CMake runs, it invokes the CUTLASS Instance Library generator scripts, which construct a set of
instantiations in `build/tools/library/generated/`.
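
The idea of procedurally spanning a design space can be sketched as a Cartesian product over a few axes. The sketch below is only an illustration: the axis values, the naming scheme, and the functions `enumerate_instances` and `filter_instances` are assumptions and do not reflect the actual contents of `generator.py` or the other scripts.

```python
from itertools import product

# Hypothetical axes of the design space; the real generator scripts define their own.
DATA_TYPES = ("f16", "f32")
LAYOUTS = ("nn", "nt", "tn", "tt")
TILE_SIZES = ((128, 128, 8), (64, 64, 8))


def enumerate_instances(data_types=DATA_TYPES, layouts=LAYOUTS, tile_sizes=TILE_SIZES):
    """Yield one illustrative instance name per point of the design space."""
    for dtype, layout, (m, n, k) in product(data_types, layouts, tile_sizes):
        yield f"gemm_{dtype}_{m}x{n}x{k}_{layout}"


def filter_instances(names, substring):
    """Keep only names containing `substring`, mimicking a build-level filter switch."""
    return [name for name in names if substring in name]


if __name__ == "__main__":
    names = list(enumerate_instances())
    print(len(names), "instances")
    for name in filter_instances(names, "f16_128x128x8"):
        print(name)
```

Because the number of instances grows multiplicatively with each axis, a filter of this kind is one plausible way to keep build times manageable when only a subset of kernels is of interest.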

### CUTLASS Profiler

The CUTLASS Profiler is designed to load the CUTLASS Instance Library and execute all operations contained therein.
This command-line driven application constructs an execution environment for evaluating functionality and performance. 
It is implemented in
```
tools/
  profiler/
```

and may be built as follows.
```
$ make cutlass_profiler -j
```

[Further details about the CUTLASS Profiler are described here.](/media/docs/profiler.md)

### CUTLASS Utilities

`tools/util/` defines a companion library of headers and sources that support the CUTLASS test programs, examples, and other client applications. Its structure is as follows:

```
tools/
  util/
    include/
      cutlass/
        util/                   # CUTLASS Utility companion library

          reference/            #  functional reference implementation of CUTLASS operators
                                #    (minimal consideration for performance)
            
            detail/
              *

            device/             #  device-side reference implementations of CUTLASS operators
              thread/
              kernel/
                *
            host/               #  host-side reference implementations of CUTLASS operators
              *
          *
```

[More details about CUTLASS Utilities may be found here.](/media/docs/utilities.md)
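
As an illustration of what a functional reference with minimal consideration for performance looks like, the sketch below computes the standard GEMM formulation `D = alpha * A * B + beta * C` with plain nested loops. It is written in Python with hypothetical names (`reference_gemm`); the CUTLASS reference implementations themselves are C++ templates and are not reproduced here.

```python
def reference_gemm(alpha, a, b, beta, c):
    """Return alpha * A @ B + beta * C for row-major nested lists (naive loops)."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k or len(c) != m or len(c[0]) != n:
        raise ValueError("incompatible matrix shapes")
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Accumulate the inner product of row i of A and column j of B.
            for p in range(k):
                acc += a[i][p] * b[p][j]
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(reference_gemm(2.0, a, b, 3.0, c))
```

A test program can compare the output of an optimized kernel against such a reference element by element, which is why clarity matters more than speed here.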


## Examples

To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`. 

CUTLASS SDK examples apply CUTLASS templates to implement basic computations.

```
examples/
  00_basic_gemm/             # launches a basic GEMM with single precision inputs and outputs

  01_cutlass_utilities/      # demonstrates CUTLASS Utilities for allocating and initializing tensors
  
  02_dump_reg_smem/          # debugging utilities for printing register and shared memory contents
  
  03_visualize_layout/       # utility for visualizing all layout functions in CUTLASS

  04_tile_iterator/          # example demonstrating an iterator over tiles in memory

  05_batched_gemm/           # example demonstrating CUTLASS's batched strided GEMM operation

  06_splitK_gemm/            # example demonstrating CUTLASS's Split-K parallel reduction kernel

  07_volta_tensorop_gemm/    # example demonstrating mixed precision GEMM using Volta Tensor Cores

  08_turing_tensorop_gemm/   # example demonstrating integer GEMM using Turing Tensor Cores

  10_planar_complex/         # example demonstrating planar complex GEMM kernels

  11_planar_complex_array/   # example demonstrating planar complex kernels with batch-specific problem sizes

  12_gemm_bias_relu/         # example demonstrating GEMM fused with bias and relu

  13_fused_two_gemms/        # example demonstrating two GEMMs fused in one kernel
```

## Media

The `media/` directory contains documentation, images, and performance result data that accompany the CUTLASS library and components.

## Tests

The `test/` directory contains test programs for CUTLASS. Tests are organized hierarchically, mirroring the organization of the source files.
```
test/                        # unit tests for CUTLASS Template Library
  unit/
    arch/
    core/
    gemm/
      device/
      kernel/
      thread/
      threadblock/
      warp/
    reduction/
      kernel/
      thread/
    transform/
      threadblock/
      *
```
Tests can be built and run at the top-level scope by invoking `make test_unit`, or by building
and explicitly executing an individual target, e.g. `cutlass_test_unit_gemm_device`.

Tests are configured with GTest filter strings so that they run only on architectures where they
are expected to pass. Thus, no tests should fail. The actual number of tests run may vary over
time as more are added.
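
The following Python sketch shows one way such a filter string could be assembled. It relies on the documented `--gtest_filter` syntax (positive patterns, then `-`, then negative patterns, each separated by `:`). The `SM<arch>_` naming convention, the architecture list, and both functions are assumptions made for illustration and are not taken from the CUTLASS build scripts.

```python
from fnmatch import fnmatchcase

# Hypothetical convention: each test suite name starts with "SM<arch>_".
KNOWN_ARCHS = (50, 60, 61, 70, 75, 80)


def build_gtest_filter(device_arch, known_archs=KNOWN_ARCHS):
    """Return a --gtest_filter value excluding suites that need a newer architecture."""
    excluded = [f"SM{arch}_*" for arch in known_archs if arch > device_arch]
    if not excluded:
        return "*"
    return "*-" + ":".join(excluded)


def is_selected(test_name, gtest_filter):
    """Approximate GTest filter matching with fnmatch (wildcards * and ?)."""
    positive, _, negative = gtest_filter.partition("-")
    pos_patterns = positive.split(":") if positive else ["*"]
    neg_patterns = negative.split(":") if negative else []
    matches_pos = any(fnmatchcase(test_name, p) for p in pos_patterns)
    matches_neg = any(fnmatchcase(test_name, p) for p in neg_patterns)
    return matches_pos and not matches_neg


if __name__ == "__main__":
    gtest_filter = build_gtest_filter(70)
    print(f"--gtest_filter={gtest_filter}")
    print(is_selected("SM70_Device_Gemm.basic", gtest_filter))
    print(is_selected("SM80_Device_Gemm.basic", gtest_filter))
```

With a filter like this, tests that target unsupported hardware are skipped rather than reported as failures, which matches the expectation that no tests should fail.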

# Copyright

Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```
  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```