Spaces:

qbhf2
/

GarmentCode

Sleeping

App Files Files Community

GarmentCode / NvidiaWarp-GarmentCode /warp /native /cutlass /media /docs /profiler.md

qbhf2

added NvidiaWarp and GarmentCode repos

66c9c8a 11 months ago

preview code

raw

history blame contribute delete

26.5 kB

	![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Profiler")

	[README](/README.md#documentation) > CUTLASS Profiler

	# CUTLASS Profiler

	The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
	defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse Gemm,
	Conv2d, and Conv3d kernel.

	The CUTLASS Profiler may be compiled with:
	```bash
	$ make cutlass_profiler -j
	```

	To limit compilation time, only one tile size (typically 128x128) is instantiated for each data type,
	math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an
	empty `build/` directory.
	```bash
	$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
	...
	$ make cutlass_profiler -j
	```
	Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing size of the compiled
	binary and avoiding linker limitations on some platforms.

	The CUTLASS Profiler sources are stored in
	```bash
	tools/
	profiler/
	```

	The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
	```bash
	CUTLASS Performance Tool
	usage:

	cutlass_profiler [options]

	--help

	--mode=<string> Cutlass profiler execution mode.
	--mode=profile regular verification and profiling (default)
	--mode=dry_run no kernels are launched or workspaces allocated
	--mode=enumerate lists all operation kind and operations
	--mode=trace executes a single device-side computation with
	no other kernel launches

	--device-info Prints information on all GPUs present in the system

	--operation=<operation_kind> CUTLASS operation to profile.

	--kernels=<string_list> Filter operations by kernel names. For example, call all kernels with
	("s1688" and "nt") or ("s844" and "tn" and "align8") in their
	operation name using --kernels="s1688nt, s884tn*align8"

	--ignore-kernels=<string_list> Excludes kernels whose names match anything in this list.

	Device:
	--device=<int> CUDA Device ID

	--compute-capability=<int> Override the compute capability.

	--llc-capacity=<capacity in KiB> Capacity of last-level cache in kilobytes. If this is non-zero,
	profiling phases cycle through different input tensors to induce
	capacity misses in the L2.


	Initialization:
	--initialization=<bool> Enables initialization (default: true). If false, device memory is
	not initialized after allocation.

	--initialization-provider=<provider> Selects initialization provider {host, device}. (default: '')

	--dist=<distribution> Data distribution of input tensors {uniform*, gaussian, identity, sequential}
	--dist=uniform,min:<double>,max:<double>,scale:<integer>
	--dist=gaussian,mean:<double>,stddev:<double>,scale:<integer>
	--dist=sequential,start:<double>,delta:<double>,scale:<integer>
	--dist=identity

	--seed=<int> Random number generator seed. Used to enforce deterministic
	initialization.


	Library:
	--library-algo-mode=<mode> Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN.
	mode={default*,matching,best}

	--library-algos=<range-list> If --algorithm-mode=best, permits specifying a selection of algorithms.


	Profiling:
	--workspace-count=<workspace count> Number of discrete workspaces maintained to avoid cache-resident
	If zero (default), the amount is chosen for each workload based on
	capacity of the last-level cache.

	--profiling-iterations=<iterations> Number of iterations to profile each kernel. If zero, kernels
	are launched up to the profiling duration.

	--warmup-iterations=<iterations> Number of iterations to execute each kernel prior to profiling.

	--sleep-duration=<duration> Number of ms to sleep between profiling periods (ms).

	--profiling-enabled=<bool> If true, profiling is actually conducted.

	Verification:
	--verification-enabled=<bool> Whether to perform verification checks.

	--epsilon=<error> Error threshold. Setting to zero (default) requires
	bit-level equivalence.

	--nonzero-floor=<floor> Results whose absolute value is less than this quantity
	are treated as zero for comparisons.

	--save-workspace=<string> Specifies when to save the GEMM inputs and results to the filesystem.
	--save-workspace=never never save workspace (default)
	--save-workspace=incorrect save workspace for incorrect results
	--save-workspace=always always save workspace

	--verification-providers=<providers> List of providers used to verify result. (default: '*')
	Gemm verification-providers {cublas*}
	Conv2d verification-providers {cudnn, device, host}


	Report:
	--append=<bool> If true, result is appended to possibly existing file. Otherwise,
	any existing file is overwritten.

	--output=<path> Path to output file for machine readable results. Operation kind and '.csv' is appended.

	--junit-output=<path> Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended.

	--report-not-run=<bool> If true, reports the status of all kernels including those that
	do not satisfy the given arguments.

	--tags=<column:tag,...> Inserts leading columns in output table and uniform values for each
	column. Useful for generating pivot tables.

	--verbose=<bool> Prints human-readable text to stdout. If false, nothing is written to stdout.


	About:
	--version CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00


	Operations:

	gemm General matrix-matrix product. D = alpha * AB + beta C
	spgemm Structured sparse GEMM. D = alpha * AB + beta C
	conv2d Conv2d operation. Output(Tensor4D) = alpha * Input(Tensor4D) * Filter(Tensor4D) + beta * Input(Tensor4D)
	conv3d Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)


	For details about a particular function, specify the function name with --help.

	Example:

	$ cutlass_profiler --operation=Gemm --help

	$ cutlass_profiler --operation=Conv3d --help

	$ cutlass_profiler --operation=Conv2d --help

	```

	# GEMM

	The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.

	The CUTLASS Profiler can be built with cuBLAS enabled to use as a reference implementation. If CMake detects
	the cuBLASS library available in the system, it is included as a dependency. This may be explicitly overridden
	with CMake flag `CUTLASS_ENABLE_CUBLAS`.

	## GEMM Arguments

	The complete set of arguments available to each operation may be viewed by specifying the operation name
	in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.

	```bash
	$ ./tools/profiler/cutlass_profiler --operation=gemm --help

	GEMM

	[enum] --Gemm_kind Variant of GEMM (e.g. gemm, batched, ...)
	[int] --m,--problem-size::m M dimension of the GEMM problem space
	[int] --n,--problem-size::n N dimension of the GEMM problem space
	[int] --k,--problem-size::k K dimension of the GEMM problem space
	[tensor] --A Tensor storing the A operand
	[tensor] --B Tensor storing the B operand
	[tensor] --C Tensor storing the C operand
	[scalar] --alpha,--epilogue::alpha Epilogue scalar alpha
	[scalar] --beta,--epilogue::beta Epilogue scalar beta
	[int] --split_k_slices Number of partitions of K dimension
	[int] --batch_count Number of GEMMs computed in one batch
	[enum] --op_class,--opcode-class Class of math instruction (SIMT or TensorOp).
	[enum] --accum,--accumulator-type Math instruction accumulator data type.
	[int] --cta_m,--threadblock-shape::m Threadblock shape in the M dimension.
	[int] --cta_n,--threadblock-shape::n Threadblock shape in the N dimension.
	[int] --cta_k,--threadblock-shape::k Threadblock shape in the K dimension.
	[int] --stages,--threadblock-stages Number of stages of threadblock-scoped matrix multiply.
	[int] --warps_m,--warp-count::m Number of warps within threadblock along the M dimension.
	[int] --warps_n,--warp-count::n Number of warps within threadblock along the N dimension.
	[int] --warps_k,--warp-count::k Number of warps within threadblock along the K dimension.
	[int] --inst_m,--instruction-shape::m Math instruction shape in the M dimension.
	[int] --inst_n,--instruction-shape::n Math instruction shape in the N dimension.
	[int] --inst_k,--instruction-shape::k Math instruction shape in the K dimension.
	[int] --min_cc,--minimum-compute-capability Minimum device compute capability.
	[int] --max_cc,--maximum-compute-capability Maximum device compute capability.

	Examples:

	Profile a particular problem size:
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=128

	Schmoo over problem size and beta:
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024:4096:256 --n=1024:4096:256 --k=128:8192:128 --beta=0,1,2

	Schmoo over accumulator types:
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --accumulator-type=f16,f32

	Run when A is f16 with column-major and B is any datatype with row-major
	(For column major, use column, col, or n. For row major use, row or t):

	$ ./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:column --B=*:row

	Using various input value distribution:
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=uniform,min:0,max:3
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=sequential,start:0,delta:1

	Run a kernel with cta tile size of 256x128x32 and save workspace if results are incorrect
	(note that --cta-tile::k=32 is default cta-tile size):
	$ ./tools/profiler/cutlass_profiler --operation=Gemm --cta_m=256 --cta_n=128 --cta_k=32 --save-workspace=incorrect

	Test your changes to gemm kernels with a quick functional test and save results in functional-test.csv:
	$ ./tools/profiler/cutlass_profiler --operation=Gemm \
	--m=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
	--n=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
	--k=8,16,32,64,128,256,288,384,504,512,520 \
	--beta=0,1,2 --profiling-iterations=1 \
	--output=functional-test.csv
	```

	## Example CUDA Core GEMM Operation

	Example command line for profiling SGEMM kernels is as follows:
	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096



	=============================
	Problem ID: 1

	Provider: CUTLASS
	OperationKind: gemm
	Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1

	Status: Success
	Verification: ON
	Disposition: Passed

	cuBLAS: Passed

	Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
	--batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
	--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

	Bytes: 180355072 bytes
	FLOPs: 115992428544 flops

	Runtime: 6.73655 ms
	Memory: 24.934 GiB/s

	Math: 17218.4 GFLOP/s
	```

	Note, the arguments which appear in the output may be used as command line parameters for subsequent invocations.


	## Example Tensor Core GEMM Operations

	To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
	```bash
	$ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192



	=============================
	Problem ID: 1

	Provider: CUTLASS
	OperationKind: gemm
	Operation: cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8

	Status: Success
	Verification: ON
	Disposition: Passed

	cuBLAS: Passed

	Arguments: --m=3456 --n=4096 --k=8192 --A=f16:column --B=f16:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
	--batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 --cta_k=32 --stages=3 --warps_m=4 \
	--warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

	Bytes: 180355072 bytes
	FLOPs: 231956545536 flops

	Runtime: 0.98647 ms
	Memory: 170.272 GiB/s

	Math: 235138 GFLOP/s
	```

	## Covering the problem space

	All arguments may have single values or comma-delimited set of values. Integers may also be specified
	as an inclusive range with the following syntax `start:end:increment` or simply `start:end`.

	For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
	of 8 elements.

	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
	```

	## Output

	By default, runtime and computed GFLOP/s are reported for each operation and problem size. Additionally,
	a table of comma separated values are reported at the end of the execution. This may be output to a file
	with the `--output=<filename.csv>` command line option as shown:

	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
	--m=3456 --n=4096 --k=8:4096:8 --output=report.csv
	```

	To faclitate generation of pivot tables and charts, additional columns may be prepended with the
	`--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.

	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
	--m=3456 --n=4096 --k=8:4096:8 --output=report.csv \
	--tags=cutlass:2.2,date:2020-06-08
	```

	# Convolution

	The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards
	operator variants.

	The CUTLASS Profiler can be built with cuDNN enabled to use as a reference implementation. If CMake detects
	the cuDNN library available in the system, it is included as a dependency. This may be explicitly overridden
	with CMake flag `CUTLASS_ENABLE_CUDNN`.

	```bash
	$ cmake .. -DCUTLASS_LIBRARY_OPERATIONS=conv2d -DCUTLASS_ENABLE_CUDNN=OFF
	...
	$ make -j16 cutlass_profiler
	```


	## Convolution Arguments

	```bash
	$ ./tools/profiler/cutlass_profiler --help --operation=Conv2d

	Conv2d

	[enum] --conv_kind Convolutional operator (fprop, dgrad, wgrad)
	[int] --n,--input_n Input N dimension of the Conv2d problem space
	[int] --h,--input_h Input H dimension of the Conv2d problem space
	[int] --w,--input_w Input W dimension of the Conv2d problem space
	[int] --c,--input_c Input C dimension of the Conv2d problem space
	[int] --k,--filter_k Filter K dimension of the Conv2d problem space
	[int] --r,--filter_r Filter R dimension of the Conv2d problem space
	[int] --s,--filter_s Filter S dimension of the Conv2d problem space
	[int] --p,--output_p Output P dimension of the Conv2d problem space
	[int] --q,--output_q Output Q dimension of the Conv2d problem space
	[int] --pad_h Padding in H direction
	[int] --pad_w Padding in W direction
	[int] --stride_h Stride in H direction
	[int] --stride_w Stride in W direction
	[int] --dilation_h Dilation in H direction
	[int] --dilation_w Dilation in W direction
	[tensor] --Activation Tensor storing the Activation operand
	[tensor] --Filter Tensor storing the Filter operand
	[tensor] --Output Tensor storing the Output operand
	[enum] --conv_mode Convolution filter mode (conv, cross)
	[enum] --iterator_algorithm,--iterator_algo Convolution iterator algorithm (analytic, optimized)
	[scalar] --alpha,--epilogue::alpha Epilogue scalar alpha
	[scalar] --beta,--epilogue::beta Epilogue scalar beta
	[enum] --split_k_mode,--split-k-mode SplitK mode for serial or parallel reduction (serial, parallel)
	[int] --split_k_slices,--split-k-slices Number of partitions of K dimension
	[enum] --eq_gemm_provider,--eq-gemm-provider Enable profiling equivalent gemm by the following providers (cutlass)
	[enum] --op_class,--opcode-class Class of math instruction (simt, tensorop, wmmatensorop, wmma)
	[enum] --accum,--accumulator-type Math instruction accumulator data type
	[int] --cta_m,--threadblock-shape::m Threadblock shape in the M dimension
	[int] --cta_n,--threadblock-shape::n Threadblock shape in the N dimension
	[int] --cta_k,--threadblock-shape::k Threadblock shape in the K dimension
	[int] --stages,--threadblock-stages Number of stages of threadblock-scoped matrix multiply
	[int] --warps_m,--warp-count::m Number of warps within threadblock along the M dimension
	[int] --warps_n,--warp-count::n Number of warps within threadblock along the N dimension
	[int] --warps_k,--warp-count::k Number of warps within threadblock along the K dimension
	[int] --inst_m,--instruction-shape::m Math instruction shape in the M dimension
	[int] --inst_n,--instruction-shape::n Math instruction shape in the N dimension
	[int] --inst_k,--instruction-shape::k Math instruction shape in the K dimension
	[int] --min_cc,--minimum-compute-capability Minimum device compute capability
	[int] --max_cc,--maximum-compute-capability Maximum device compute capability

	Examples:

	Profile a particular convolution (specify all the convolution parameters):

	$ cutlass_profiler --operation=Conv2d --Activation=f16:nhwc \
	--Filter=f16:nhwc --Output=f16 --accumulator-type=f32 \
	--n=32 --h=14 --w=14 --c=8 --k=64 --r=3 --s=3 \
	--pad_h=1 --pad_w=1 \
	--stride::h=1 --stride::w=1 --dilation::h=1 --dilation::w=1

	```

	## Example CUDA Core Convolution Operation

	Example command line for profiling forward propagation convolution kernels on CUDA cores is as follows:
	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3


	=============================
	Problem ID: 1

	Provider: CUTLASS
	OperationKind: conv2d
	Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc

	Status: Success
	Verification: ON
	Disposition: Passed

	reference_device: Passed

	Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
	--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
	--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
	--eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
	--warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

	Bytes: 2055798784 bytes
	FLOPs: 118482796544 flops

	Runtime: 8.13237 ms
	Memory: 235.431 GiB/s

	Math: 14569.3 GFLOP/s

	```

	## Example Tensor Core Convolution Operation

	Example command line for profiling forward propagation convolution kernels runing on Tensor Cores is as follows:
	```bash
	$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3



	=============================
	Problem ID: 1

	Provider: CUTLASS
	OperationKind: conv2d
	Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_64x4_nhwc

	Status: Success
	Verification: ON
	Disposition: Passed

	reference_device: Passed

	Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
	--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
	--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
	--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --stages=4 \
	--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

	Bytes: 1130659840 bytes
	FLOPs: 118482796544 flops

	Runtime: 0.945071 ms
	Memory: 1114.21 GiB/s

	Math: 125369 GFLOP/s


	```

	# Copyright

	Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
	SPDX-License-Identifier: BSD-3-Clause

	```
	Redistribution and use in source and binary forms, with or without
	modification, are permitted provided that the following conditions are met:

	1. Redistributions of source code must retain the above copyright notice, this
	list of conditions and the following disclaimer.

	2. Redistributions in binary form must reproduce the above copyright notice,
	this list of conditions and the following disclaimer in the documentation
	and/or other materials provided with the distribution.

	3. Neither the name of the copyright holder nor the names of its
	contributors may be used to endorse or promote products derived from
	this software without specific prior written permission.

	THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
	AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
	IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
	DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
	FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
	DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
	SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
	CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
	OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
	OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
	```