python profile.py -f -j 1
Type Time (%) Time (ms)  Calls       Avg       Min       Max                                               Name
GPU activities:    25.49    11.801     60  196.68us  158.82us  243.43us  void spatialDepthwiseConvolutionUpdateOutput<f...
GPU activities:    13.21     6.117    255  23.988us     992ns  2.6960ms                                 [CUDA memcpy DtoH]
GPU activities:    12.44    5.7582     60  95.970us  94.145us  98.721us  void kernelPointwiseApply2<TensorTakeOp<float,...
GPU activities:     8.65    4.0074    100  40.073us  1.7280us  99.490us  void kernelPointwiseApply3<TensorAddOp<long>, ...
GPU activities:     6.95    3.2163    120  26.802us  25.440us  28.544us  void indexSelectSmallIndex<float, unsigned int...
GPU activities:     6.71    3.1092     90  34.546us  20.096us  57.025us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:      5.8    2.6862    140  19.186us     864ns  583.46us                                 [CUDA memcpy HtoD]
GPU activities:     3.66    1.6945     60  28.241us  15.296us  66.369us  void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities:     3.11    1.4384     30  47.946us  43.745us  49.376us  void kernelPointwiseApply2<TensorDivConstantOp...
GPU activities:     3.04    1.4067     90  15.630us  11.328us  27.552us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:     2.08   0.96206     30  32.068us  28.800us  35.840us  void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities:     1.93   0.89537     30  29.845us  28.608us  31.297us  void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities:     1.76   0.81451     15  54.300us  53.377us  55.361us  void indexSelectSmallIndex<float, unsigned int...
GPU activities:      1.7   0.78772     10  78.772us  67.008us  90.689us  void kernelReduceAllPass1<float, unsigned int,...
GPU activities:     0.52   0.23917     60  3.9860us  3.8410us  4.1920us  void kernelReduceAll<long, unsigned int, long,...
GPU activities:     0.51   0.23808     15  15.872us  15.616us  16.128us  void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities:     0.51   0.23428     60  3.9040us  3.8080us  4.1280us  void kernelReduceAll<long, unsigned int, long,...
GPU activities:     0.34   0.15543     15  10.361us  10.080us  10.625us  void kernelPointwiseApply2<Tensor_neg_Float_Op...
GPU activities:     0.32   0.14679    120  1.2230us     928ns  1.5050us  void kernelPointwiseApply2<TensorMulConstantOp...
GPU activities:     0.27    0.1241    100  1.2400us     800ns  1.9840us  void thrust::cuda_cub::core::_kernel_agent<thr...
GPU activities:     0.26   0.11875    120     989ns     768ns  1.2800us  void kernelPointwiseApply1<TensorFillOp<long>,...
GPU activities:     0.25   0.11347     15  7.5640us     832ns  20.865us  void kernelPointwiseApply1<TensorFillOp<float>...
GPU activities:     0.21   0.09632     60  1.6050us  1.5680us  1.6960us  void kernelPointwiseApply2<TensorRemainderOp<l...
GPU activities:      0.2  0.090593     60  1.5090us  1.3440us  2.4330us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:     0.06  0.026977     10  2.6970us  2.2080us  3.2330us  void kernelReduceAllPass2<float, ReduceAdd<flo...
GPU activities:     0.03  0.013633     15     908ns     832ns  1.0880us                                      [CUDA memset]
GPU activities:     0.03  0.011872     10  1.1870us     960ns  1.5040us  void kernelPointwiseApply1<TensorDivConstantOp...
         Total:   100.04    46.304   1750                                                                                 
Total (no mem):    81.03   37.5008   1355                                                                                 
     API calls:    95.36   4330.04     13  333.08ms  16.022us  4.32040s                                         cudaMalloc
     API calls:        3    136.03      1  136.03ms  136.03ms  136.03ms                              cudaDeviceSynchronize
     API calls:     0.51    23.082   1340  17.225us  5.7790us  136.37us                                         cudaLaunch
     API calls:     0.45    20.316    395  51.431us  5.1510us  4.4311ms                                    cudaMemcpyAsync
     API calls:     0.18    8.1313  15642     519ns     254ns  18.543us                                      cudaGetDevice
     API calls:     0.11    5.0693    185  27.401us     126ns  1.1465ms                               cuDeviceGetAttribute
     API calls:     0.11    5.0058      2  2.5029ms  2.4990ms  2.5068ms                            cudaGetDeviceProperties
     API calls:     0.11    4.9763    395  12.598us  1.6840us  223.54us                              cudaStreamSynchronize
     API calls:     0.07    3.0304   4947     612ns     279ns  15.640us                                      cudaSetDevice
     API calls:     0.06    2.5076   6885     364ns     109ns  705.44us                                  cudaSetupArgument
     API calls:     0.01    0.5482    100  5.4820us  2.8720us  15.523us                              cudaFuncGetAttributes
     API calls:     0.01   0.48326      2  241.63us  240.94us  242.32us                                    cuDeviceGetName
     API calls:     0.01   0.45813   1340     341ns     146ns  2.5300us                                  cudaConfigureCall
     API calls:     0.01   0.39718   1290     307ns     120ns  2.0540us                                   cudaGetLastError
     API calls:     0.01   0.39126     15  26.083us  3.9020us  133.31us                                    cudaMemsetAsync
     API calls:     0.01   0.29644      2  148.22us  145.40us  151.04us                                   cuDeviceTotalMem
     API calls:        0  0.091542    100     915ns     419ns  1.8570us                             cudaDeviceGetAttribute
     API calls:        0  0.053599    200     267ns     105ns     581ns                                cudaPeekAtLastError
     API calls:        0  0.004603     14     328ns     102ns  1.3770us                                 cudaGetDeviceCount
     API calls:        0  0.001982      4     495ns     140ns  1.3530us                                   cuDeviceGetCount
     API calls:        0  0.001158      3     386ns     129ns     841ns                                        cuDeviceGet
     API calls:        0  0.000678      1     678ns     678ns     678ns                                             cuInit
     API calls:        0   0.00047      1     470ns     470ns     470ns                                 cuDriverGetVersion
         Total:   100.02   4540.92  32877