python profile.py -f -j 1

Type             Time (%)  Time (ms)  Calls  Avg       Min       Max       Name
GPU activities:  24.41     2.6361     51     51.688us  992ns     2.5829ms  [CUDA memcpy DtoH]
GPU activities:  21.86     2.3607     12     196.73us  159.14us  242.76us  void spatialDepthwiseConvolutionUpdateOutput, ...
GPU activities:  6.6       0.71246    28     25.444us  896ns     686.67us  [CUDA memcpy HtoD]
GPU activities:  5.92      0.63933    24     26.638us  25.536us  28.513us  void indexSelectSmallIndex,...
GPU activities:  2.66      0.28752    6      47.920us  44.064us  48.961us  void kernelPointwiseApply2,...
GPU activities:  1.66      0.17917    6      29.861us  28.641us  31.137us  void kernelPointwiseApply3,...
GPU activities:  1.52      0.16375    3      54.581us  53.984us  55.360us  void indexSelectSmallIndex,...
GPU activities:  0.43      0.046304   12     3.8580us  3.8080us  3.9680us  void kernelReduceAll,...
GPU activities:  0.21      0.022561   3      7.5200us  800ns     20.609us  void kernelPointwiseApply1...
GPU activities:  0.18      0.019488   12     1.6240us  1.5680us  1.7920us  void kernelPointwiseApply2