python profile.py -f -j 1
Type             Time (%)  Time (ms)  Calls       Avg       Min       Max  Name
GPU activities:     24.41     2.6361     51  51.688us     992ns  2.5829ms  [CUDA memcpy DtoH]
GPU activities:     21.86     2.3607     12  196.73us  159.14us  242.76us  void spatialDepthwiseConvolutionUpdateOutput<f...
GPU activities:     10.65     1.1503     12  95.859us  94.273us  97.985us  void kernelPointwiseApply2<TensorTakeOp<float,...
GPU activities:      7.01    0.75732     20  37.866us  1.6960us  98.817us  void kernelPointwiseApply3<TensorAddOp<long>, ...
GPU activities:       6.6    0.71246     28  25.444us     896ns  686.67us  [CUDA memcpy HtoD]
GPU activities:      5.92    0.63933     24  26.638us  25.536us  28.513us  void indexSelectSmallIndex<float, unsigned int...
GPU activities:      5.74     0.6202     18  34.455us  20.032us  56.449us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:      3.13    0.33792     12  28.160us  15.264us  66.112us  void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities:      2.66    0.28752      6  47.920us  44.064us  48.961us  void kernelPointwiseApply2<TensorDivConstantOp...
GPU activities:       2.6    0.28106     18  15.614us  11.424us  27.105us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:      1.78    0.19191      6  31.984us  28.800us  35.137us  void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities:      1.66    0.17917      6  29.861us  28.641us  31.137us  void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities:      1.52    0.16375      3  54.581us  53.984us  55.360us  void indexSelectSmallIndex<float, unsigned int...
GPU activities:      1.47    0.15872      2  79.361us  68.353us  90.369us  void kernelReduceAllPass1<float, unsigned int,...
GPU activities:      0.47   0.051264     12  4.2720us  4.1280us  4.3840us  void kernelReduceAll<long, unsigned int, long,...
GPU activities:      0.44   0.047393      3  15.797us  15.681us  15.872us  void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities:      0.43   0.046304     12  3.8580us  3.8080us  3.9680us  void kernelReduceAll<long, unsigned int, long,...
GPU activities:      0.29    0.03104      3  10.346us  10.272us  10.432us  void kernelPointwiseApply2<Tensor_neg_Float_Op...
GPU activities:      0.28   0.029984     24  1.2490us     960ns  1.5040us  void kernelPointwiseApply2<TensorMulConstantOp...
GPU activities:      0.22   0.023744     20  1.1870us     832ns  2.0160us  void thrust::cuda_cub::core::_kernel_agent<thr...
GPU activities:      0.21   0.023072     24     961ns     768ns  1.1520us  void kernelPointwiseApply1<TensorFillOp<long>,...
GPU activities:      0.21   0.022561      3  7.5200us     800ns  20.609us  void kernelPointwiseApply1<TensorFillOp<float>...
GPU activities:      0.18   0.019488     12  1.6240us  1.5680us  1.7920us  void kernelPointwiseApply2<TensorRemainderOp<l...
GPU activities:      0.17   0.018048     12  1.5040us  1.4720us  1.6000us  void kernelPointwiseApply2<CopyOp<float, float...
GPU activities:      0.05   0.005376      2  2.6880us  2.1760us  3.2000us  void kernelReduceAllPass2<float, ReduceAdd<flo...
GPU activities:      0.02   0.002688      3     896ns     864ns     928ns  [CUDA memset]
GPU activities:      0.02   0.002304      2  1.1520us     960ns  1.3440us  void kernelPointwiseApply1<TensorDivConstantOp...
Total:                100    10.7997    350
Total (no mem):     68.99    7.45117    271
API calls:          97.25    4322.41     12  360.20ms  15.643us  4.31414s  cudaMalloc
API calls:           2.12     94.343      1  94.343ms  94.343ms  94.343ms  cudaDeviceSynchronize
API calls:           0.15     6.5229     79  82.568us  4.6270us  4.2232ms  cudaMemcpyAsync
API calls:           0.13     5.8286    185  31.506us     206ns  1.3966ms  cuDeviceGetAttribute
API calls:           0.13     5.7194      2  2.8597ms  2.6896ms  3.0298ms  cudaGetDeviceProperties
API calls:            0.1      4.346    268  16.216us  5.7400us  106.02us  cudaLaunch
API calls:           0.04     1.7618   3130     562ns     255ns  35.853us  cudaGetDevice
API calls:           0.02      1.098     79  13.898us  1.6890us  209.45us  cudaStreamSynchronize
API calls:           0.02    0.82276      2  411.38us  217.99us  604.77us  cuDeviceTotalMem
API calls:           0.01     0.6179    991     623ns     279ns  6.7660us  cudaSetDevice
API calls:           0.01    0.54713      2  273.57us  260.93us  286.20us  cuDeviceGetName
API calls:           0.01    0.37292   1377     270ns     110ns  1.8720us  cudaSetupArgument
API calls:              0    0.11578     20  5.7890us  2.8300us  13.631us  cudaFuncGetAttributes
API calls:              0   0.092499    268     345ns     148ns  1.1260us  cudaConfigureCall
API calls:              0   0.089808    258     348ns     122ns  6.9550us  cudaGetLastError
API calls:              0   0.079046      3  26.348us  5.0360us  64.092us  cudaMemsetAsync
API calls:              0   0.018829     20     941ns     449ns  1.5140us  cudaDeviceGetAttribute
API calls:              0   0.010992     40     274ns     105ns     635ns  cudaPeekAtLastError
API calls:              0    0.00889     14     635ns     253ns  1.5200us  cudaGetDeviceCount
API calls:              0   0.002493      4     623ns     202ns  1.6440us  cuDeviceGetCount
API calls:              0   0.001507      3     502ns     216ns     948ns  cuDeviceGet
API calls:              0   0.001017      1  1.0170us  1.0170us  1.0170us  cuInit
API calls:              0   0.000657      1     657ns     657ns     657ns  cuDriverGetVersion
Total:              99.99    4444.81   6760