python profile.py -f -j 1
Type Time (%) Time (ms) Calls Avg Min Max Name
GPU activities: 25.49 11.801 60 196.68us 158.82us 243.43us void spatialDepthwiseConvolutionUpdateOutput<f...
GPU activities: 13.21 6.117 255 23.988us 992ns 2.6960ms [CUDA memcpy DtoH]
GPU activities: 12.44 5.7582 60 95.970us 94.145us 98.721us void kernelPointwiseApply2<TensorTakeOp<float,...
GPU activities: 8.65 4.0074 100 40.073us 1.7280us 99.490us void kernelPointwiseApply3<TensorAddOp<long>, ...
GPU activities: 6.95 3.2163 120 26.802us 25.440us 28.544us void indexSelectSmallIndex<float, unsigned int...
GPU activities: 6.71 3.1092 90 34.546us 20.096us 57.025us void kernelPointwiseApply2<CopyOp<float, float...
GPU activities: 5.8 2.6862 140 19.186us 864ns 583.46us [CUDA memcpy HtoD]
GPU activities: 3.66 1.6945 60 28.241us 15.296us 66.369us void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities: 3.11 1.4384 30 47.946us 43.745us 49.376us void kernelPointwiseApply2<TensorDivConstantOp...
GPU activities: 3.04 1.4067 90 15.630us 11.328us 27.552us void kernelPointwiseApply2<CopyOp<float, float...
GPU activities: 2.08 0.96206 30 32.068us 28.800us 35.840us void kernelPointwiseApply3<TensorAddOp<float>,...
GPU activities: 1.93 0.89537 30 29.845us 28.608us 31.297us void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities: 1.76 0.81451 15 54.300us 53.377us 55.361us void indexSelectSmallIndex<float, unsigned int...
GPU activities: 1.7 0.78772 10 78.772us 67.008us 90.689us void kernelReduceAllPass1<float, unsigned int,...
GPU activities: 0.52 0.23917 60 3.9860us 3.8410us 4.1920us void kernelReduceAll<long, unsigned int, long,...
GPU activities: 0.51 0.23808 15 15.872us 15.616us 16.128us void kernelPointwiseApply3<TensorSubOp<float>,...
GPU activities: 0.51 0.23428 60 3.9040us 3.8080us 4.1280us void kernelReduceAll<long, unsigned int, long,...
GPU activities: 0.34 0.15543 15 10.361us 10.080us 10.625us void kernelPointwiseApply2<Tensor_neg_Float_Op...
GPU activities: 0.32 0.14679 120 1.2230us 928ns 1.5050us void kernelPointwiseApply2<TensorMulConstantOp...
GPU activities: 0.27 0.1241 100 1.2400us 800ns 1.9840us void thrust::cuda_cub::core::_kernel_agent<thr...
GPU activities: 0.26 0.11875 120 989ns 768ns 1.2800us void kernelPointwiseApply1<TensorFillOp<long>,...
GPU activities: 0.25 0.11347 15 7.5640us 832ns 20.865us void kernelPointwiseApply1<TensorFillOp<float>...
GPU activities: 0.21 0.09632 60 1.6050us 1.5680us 1.6960us void kernelPointwiseApply2<TensorRemainderOp<l...
GPU activities: 0.2 0.090593 60 1.5090us 1.3440us 2.4330us void kernelPointwiseApply2<CopyOp<float, float...
GPU activities: 0.06 0.026977 10 2.6970us 2.2080us 3.2330us void kernelReduceAllPass2<float, ReduceAdd<flo...
GPU activities: 0.03 0.013633 15 908ns 832ns 1.0880us [CUDA memset]
GPU activities: 0.03 0.011872 10 1.1870us 960ns 1.5040us void kernelPointwiseApply1<TensorDivConstantOp...
Total: 100.04 46.304 1750
Total (no mem): 81.03 37.5008 1355
API calls: 95.36 4330.04 13 333.08ms 16.022us 4.32040s cudaMalloc
API calls: 3 136.03 1 136.03ms 136.03ms 136.03ms cudaDeviceSynchronize
API calls: 0.51 23.082 1340 17.225us 5.7790us 136.37us cudaLaunch
API calls: 0.45 20.316 395 51.431us 5.1510us 4.4311ms cudaMemcpyAsync
API calls: 0.18 8.1313 15642 519ns 254ns 18.543us cudaGetDevice
API calls: 0.11 5.0693 185 27.401us 126ns 1.1465ms cuDeviceGetAttribute
API calls: 0.11 5.0058 2 2.5029ms 2.4990ms 2.5068ms cudaGetDeviceProperties
API calls: 0.11 4.9763 395 12.598us 1.6840us 223.54us cudaStreamSynchronize
API calls: 0.07 3.0304 4947 612ns 279ns 15.640us cudaSetDevice
API calls: 0.06 2.5076 6885 364ns 109ns 705.44us cudaSetupArgument
API calls: 0.01 0.5482 100 5.4820us 2.8720us 15.523us cudaFuncGetAttributes
API calls: 0.01 0.48326 2 241.63us 240.94us 242.32us cuDeviceGetName
API calls: 0.01 0.45813 1340 341ns 146ns 2.5300us cudaConfigureCall
API calls: 0.01 0.39718 1290 307ns 120ns 2.0540us cudaGetLastError
API calls: 0.01 0.39126 15 26.083us 3.9020us 133.31us cudaMemsetAsync
API calls: 0.01 0.29644 2 148.22us 145.40us 151.04us cuDeviceTotalMem
API calls: 0 0.091542 100 915ns 419ns 1.8570us cudaDeviceGetAttribute
API calls: 0 0.053599 200 267ns 105ns 581ns cudaPeekAtLastError
API calls: 0 0.004603 14 328ns 102ns 1.3770us cudaGetDeviceCount
API calls: 0 0.001982 4 495ns 140ns 1.3530us cuDeviceGetCount
API calls: 0 0.001158 3 386ns 129ns 841ns cuDeviceGet
API calls: 0 0.000678 1 678ns 678ns 678ns cuInit
API calls: 0 0.00047 1 470ns 470ns 470ns cuDriverGetVersion
Total: 100.02 4540.92 32877