# Benchmark System
## Overview
๋ฒค์น˜๋งˆํฌ ์‹œ์Šคํ…œ์€ ์ปค์Šคํ…€ CUDA ์ปค๋„์˜ forward/backward ์„ฑ๋Šฅ์„ naive PyTorch ๊ตฌํ˜„ ๋Œ€๋น„ ์ธก์ •ํ•œ๋‹ค. Triton์˜ `do_bench`๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, ์ •ํ™•๋„ ๊ฒ€์ฆ(correctness check) ํ›„ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•œ๋‹ค.
## Directory Structure
```
benchmarks/
├── run_cases.py             # CLI entry point
├── common/
│   ├── bench_framework.py   # benchmark utilities (built on Triton perf_report)
│   └── diff_engine.py       # correctness-check engine (DiffCase ABC)
├── cases/                   # benchmark case implementations
│   ├── rms.py               # RMSNorm
│   ├── add_rms.py           # Fused Add + RMSNorm
│   ├── poly.py              # PolyNorm
│   ├── mul_poly.py          # Fused Mul + PolyNorm
│   └── grouped_mul_poly.py  # Grouped MoE Fused Mul + PolyNorm
├── benchmark.yaml           # Kubeflow benchmark job config
├── test.yaml                # Kubeflow test job config
├── plots/                   # generated plots
└── results/                 # timestamped benchmark results
```
## Usage
```bash
python benchmarks/run_cases.py --case <CASE> [OPTIONS]
```
### Arguments
| Argument | Default | Choices | Description |
|----------|---------|---------|-------------|
| `--case` | (required) | `rms`, `add_rms`, `poly`, `mul_poly`, `grouped_mul_poly` | Benchmark case |
| `--dtype` | `bf16` | `fp16`, `bf16`, `fp32`, `all` | Data type |
| `--save-path` | `./configs/` | path | Output directory for results |
| `--plot` | false | - | Plot generation mode |
| `--profile` | false | - | Export Chrome trace profiles |
### Examples
```bash
# default bf16 benchmark
python benchmarks/run_cases.py --case grouped_mul_poly
# all dtypes + profiling
python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results
# generate plots only
python benchmarks/run_cases.py --case rms --plot --save-path ./plots
```
## Benchmark Cases
๊ฐ ์ผ€์ด์Šค๋Š” `DiffCase` ABC๋ฅผ ๊ตฌํ˜„ํ•˜๋ฉฐ, naive(PyTorch ์ฐธ์กฐ)์™€ CUDA ์ปค๋„์„ ๋น„๊ตํ•œ๋‹ค.
| Case | Naive | CUDA | Inputs |
|------|-------|------|--------|
| `rms` | `torch.nn.RMSNorm` | `activation.layers.RMSNorm` | x, weight, eps |
| `add_rms` | custom `FusedAddRMSNorm` | `activation.layers.FusedAddRMSNorm` | x, residual, weight, eps |
| `poly` | custom `PolyNorm` (combination of x^3, x^2, x) | `activation.layers.PolyNorm` | x, weight(3), bias(1), eps |
| `mul_poly` | custom `FusedMulPolyNorm` | `activation.layers.FusedMulPolyNorm` | x, mul, weight(3), bias, eps |
| `grouped_mul_poly` | `fused_mul_grouped_poly_norm_ref` | `fused_mul_grouped_poly_norm` | x, mul, weight(num_experts, 3), bias, offsets |
`grouped_mul_poly` additionally measures the `compiled` (torch.compile'd naive) and `compiled_cuda` (torch.compile'd CUDA) providers.
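For intuition, here is a minimal sketch of what the naive `poly` reference could look like, assuming a formulation in which each power of x is RMS-normalized and combined with the three weights and the bias; the actual implementation in `cases/poly.py` may differ:

```python
import torch

def poly_norm_ref(x, weight, bias, eps=1e-6):
    # Hypothetical reference: RMS-normalize each power of x and combine
    # with the three learned weights plus a scalar bias.
    def rms_norm(t):
        return t * torch.rsqrt(t.pow(2).mean(dim=-1, keepdim=True) + eps)

    return (
        weight[0] * rms_norm(x ** 3)
        + weight[1] * rms_norm(x ** 2)
        + weight[2] * rms_norm(x)
        + bias
    )
```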
## Execution Flow
1. **์ •ํ™•๋„ ๊ฒ€์ฆ** - 3๊ฐœ config์— ๋Œ€ํ•ด `calculate_diff()` ์‹คํ–‰
- `(bs=2, sl=128, hidden=4096)`
- `(bs=8, sl=4096, hidden=1280)`
- `(bs=1, sl=32768, hidden=1280)`
- forward and backward are both compared with `atol=1e-2, rtol=1e-2`
2. **Benchmark run** - measure forward/backward performance per dtype
3. **Save results** - CSV files (plus plots/traces if requested)
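A rough sketch of the correctness step, assuming a `calculate_diff(case, bs, sl, hidden, atol, rtol)` signature (the actual signature in `run_cases.py` may differ):

```python
# Hypothetical driver for the correctness step; the config tuples are the
# ones listed above, the calculate_diff signature is assumed.
DIFF_CONFIGS = [
    (2, 128, 4096),    # (bs, sl, hidden)
    (8, 4096, 1280),
    (1, 32768, 1280),
]

for bs, sl, hidden in DIFF_CONFIGS:
    calculate_diff(case, bs, sl, hidden, atol=1e-2, rtol=1e-2)
```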
## Configuration Ranges
**Standard cases** (rms, add_rms, poly, mul_poly):
- Batch sizes: 1, 2, 4, 8
- Sequence lengths: 1024, 2048, 4096, 8192
- Hidden dims: 2048, 4096
**Grouped case** (grouped_mul_poly):
- Total tokens: 1024 ~ 65536 (bs x sl)
- Hidden dim: 1280 (fixed)
- Experts: 48 per rank
In `--plot` mode, `bs` is fixed to 1 and only `seq_len` is swept.
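The standard sweep amounts to a simple Cartesian product over the ranges above; a sketch (variable names are illustrative, not taken from the code):

```python
from itertools import product

# Illustrative enumeration of the standard sweep described above.
BATCH_SIZES = [1, 2, 4, 8]
SEQ_LENS = [1024, 2048, 4096, 8192]
HIDDEN_DIMS = [2048, 4096]

standard_configs = [
    {"bs": bs, "seq_len": sl, "dim": dim}
    for bs, sl, dim in product(BATCH_SIZES, SEQ_LENS, HIDDEN_DIMS)
]  # 4 * 4 * 2 = 32 configurations per dtype
```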
## Output
### CSV
Saved to the `{save_path}/{case}/{dtype}/` directory:
- `{case}-{dtype}-fwd-perf.csv` - forward results
- `{case}-{dtype}-bwd-perf.csv` - backward results
Columns: `dim`, `batch_size`, `seq_len`, `Naive (us)`, `Compiled (us)`, `Cuda (us)`, `SpeedUp (us)`
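The CSVs are plain pandas-readable files; for example (the path assumes `--save-path ./results` and the `rms` case):

```python
import pandas as pd

# Hypothetical path; the layout follows {save_path}/{case}/{dtype}/ as above.
df = pd.read_csv("./results/rms/bf16/rms-bf16-fwd-perf.csv")
print(df[["dim", "batch_size", "seq_len", "SpeedUp (us)"]])
```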
### Chrome Trace (`--profile`)
Saved as JSON files in the `{save_path}/{case}/{dtype}/traces/` directory. Load them in `chrome://tracing` to inspect the GPU timeline.
File name pattern: `trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json`
### Plot (`--plot`)
Generates speedup comparison plots. The overall speedup is aggregated as a geometric mean.
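The aggregation corresponds to something like the following (illustrative helper, not from the code):

```python
import math

def geomean_speedup(naive_us, cuda_us):
    # Per-config speedup = naive time / CUDA time; aggregate with a
    # geometric mean so no single config dominates the summary.
    ratios = [n / c for n, c in zip(naive_us, cuda_us)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```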
## Framework Internals
### bench_framework.py
Triton์˜ `perf_report`/`Benchmark`๋ฅผ ์‚ฌ์šฉํ•˜๋Š” 4๊ฐœ ํŒฉํ† ๋ฆฌ ํ•จ์ˆ˜:
- `make_fwd_benchmark_for_case()` - forward ๋ฒค์น˜๋งˆํฌ (CSV)
- `make_bwd_benchmark_for_case()` - backward ๋ฒค์น˜๋งˆํฌ (CSV)
- `make_fwd_benchmark_plot_for_case()` - forward ํ”Œ๋กฏ
- `make_bwd_benchmark_plot_for_case()` - backward ํ”Œ๋กฏ
ํƒ€์ด๋ฐ์€ `triton.testing.do_bench()`๋กœ ์ธก์ •ํ•˜๋ฉฐ, ms ๋‹จ์œ„๋ฅผ us๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค (`time_unit_scale=1000`).
### diff_engine.py
The `DiffCase` ABC interface:
- `build_inputs(bs, sl, dim)` - create the input tensors
- `make_naive()` / `make_cuda()` - build the implementations
- `forward(module, inputs)` - run the forward pass
- `grad_inputs(inputs)` - return the tensors whose gradients are compared
`calculate_diff()` compares the forward outputs and backward gradients of the naive and CUDA implementations with `torch.testing.assert_close()`.
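Based on the methods listed above, the interface looks roughly like this (a sketch; the real signatures in `diff_engine.py` may differ):

```python
from abc import ABC, abstractmethod

class DiffCase(ABC):
    @abstractmethod
    def build_inputs(self, bs, sl, dim):
        """Create the input tensors for one (batch, seq_len, hidden) config."""

    @abstractmethod
    def make_naive(self):
        """Return the naive PyTorch reference implementation."""

    @abstractmethod
    def make_cuda(self):
        """Return the custom CUDA kernel implementation."""

    @abstractmethod
    def forward(self, module, inputs):
        """Run a forward pass of `module` on `inputs`."""

    @abstractmethod
    def grad_inputs(self, inputs):
        """Return the tensors whose gradients should be compared."""
```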
## Kubeflow Integration
Benchmarks can be run on the cluster via `benchmark.yaml`:
- installs triton, matplotlib, and pandas
- builds the C++ extension (`setup.py`)
- GPU warmup (100 matmul iterations; see the sketch below)
- saves results to `benchmarks/results/{YY_MM_DD_HH_MM}/`
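The warmup step amounts to something like this sketch (matrix size and dtype are assumptions):

```python
import torch

def gpu_warmup(iters=100, size=4096, device="cuda"):
    # Run repeated matmuls so GPU clocks stabilize before timing starts.
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
```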