# Benchmark System
## Overview
๋ฒค์น˜๋งˆํฌ ์‹œ์Šคํ…œ์€ ์ปค์Šคํ…€ CUDA ์ปค๋„์˜ forward/backward ์„ฑ๋Šฅ์„ naive PyTorch ๊ตฌํ˜„ ๋Œ€๋น„ ์ธก์ •ํ•œ๋‹ค. Triton์˜ `do_bench`๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, ์ •ํ™•๋„ ๊ฒ€์ฆ(correctness check) ํ›„ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•œ๋‹ค.
## Directory Structure
```
benchmarks/
├── run_cases.py             # CLI entry point
├── common/
│   ├── bench_framework.py   # benchmark utilities (built on Triton perf_report)
│   └── diff_engine.py       # correctness-check engine (DiffCase ABC)
├── cases/                   # benchmark case implementations
│   ├── rms.py               # RMSNorm
│   ├── add_rms.py           # Fused Add + RMSNorm
│   ├── poly.py              # PolyNorm
│   ├── mul_poly.py          # Fused Mul + PolyNorm
│   └── grouped_mul_poly.py  # Grouped MoE Fused Mul + PolyNorm
├── benchmark.yaml           # Kubeflow benchmark job config
├── test.yaml                # Kubeflow test job config
├── plots/                   # generated plots
└── results/                 # timestamped benchmark results
```
## Usage
```bash
python benchmarks/run_cases.py --case <CASE> [OPTIONS]
```
### Arguments
| Argument | Default | Choices | Description |
|----------|---------|---------|-------------|
| `--case` | (required) | `rms`, `add_rms`, `poly`, `mul_poly`, `grouped_mul_poly` | Benchmark case |
| `--dtype` | `bf16` | `fp16`, `bf16`, `fp32`, `all` | Data type |
| `--save-path` | `./configs/` | path | Output directory for results |
| `--plot` | false | - | Plot generation mode |
| `--profile` | false | - | Export Chrome trace profiles |
### Examples
```bash
# default bf16 benchmark
python benchmarks/run_cases.py --case grouped_mul_poly
# all dtypes + profiling
python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results
# generate plots only
python benchmarks/run_cases.py --case rms --plot --save-path ./plots
```
## Benchmark Cases
๊ฐ ์ผ€์ด์Šค๋Š” `DiffCase` ABC๋ฅผ ๊ตฌํ˜„ํ•˜๋ฉฐ, naive(PyTorch ์ฐธ์กฐ)์™€ CUDA ์ปค๋„์„ ๋น„๊ตํ•œ๋‹ค.
| Case | Naive | CUDA | Inputs |
|------|-------|------|--------|
| `rms` | `torch.nn.RMSNorm` | `activation.layers.RMSNorm` | x, weight, eps |
| `add_rms` | custom `FusedAddRMSNorm` | `activation.layers.FusedAddRMSNorm` | x, residual, weight, eps |
| `poly` | custom `PolyNorm` (combination of x^3, x^2, x) | `activation.layers.PolyNorm` | x, weight(3), bias(1), eps |
| `mul_poly` | custom `FusedMulPolyNorm` | `activation.layers.FusedMulPolyNorm` | x, mul, weight(3), bias, eps |
| `grouped_mul_poly` | `fused_mul_grouped_poly_norm_ref` | `fused_mul_grouped_poly_norm` | x, mul, weight(num_experts, 3), bias, offsets |
`grouped_mul_poly` additionally measures the `compiled` (torch.compile'd naive) and `compiled_cuda` (torch.compile'd CUDA) providers.
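For intuition, here is a minimal sketch of what the naive `poly` reference could look like, assuming a formulation in which each power of x is RMS-normalized and combined with the three weights and the bias; the actual implementation in `cases/poly.py` may differ:

```python
import torch

def poly_norm_ref(x, weight, bias, eps=1e-6):
    # Hypothetical reference: RMS-normalize each power of x and combine
    # with the three learned weights plus a scalar bias.
    def rms_norm(t):
        return t * torch.rsqrt(t.pow(2).mean(dim=-1, keepdim=True) + eps)

    return (
        weight[0] * rms_norm(x ** 3)
        + weight[1] * rms_norm(x ** 2)
        + weight[2] * rms_norm(x)
        + bias
    )
```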
## Execution Flow
1. **์ •ํ™•๋„ ๊ฒ€์ฆ** - 3๊ฐœ config์— ๋Œ€ํ•ด `calculate_diff()` ์‹คํ–‰
- `(bs=2, sl=128, hidden=4096)`
- `(bs=8, sl=4096, hidden=1280)`
- `(bs=1, sl=32768, hidden=1280)`
- forward and backward are both compared with `atol=1e-2, rtol=1e-2`
2. **Benchmark run** - measure forward/backward performance per dtype
3. **Save results** - CSV files (plus plots/traces if requested)
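A rough sketch of the correctness step, assuming a `calculate_diff(case, bs, sl, hidden, atol, rtol)` signature (the actual signature in `run_cases.py` may differ):

```python
# Hypothetical driver for the correctness step; the config tuples are the
# ones listed above, the calculate_diff signature is assumed.
DIFF_CONFIGS = [
    (2, 128, 4096),    # (bs, sl, hidden)
    (8, 4096, 1280),
    (1, 32768, 1280),
]

for bs, sl, hidden in DIFF_CONFIGS:
    calculate_diff(case, bs, sl, hidden, atol=1e-2, rtol=1e-2)
```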
## Configuration Ranges
**Standard cases** (rms, add_rms, poly, mul_poly):
- Batch sizes: 1, 2, 4, 8
- Sequence lengths: 1024, 2048, 4096, 8192
- Hidden dims: 2048, 4096
**Grouped case** (grouped_mul_poly):
- Total tokens: 1024 ~ 65536 (bs x sl)
- Hidden dim: 1280 (fixed)
- Experts: 48 per rank
In `--plot` mode, `bs` is fixed to 1 and only `seq_len` is swept.
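The standard sweep amounts to a simple Cartesian product over the ranges above; a sketch (variable names are illustrative, not taken from the code):

```python
from itertools import product

# Illustrative enumeration of the standard sweep described above.
BATCH_SIZES = [1, 2, 4, 8]
SEQ_LENS = [1024, 2048, 4096, 8192]
HIDDEN_DIMS = [2048, 4096]

standard_configs = [
    {"bs": bs, "seq_len": sl, "dim": dim}
    for bs, sl, dim in product(BATCH_SIZES, SEQ_LENS, HIDDEN_DIMS)
]  # 4 * 4 * 2 = 32 configurations per dtype
```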
## Output
### CSV
Saved to the `{save_path}/{case}/{dtype}/` directory:
- `{case}-{dtype}-fwd-perf.csv` - forward results
- `{case}-{dtype}-bwd-perf.csv` - backward results
Columns: `dim`, `batch_size`, `seq_len`, `Naive (us)`, `Compiled (us)`, `Cuda (us)`, `SpeedUp (us)`
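The CSVs are plain pandas-readable files; for example (the path assumes `--save-path ./results` and the `rms` case):

```python
import pandas as pd

# Hypothetical path; the layout follows {save_path}/{case}/{dtype}/ as above.
df = pd.read_csv("./results/rms/bf16/rms-bf16-fwd-perf.csv")
print(df[["dim", "batch_size", "seq_len", "SpeedUp (us)"]])
```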
### Chrome Trace (`--profile`)
Saved as JSON files in the `{save_path}/{case}/{dtype}/traces/` directory. Load them in `chrome://tracing` to inspect the GPU timeline.
File name pattern: `trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json`
### Plot (`--plot`)
Generates speedup comparison plots. The overall speedup is aggregated as a geometric mean.
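The aggregation corresponds to something like the following (illustrative helper, not from the code):

```python
import math

def geomean_speedup(naive_us, cuda_us):
    # Per-config speedup = naive time / CUDA time; aggregate with a
    # geometric mean so no single config dominates the summary.
    ratios = [n / c for n, c in zip(naive_us, cuda_us)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```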
## Framework Internals
### bench_framework.py
Triton์˜ `perf_report`/`Benchmark`๋ฅผ ์‚ฌ์šฉํ•˜๋Š” 4๊ฐœ ํŒฉํ† ๋ฆฌ ํ•จ์ˆ˜:
- `make_fwd_benchmark_for_case()` - forward ๋ฒค์น˜๋งˆํฌ (CSV)
- `make_bwd_benchmark_for_case()` - backward ๋ฒค์น˜๋งˆํฌ (CSV)
- `make_fwd_benchmark_plot_for_case()` - forward ํ”Œ๋กฏ
- `make_bwd_benchmark_plot_for_case()` - backward ํ”Œ๋กฏ
ํƒ€์ด๋ฐ์€ `triton.testing.do_bench()`๋กœ ์ธก์ •ํ•˜๋ฉฐ, ms ๋‹จ์œ„๋ฅผ us๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค (`time_unit_scale=1000`).
### diff_engine.py
The `DiffCase` ABC interface:
- `build_inputs(bs, sl, dim)` - create the input tensors
- `make_naive()` / `make_cuda()` - build the implementations
- `forward(module, inputs)` - run the forward pass
- `grad_inputs(inputs)` - return the tensors whose gradients are compared
`calculate_diff()` compares the forward outputs and backward gradients of the naive and CUDA implementations with `torch.testing.assert_close()`.
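Based on the methods listed above, the interface looks roughly like this (a sketch; the real signatures in `diff_engine.py` may differ):

```python
from abc import ABC, abstractmethod

class DiffCase(ABC):
    @abstractmethod
    def build_inputs(self, bs, sl, dim):
        """Create the input tensors for one (batch, seq_len, hidden) config."""

    @abstractmethod
    def make_naive(self):
        """Return the naive PyTorch reference implementation."""

    @abstractmethod
    def make_cuda(self):
        """Return the custom CUDA kernel implementation."""

    @abstractmethod
    def forward(self, module, inputs):
        """Run a forward pass of `module` on `inputs`."""

    @abstractmethod
    def grad_inputs(self, inputs):
        """Return the tensors whose gradients should be compared."""
```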
## Kubeflow Integration
Benchmarks can be run on the cluster via `benchmark.yaml`:
- installs triton, matplotlib, and pandas
- builds the C++ extension (`setup.py`)
- GPU warmup (100 matmul iterations; see the sketch below)
- saves results to `benchmarks/results/{YY_MM_DD_HH_MM}/`
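The warmup step amounts to something like this sketch (matrix size and dtype are assumptions):

```python
import torch

def gpu_warmup(iters=100, size=4096, device="cuda"):
    # Run repeated matmuls so GPU clocks stabilize before timing starts.
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
```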