| # Benchmark System |
|
|
| ## Overview |
|
|
| The benchmark system measures the forward/backward performance of custom CUDA kernels against naive PyTorch implementations. It times kernels with Triton's `do_bench` and runs a correctness check before measuring performance. |
|
|
| ## Directory Structure |
|
|
| ``` |
| benchmarks/ |
| ├── run_cases.py              # CLI entry point |
| ├── common/ |
| │   ├── bench_framework.py    # Benchmark utilities (based on Triton perf_report) |
| │   └── diff_engine.py        # Correctness check engine (DiffCase ABC) |
| ├── cases/                    # Benchmark case implementations |
| │   ├── rms.py                # RMSNorm |
| │   ├── add_rms.py            # Fused Add + RMSNorm |
| │   ├── poly.py               # PolyNorm |
| │   ├── mul_poly.py           # Fused Mul + PolyNorm |
| │   └── grouped_mul_poly.py   # Grouped MoE Fused Mul + PolyNorm |
| ├── benchmark.yaml            # Kubeflow benchmark job config |
| ├── test.yaml                 # Kubeflow test job config |
| ├── plots/                    # Generated plots |
| └── results/                  # Benchmark results, one directory per timestamp |
| ``` |
|
|
| ## Usage |
|
|
| ```bash |
| python benchmarks/run_cases.py --case <CASE> [OPTIONS] |
| ``` |
|
|
| ### Arguments |
|
|
| | Argument | Default | Choices | Description | |
| |----------|---------|---------|-------------| |
| | `--case` | (required) | `rms`, `add_rms`, `poly`, `mul_poly`, `grouped_mul_poly` | Benchmark case | |
| | `--dtype` | `bf16` | `fp16`, `bf16`, `fp32`, `all` | Data type | |
| | `--save-path` | `./configs/` | path | Output directory for results | |
| | `--plot` | false | - | Plot generation mode | |
| | `--profile` | false | - | Export Chrome trace profiles | |
|
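The table above maps naturally onto `argparse`. The following is a minimal sketch of how such a parser could look; it is reconstructed from the documented flags only and is not taken from `run_cases.py`, so the helper name `build_parser` and the help strings are assumptions.

```python
import argparse

# Choices copied from the Arguments table.
CASES = ["rms", "add_rms", "poly", "mul_poly", "grouped_mul_poly"]
DTYPES = ["fp16", "bf16", "fp32", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="run_cases.py")
    parser.add_argument("--case", required=True, choices=CASES, help="Benchmark case")
    parser.add_argument("--dtype", default="bf16", choices=DTYPES, help="Data type")
    parser.add_argument("--save-path", default="./configs/", help="Output directory for results")
    parser.add_argument("--plot", action="store_true", help="Plot generation mode")
    parser.add_argument("--profile", action="store_true", help="Export Chrome trace profiles")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--case", "mul_poly", "--dtype", "all", "--profile"])
print(args.case, args.dtype, args.save_path, args.plot, args.profile)
```

Boolean flags are modeled with `store_true`, which matches the `false` defaults in the table.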
|
| ### Examples |
|
|
| ```bash |
| # Default benchmark (bf16) |
| python benchmarks/run_cases.py --case grouped_mul_poly |
| |
| # All dtypes + profiling |
| python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results |
| |
| # Generate plots only |
| python benchmarks/run_cases.py --case rms --plot --save-path ./plots |
| ``` |
|
|
| ## Benchmark Cases |
|
|
| Each case implements the `DiffCase` ABC and compares a naive implementation (the PyTorch reference) against the CUDA kernel. |
|
|
| | Case | Naive | CUDA | Inputs | |
| |------|-------|------|--------| |
| | `rms` | `torch.nn.RMSNorm` | `activation.layers.RMSNorm` | x, weight, eps | |
| | `add_rms` | custom `FusedAddRMSNorm` | `activation.layers.FusedAddRMSNorm` | x, residual, weight, eps | |
| | `poly` | custom `PolyNorm` (combination of x^3, x^2, x) | `activation.layers.PolyNorm` | x, weight(3), bias(1), eps | |
| | `mul_poly` | custom `FusedMulPolyNorm` | `activation.layers.FusedMulPolyNorm` | x, mul, weight(3), bias, eps | |
| | `grouped_mul_poly` | `fused_mul_grouped_poly_norm_ref` | `fused_mul_grouped_poly_norm` | x, mul, weight(num_experts, 3), bias, offsets | |
| |
| `grouped_mul_poly` additionally measures two more providers: `compiled` (the naive implementation under `torch.compile`) and `compiled_cuda` (the CUDA implementation under `torch.compile`). |
|
|
| ## Execution Flow |
|
|
| 1. **Correctness check** - runs `calculate_diff()` on 3 configs |
|    - `(bs=2, sl=128, hidden=4096)` |
|    - `(bs=8, sl=4096, hidden=1280)` |
|    - `(bs=1, sl=32768, hidden=1280)` |
|    - both forward and backward are compared with `atol=1e-2, rtol=1e-2` |
| 2. **Benchmark run** - measures forward/backward performance for each dtype |
| 3. **Result saving** - writes CSV files (and optionally plots/traces) |
|
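To make the tolerance in step 1 concrete, here is a small sketch of the elementwise rule that `torch.testing.assert_close()` applies, `|actual - expected| <= atol + rtol * |expected|`, written in plain Python so it runs without a GPU. The helper name `is_close` and the sample values are illustrative only.

```python
def is_close(actual: float, expected: float, atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Elementwise closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# (bs, sl, hidden) configs listed in the correctness-check step.
CHECK_CONFIGS = [(2, 128, 4096), (8, 4096, 1280), (1, 32768, 1280)]

print(is_close(1.015, 1.0))    # allowed error is 0.01 + 0.01 * 1.0 = 0.02
print(is_close(1.03, 1.0))     # error 0.03 exceeds 0.02
print(is_close(100.9, 100.0))  # allowed error grows to 0.01 + 0.01 * 100 = 1.01
```

The relative term means larger activations tolerate proportionally larger absolute error, which matters for low-precision dtypes such as bf16.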
|
| ## Configuration Ranges |
|
|
| **Standard cases** (rms, add_rms, poly, mul_poly): |
| - Batch sizes: 1, 2, 4, 8 |
| - Sequence lengths: 1024, 2048, 4096, 8192 |
| - Hidden dims: 2048, 4096 |
|
|
| **Grouped case** (grouped_mul_poly): |
| - Total tokens: 1024 ~ 65536 (bs x sl) |
| - Hidden dim: 1280 (fixed) |
| - Experts: 48 per rank |
|
|
| In `--plot` mode, `bs` is fixed to 1 and only `seq_len` is swept. |
| |
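The standard sweep can be pictured as a Cartesian product of the three ranges. The sketch below enumerates it; the function name `standard_configs` and the `(dim, batch_size, seq_len)` tuple order are assumptions chosen for illustration, not the framework's actual code.

```python
from itertools import product

# Ranges copied from the "Standard cases" list.
BATCH_SIZES = [1, 2, 4, 8]
SEQ_LENS = [1024, 2048, 4096, 8192]
HIDDEN_DIMS = [2048, 4096]


def standard_configs(plot: bool = False) -> list:
    """Return (dim, batch_size, seq_len) tuples; plot mode pins batch_size to 1."""
    batch_sizes = [1] if plot else BATCH_SIZES
    return list(product(HIDDEN_DIMS, batch_sizes, SEQ_LENS))


full = standard_configs()
plot_only = standard_configs(plot=True)
print(len(full), len(plot_only))  # 2*4*4 = 32 configs, 2*1*4 = 8 in plot mode
```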
| ## Output |
| |
| ### CSV |
| |
| Results are saved to the `{save_path}/{case}/{dtype}/` directory: |
|
|
| - `{case}-{dtype}-fwd-perf.csv` - forward results |
| - `{case}-{dtype}-bwd-perf.csv` - backward results |
|
|
| Columns: `dim`, `batch_size`, `seq_len`, `Naive (us)`, `Compiled (us)`, `Cuda (us)`, `SpeedUp (us)` |
|
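For post-processing, the file layout above can be rebuilt with `pathlib` and read with the standard `csv` module. This is a sketch only: `perf_csv_path` and `load_rows` are hypothetical helper names, and the code assumes the CSV has a header row with the column names listed above.

```python
import csv
from pathlib import Path


def perf_csv_path(save_path: str, case: str, dtype: str, direction: str) -> Path:
    """Build {save_path}/{case}/{dtype}/{case}-{dtype}-{fwd|bwd}-perf.csv."""
    return Path(save_path) / case / dtype / f"{case}-{dtype}-{direction}-perf.csv"


def load_rows(path: Path) -> list[dict]:
    """Read a result CSV into a list of dicts keyed by column name (values stay strings)."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


print(perf_csv_path("./results", "rms", "bf16", "fwd").as_posix())
# results/rms/bf16/rms-bf16-fwd-perf.csv
```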
|
| ### Chrome Trace (`--profile`) |
|
|
| Traces are saved as JSON in the `{save_path}/{case}/{dtype}/traces/` directory. Load them in `chrome://tracing` to analyze the GPU timeline. |
|
|
| File name pattern: `trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json` |
|
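When collecting traces from many runs, it can help to parse the file name back into its parts. The regular expression below is a sketch derived from the documented pattern; `parse_trace_name` is a hypothetical helper and the sample file name is made up.

```python
import re

# Mirrors trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json
TRACE_RE = re.compile(
    r"^trace_(?P<direction>fwd|bwd)_"
    r"(?P<provider>compiled_cuda|compiled|naive|cuda)_"
    r"N(?P<total_tokens>\d+)\.json$"
)


def parse_trace_name(name: str) -> dict:
    """Split a trace file name into direction, provider, and total token count."""
    match = TRACE_RE.match(name)
    if match is None:
        raise ValueError(f"not a trace file name: {name!r}")
    info = match.groupdict()
    info["total_tokens"] = int(info["total_tokens"])
    return info


print(parse_trace_name("trace_fwd_compiled_cuda_N4096.json"))
# {'direction': 'fwd', 'provider': 'compiled_cuda', 'total_tokens': 4096}
```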
|
| ### Plot (`--plot`) |
|
|
| Generates speedup comparison plots. The overall speedup is aggregated with the geometric mean. |
|
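The geometric mean is the usual choice for averaging ratios, because a 2x and a 0.5x result cancel out instead of averaging to 1.25x. A minimal sketch follows, assuming speedup is defined as naive time divided by CUDA time; the input values are placeholders, not measured results.

```python
import math


def geometric_mean(speedups: list) -> float:
    """Geometric mean of positive ratios, computed in log space for stability."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Placeholder ratios: the geometric mean is about 4.0, the arithmetic mean would be 5.0.
print(geometric_mean([2.0, 8.0]))
# A 2x gain and a 2x loss average out to roughly 1.0.
print(geometric_mean([2.0, 0.5]))
```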
|
| ## Framework Internals |
|
|
| ### bench_framework.py |
| |
| Four factory functions built on Triton's `perf_report`/`Benchmark`: |
|
|
| - `make_fwd_benchmark_for_case()` - forward benchmark (CSV) |
| - `make_bwd_benchmark_for_case()` - backward benchmark (CSV) |
| - `make_fwd_benchmark_plot_for_case()` - forward plot |
| - `make_bwd_benchmark_plot_for_case()` - backward plot |
|
|
| Timing is measured with `triton.testing.do_bench()`, which reports milliseconds; the result is converted to microseconds (`time_unit_scale=1000`). |
|
|
| ### diff_engine.py |
| |
| The `DiffCase` ABC interface: |
| |
| - `build_inputs(bs, sl, dim)` - creates the input tensors |
| - `make_naive()` / `make_cuda()` - creates the two implementations |
| - `forward(module, inputs)` - runs the forward pass |
| - `grad_inputs(inputs)` - returns the tensors whose gradients are compared |
|
|
| `calculate_diff()` compares the forward outputs and backward gradients of the naive and CUDA implementations with `torch.testing.assert_close()`. |
|
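The shape of the interface can be sketched with Python's `abc` module. The method names below come from the list above, but the exact signatures in `diff_engine.py` may differ (for example, extra dtype or eps arguments), and `ToyScaleCase` is a made-up case that uses plain Python lists instead of tensors so the sketch runs without PyTorch or a GPU.

```python
from abc import ABC, abstractmethod


class DiffCase(ABC):
    """Sketch of the interface; real signatures may carry more arguments."""

    @abstractmethod
    def build_inputs(self, bs, sl, dim): ...

    @abstractmethod
    def make_naive(self): ...

    @abstractmethod
    def make_cuda(self): ...

    @abstractmethod
    def forward(self, module, inputs): ...

    @abstractmethod
    def grad_inputs(self, inputs): ...


class ToyScaleCase(DiffCase):
    """Hypothetical case: multiplies every element by a scalar."""

    def build_inputs(self, bs, sl, dim):
        return {"x": [1.0] * (bs * sl * dim), "scale": 2.0}

    def make_naive(self):
        return lambda x, scale: [v * scale for v in x]

    def make_cuda(self):
        # Stand-in for a custom kernel; here it is the same math in plain Python.
        return lambda x, scale: [scale * v for v in x]

    def forward(self, module, inputs):
        return module(inputs["x"], inputs["scale"])

    def grad_inputs(self, inputs):
        return [inputs["x"]]


case = ToyScaleCase()
inputs = case.build_inputs(bs=1, sl=2, dim=3)
out_naive = case.forward(case.make_naive(), inputs)
out_cuda = case.forward(case.make_cuda(), inputs)
print(out_naive == out_cuda, len(out_naive))  # True 6
```

Because every method is abstract, a case that forgets one of them fails at instantiation time rather than in the middle of a benchmark run.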
|
| ## Kubeflow Integration |
|
|
| `benchmark.yaml` runs the benchmarks on the cluster. The job performs the following steps: |
|
|
| - Installs triton, matplotlib, and pandas |
| - Builds the C++ extension (`setup.py`) |
| - Warms up the GPU (100 matmul iterations) |
| - Saves results to `benchmarks/results/{YY_MM_DD_HH_MM}/` |
|
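The timestamped result directory can be reproduced locally with `strftime`. This sketch assumes `YY` means a two-digit year and that all fields are zero-padded; the helper name `results_dir` is hypothetical, and the job itself may build the path differently (for example, in shell).

```python
from datetime import datetime
from pathlib import Path


def results_dir(now: datetime, root: str = "benchmarks/results") -> Path:
    """Build benchmarks/results/{YY_MM_DD_HH_MM} for the given time."""
    return Path(root) / now.strftime("%y_%m_%d_%H_%M")


print(results_dir(datetime(2025, 1, 2, 3, 4)).as_posix())
# benchmarks/results/25_01_02_03_04
```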
|