# Benchmark System

## Overview

The benchmark system measures the forward and backward performance of custom CUDA kernels against naive PyTorch implementations. It times kernels with Triton's `do_bench` and runs a correctness check before measuring performance.

## Directory Structure

```
benchmarks/
├── run_cases.py              # CLI entry point
├── common/
│   ├── bench_framework.py    # Benchmark utilities (built on Triton perf_report)
│   └── diff_engine.py        # Correctness-check engine (DiffCase ABC)
├── cases/                    # Benchmark case implementations
│   ├── rms.py                # RMSNorm
│   ├── add_rms.py            # Fused Add + RMSNorm
│   ├── poly.py               # PolyNorm
│   ├── mul_poly.py           # Fused Mul + PolyNorm
│   └── grouped_mul_poly.py   # Grouped MoE Fused Mul + PolyNorm
├── benchmark.yaml            # Kubeflow benchmark job config
├── test.yaml                 # Kubeflow test job config
├── plots/                    # Generated plots
└── results/                  # Timestamped benchmark results
```

## Usage

```bash
python benchmarks/run_cases.py --case <CASE> [OPTIONS]
```

### Arguments

| Argument | Default | Choices | Description |
|----------|---------|---------|-------------|
| `--case` | (required) | `rms`, `add_rms`, `poly`, `mul_poly`, `grouped_mul_poly` | Benchmark case to run |
| `--dtype` | `bf16` | `fp16`, `bf16`, `fp32`, `all` | Data type |
| `--save-path` | `./configs/` | path | Output directory for results |
| `--plot` | false | - | Plot generation mode |
| `--profile` | false | - | Export Chrome trace profiles |
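
For readers who want to script around the CLI, the argument table can be mirrored with `argparse`. This is a minimal sketch reconstructed from the table only; the names `CASES`, `DTYPES`, and `build_parser` are hypothetical, and the actual parser in `run_cases.py` may differ.

```python
import argparse

# Choices copied from the argument table; the constant names are hypothetical.
CASES = ["rms", "add_rms", "poly", "mul_poly", "grouped_mul_poly"]
DTYPES = ["fp16", "bf16", "fp32", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented arguments (assumed layout)."""
    parser = argparse.ArgumentParser(description="Run one benchmark case")
    parser.add_argument("--case", required=True, choices=CASES)
    parser.add_argument("--dtype", default="bf16", choices=DTYPES)
    parser.add_argument("--save-path", default="./configs/")
    parser.add_argument("--plot", action="store_true")
    parser.add_argument("--profile", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--case", "rms", "--plot", "--save-path", "./plots"])
```

`argparse` exposes `--save-path` as `args.save_path`, and the two boolean flags default to `False` when omitted.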

### Examples

```bash
# Default bf16 benchmark
python benchmarks/run_cases.py --case grouped_mul_poly

# All dtypes + profiling
python benchmarks/run_cases.py --case mul_poly --dtype all --profile --save-path ./results

# Generate plots only
python benchmarks/run_cases.py --case rms --plot --save-path ./plots
```

## Benchmark Cases

Each case implements the `DiffCase` ABC and compares a naive (PyTorch reference) implementation against the CUDA kernel.

| Case | Naive | CUDA | Inputs |
|------|-------|------|--------|
| `rms` | `torch.nn.RMSNorm` | `activation.layers.RMSNorm` | x, weight, eps |
| `add_rms` | custom `FusedAddRMSNorm` | `activation.layers.FusedAddRMSNorm` | x, residual, weight, eps |
| `poly` | custom `PolyNorm` (combination of x^3, x^2, x) | `activation.layers.PolyNorm` | x, weight(3), bias(1), eps |
| `mul_poly` | custom `FusedMulPolyNorm` | `activation.layers.FusedMulPolyNorm` | x, mul, weight(3), bias, eps |
| `grouped_mul_poly` | `fused_mul_grouped_poly_norm_ref` | `fused_mul_grouped_poly_norm` | x, mul, weight(num_experts, 3), bias, offsets |

`grouped_mul_poly` additionally measures the `compiled` (naive implementation under `torch.compile`) and `compiled_cuda` (CUDA implementation under `torch.compile`) providers.

## Execution Flow

1. **Correctness check** - runs `calculate_diff()` on three configs
   - `(bs=2, sl=128, hidden=4096)`
   - `(bs=8, sl=4096, hidden=1280)`
   - `(bs=1, sl=32768, hidden=1280)`
   - compares both forward and backward results with `atol=1e-2, rtol=1e-2`
2. **Benchmark run** - measures forward/backward performance for each dtype
3. **Save results** - writes CSV files (and, optionally, plots/traces)
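
To make the tolerance in step 1 concrete: `torch.testing.assert_close` accepts a value when `|actual - expected| <= atol + rtol * |expected|`. The sketch below applies that criterion to plain Python floats so it runs without a GPU; the helper names `within_tolerance` and `all_within_tolerance` are hypothetical and are not part of the benchmark code.

```python
def within_tolerance(actual, expected, atol=1e-2, rtol=1e-2):
    """Element-wise closeness criterion used by torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_within_tolerance(actual_values, expected_values, atol=1e-2, rtol=1e-2):
    """Check two equally long sequences of floats against the criterion."""
    if len(actual_values) != len(expected_values):
        raise ValueError("sequences must have the same length")
    return all(
        within_tolerance(a, e, atol, rtol)
        for a, e in zip(actual_values, expected_values)
    )


# Near 1.0 the allowed error is 0.01 + 0.01 * 1.0 = 0.02.
small_error_ok = within_tolerance(1.015, 1.0)
large_error_ok = within_tolerance(1.03, 1.0)
# Near 100.0 the relative term dominates: 0.01 + 0.01 * 100.0 = 1.01.
large_magnitude_ok = within_tolerance(100.5, 100.0)
```

Because the relative term scales with the magnitude of the expected value, large activations tolerate larger absolute errors than values near zero.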

## Configuration Ranges

**Standard cases** (rms, add_rms, poly, mul_poly):
- Batch sizes: 1, 2, 4, 8
- Sequence lengths: 1024, 2048, 4096, 8192
- Hidden dims: 2048, 4096

**Grouped case** (grouped_mul_poly):
- Total tokens: 1024 to 65536 (bs × sl)
- Hidden dim: 1280 (fixed)
- Experts: 48 per rank

In `--plot` mode, `bs` is fixed at 1 and only `seq_len` is swept.
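
The standard sweep can be pictured as a Cartesian product of the three ranges listed above. The following is a sketch only: the function `standard_configs` and the dictionary keys are hypothetical, and it assumes that plot mode still iterates over both hidden dims, which the description here does not state.

```python
from itertools import product

# Ranges copied from the "Standard cases" list.
BATCH_SIZES = [1, 2, 4, 8]
SEQ_LENS = [1024, 2048, 4096, 8192]
HIDDEN_DIMS = [2048, 4096]


def standard_configs(plot_mode=False):
    """Enumerate (dim, batch_size, seq_len) combinations for the standard cases."""
    # Plot mode pins the batch size to 1 and sweeps the sequence length.
    batch_sizes = [1] if plot_mode else BATCH_SIZES
    return [
        {"dim": dim, "batch_size": bs, "seq_len": sl}
        for dim, bs, sl in product(HIDDEN_DIMS, batch_sizes, SEQ_LENS)
    ]


full_grid = standard_configs()
plot_grid = standard_configs(plot_mode=True)
```

Under these assumptions the full grid has 2 × 4 × 4 = 32 configurations and the plot grid has 2 × 1 × 4 = 8.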

## Output

### CSV

Results are saved in the `{save_path}/{case}/{dtype}/` directory:

- `{case}-{dtype}-fwd-perf.csv` - forward results
- `{case}-{dtype}-bwd-perf.csv` - backward results

Columns: `dim`, `batch_size`, `seq_len`, `Naive (us)`, `Compiled (us)`, `Cuda (us)`, `SpeedUp (us)`
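
When post-processing results, the CSV location can be derived from the layout above. A minimal sketch using `pathlib`; the helper name `perf_csv_path` is hypothetical.

```python
from pathlib import Path


def perf_csv_path(save_path, case, dtype, direction):
    """Return the expected CSV path for one case, dtype, and pass direction."""
    if direction not in ("fwd", "bwd"):
        raise ValueError("direction must be 'fwd' or 'bwd'")
    return Path(save_path) / case / dtype / f"{case}-{dtype}-{direction}-perf.csv"


fwd_csv = perf_csv_path("./results", "rms", "bf16", "fwd")
bwd_csv = perf_csv_path("./results", "rms", "bf16", "bwd")
```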

### Chrome Trace (`--profile`)

Traces are saved as JSON in the `{save_path}/{case}/{dtype}/traces/` directory. Load them in `chrome://tracing` to analyze the GPU timeline.

File name pattern: `trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json`
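
When collecting many traces, it can help to recover the direction, provider, and token count from each file name. The regular expression below is a sketch derived from the documented pattern; `TRACE_RE` and `parse_trace_name` are hypothetical names.

```python
import re

# Mirrors trace_{fwd|bwd}_{naive|compiled|cuda|compiled_cuda}_N{total_tokens}.json
TRACE_RE = re.compile(
    r"^trace_(?P<direction>fwd|bwd)"
    r"_(?P<provider>compiled_cuda|compiled|naive|cuda)"
    r"_N(?P<total_tokens>\d+)\.json$"
)


def parse_trace_name(name):
    """Split a trace file name into its parts, or return None if it does not match."""
    match = TRACE_RE.match(name)
    if match is None:
        return None
    return {
        "direction": match.group("direction"),
        "provider": match.group("provider"),
        "total_tokens": int(match.group("total_tokens")),
    }


parsed = parse_trace_name("trace_fwd_compiled_cuda_N4096.json")
```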

### Plot (`--plot`)

Generates speedup comparison plots. The overall speedup is aggregated with the geometric mean.
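
The geometric mean is the usual choice for averaging ratios, because it treats a 2× speedup and a 0.5× slowdown as cancelling out. The sketch below assumes per-config speedup is computed as baseline time divided by kernel time; the function name and the timings are made-up placeholders, not measured results.

```python
import math


def geometric_mean_speedup(baseline_us, kernel_us):
    """Aggregate per-config speedups (baseline / kernel) with the geometric mean."""
    if len(baseline_us) != len(kernel_us) or not baseline_us:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [b / k for b, k in zip(baseline_us, kernel_us)]
    # exp(mean(log r)) equals the n-th root of the product of the ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Placeholder timings: speedups of 2x and 8x.
overall = geometric_mean_speedup([4.0, 8.0], [2.0, 1.0])
```

For these placeholder values the geometric mean is 4×, whereas the arithmetic mean would report 5×.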

## Framework Internals

### bench_framework.py

Four factory functions built on Triton's `perf_report`/`Benchmark`:

- `make_fwd_benchmark_for_case()` - forward benchmark (CSV)
- `make_bwd_benchmark_for_case()` - backward benchmark (CSV)
- `make_fwd_benchmark_plot_for_case()` - forward plot
- `make_bwd_benchmark_plot_for_case()` - backward plot

Timing is measured with `triton.testing.do_bench()`, and the millisecond results are converted to microseconds (`time_unit_scale=1000`).
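
The unit conversion can be sketched as a thin wrapper around `do_bench`, which returns a time in milliseconds. The wrapper `bench_us` and its `bench_fn` parameter are hypothetical; the injectable timer exists only so the sketch can run on a machine without Triton or a GPU.

```python
def bench_us(fn, bench_fn=None, time_unit_scale=1000):
    """Time `fn` and return microseconds, assuming the timer reports milliseconds."""
    if bench_fn is None:
        # Imported lazily: this path needs Triton and a CUDA device.
        from triton.testing import do_bench

        bench_fn = do_bench
    return bench_fn(fn) * time_unit_scale


# Stand-in timer that pretends every call takes 0.25 ms (not a real measurement).
fake_timer = lambda fn: 0.25
elapsed_us = bench_us(lambda: None, bench_fn=fake_timer)
```

On a GPU machine, calling `bench_us(lambda: module(x))` without `bench_fn` would route through `triton.testing.do_bench`.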

### diff_engine.py

The `DiffCase` ABC interface:

- `build_inputs(bs, sl, dim)` - creates the input tensors
- `make_naive()` / `make_cuda()` - create the implementations
- `forward(module, inputs)` - runs the forward pass
- `grad_inputs(inputs)` - returns the tensors whose gradients are checked

`calculate_diff()` compares the forward output and backward gradients of the naive and CUDA implementations with `torch.testing.assert_close()`.
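
The shape of such an interface can be sketched with Python's `abc` module. Everything here is illustrative: `DiffCaseSketch` and `DoubleCase` are hypothetical, the method signatures follow only the list above (the real ones may take extra arguments such as dtype or eps), and plain lists stand in for tensors so the example runs without PyTorch.

```python
from abc import ABC, abstractmethod


class DiffCaseSketch(ABC):
    """Hypothetical stand-in for the DiffCase interface described above."""

    @abstractmethod
    def build_inputs(self, bs, sl, dim): ...

    @abstractmethod
    def make_naive(self): ...

    @abstractmethod
    def make_cuda(self): ...

    @abstractmethod
    def forward(self, module, inputs): ...

    @abstractmethod
    def grad_inputs(self, inputs): ...


class DoubleCase(DiffCaseSketch):
    """Toy case: both implementations double every element."""

    def build_inputs(self, bs, sl, dim):
        return {"x": [1.0] * (bs * sl * dim)}

    def make_naive(self):
        return lambda x: [2.0 * v for v in x]

    def make_cuda(self):
        # Stands in for the custom kernel; computes the same result differently.
        return lambda x: [v + v for v in x]

    def forward(self, module, inputs):
        return module(inputs["x"])

    def grad_inputs(self, inputs):
        return [inputs["x"]]


case = DoubleCase()
inputs = case.build_inputs(bs=1, sl=2, dim=3)
naive_out = case.forward(case.make_naive(), inputs)
cuda_out = case.forward(case.make_cuda(), inputs)
```

A checker in the style of `calculate_diff()` would then compare `naive_out` with `cuda_out`, and do the same for the gradients of whatever `grad_inputs` returns.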

## Kubeflow Integration

`benchmark.yaml` runs the benchmarks on a cluster. The job:

- installs triton, matplotlib, and pandas
- builds the C++ extension (`setup.py`)
- warms up the GPU (100 matmul iterations)
- saves results to `benchmarks/results/{YY_MM_DD_HH_MM}/`
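
The timestamped directory name can be reproduced with `strftime`. This sketch assumes `YY` means a two-digit year and that the job uses local time; the helper `results_dir` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def results_dir(now=None, root="benchmarks/results"):
    """Build a results directory path in the YY_MM_DD_HH_MM layout."""
    now = now or datetime.now()
    return Path(root) / now.strftime("%y_%m_%d_%H_%M")


# A fixed timestamp keeps the example deterministic.
example_dir = results_dir(datetime(2025, 3, 7, 9, 5))
```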