# GPU Mode: Triton Kernel Optimization

Evolve high-performance GPU kernels using SkyDiscover. Each benchmark provides a reference PyTorch implementation and scores submissions by runtime: faster is better. Pure PyTorch submissions are accepted; Triton is not required.

## Benchmarks

| Benchmark | Operation | Tolerance | GPU |
|-----------|-----------|-----------|-----|
| [`vecadd`](vecadd/) | Float16 element-wise `C = A + B` | rtol/atol=1e-3 | H100 |
| [`grayscale`](grayscale/) | RGB → Grayscale (`0.2989R + 0.5870G + 0.1140B`) | rtol/atol=1e-4 | H100 |
| [`trimul`](trimul/) | Triangle multiplicative update (AlphaFold3/Chai/Protenix) | rtol/atol=0.02 | H100 |
| [`mla_decode`](mla_decode/) | Multi-head latent attention decode (DeepSeek-V2/V3) | rtol/atol=0.06 (bfloat16) | **H200** |
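
The Tolerance column gives one value that serves as both `rtol` and `atol`. As a rough illustration of what such a check means, here is a minimal pure-Python sketch. It assumes the evaluator follows the `torch.allclose` convention `|actual - expected| <= atol + rtol * |expected|`, which this README does not confirm, and `within_tolerance` is a hypothetical helper, not part of the benchmark code.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Return True if every element pair satisfies the allclose-style bound.

    Assumption: the bound is |actual - expected| <= atol + rtol * |expected|,
    the convention used by torch.allclose. The real evaluator may differ.
    """
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# Illustrative values checked with the trimul tolerance (rtol/atol = 0.02).
expected = [1.0, -2.0, 100.0]
close = [1.01, -2.03, 101.5]   # each within 0.02 + 0.02 * |expected|
far = [1.0, -2.0, 103.0]       # last element is off by 3.0, bound is 2.02

print(within_tolerance(close, expected, rtol=0.02, atol=0.02))
print(within_tolerance(far, expected, rtol=0.02, atol=0.02))
```

Because the relative term scales with the reference value, large outputs get more absolute slack than small ones; the absolute term keeps the check meaningful near zero.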

## Quick Start

```bash
# Run on local GPU
uv run skydiscover-run \
  benchmarks/gpu_mode/trimul/initial_program.py \
  benchmarks/gpu_mode/trimul/evaluator.py \
  -c benchmarks/gpu_mode/trimul/config.yaml \
  -s [your_algorithm] \
  -i 50

# Run on Modal cloud GPU (set GPU type per benchmark)
GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
  uv run skydiscover-run \
  benchmarks/gpu_mode/trimul/initial_program.py \
  benchmarks/gpu_mode/trimul/evaluator.py \
  -c benchmarks/gpu_mode/trimul/config.yaml \
  -s [your_algorithm] \
  -i 50
```

> **Note:** `mla_decode` requires `GPUMODE_MODAL_GPU=H200`; an H100 (80 GB) does not have enough VRAM.

## Writing a Submission

Your program must define a `custom_kernel(data)` function. The `data` argument is problem-specific (see each benchmark's `reference.py` for the exact type). Return the computed result.

```python
# EVOLVE-BLOCK-START
import torch
import triton
import triton.language as tl

def custom_kernel(data):
    # data is a problem-specific input (tensor, dataclass, etc.)
    # return the computed result
    ...
# EVOLVE-BLOCK-END
```
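
Since pure PyTorch submissions are accepted, a submission can be very small. The sketch below shows what one might look like for `vecadd`. It assumes `data` is a tuple `(A, B)` of same-shape tensors; that shape is a guess for illustration only, so check `vecadd/reference.py` for the actual input type before relying on it.

```python
# EVOLVE-BLOCK-START
import torch

def custom_kernel(data):
    # Assumption: data is a tuple (A, B) of same-shape tensors.
    # The real input type is defined in the benchmark's reference.py.
    a, b = data
    # Element-wise add; the result keeps the inputs' dtype and device.
    return a + b
# EVOLVE-BLOCK-END
```

A baseline like this is a convenient starting point: it should pass the correctness stage as long as the input assumption holds, and later iterations can replace the body with a Triton kernel.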

## Scoring

All benchmarks except `vecadd` use the same formula:

```
combined_score = SCORE_SCALE / geom_mean_us
```

`geom_mean_us` is the geometric mean of kernel runtimes in microseconds across all benchmark cases. Higher score = faster kernel. `SCORE_SCALE` is `3000.0` for all current benchmarks.
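
To make the formula concrete, here is a small standard-library sketch. The runtimes are made-up illustration values, and `geom_mean` and `combined_score` are hypothetical helper names rather than functions from `shared_eval.py`.

```python
import math

SCORE_SCALE = 3000.0  # value stated above for the current benchmarks


def geom_mean(values):
    """Geometric mean computed via logs, which avoids overflow on long lists."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("runtimes must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combined_score(runtimes_us, score_scale=SCORE_SCALE):
    """Score is inversely proportional to the geometric-mean runtime."""
    return score_scale / geom_mean(runtimes_us)


# Hypothetical runtimes (microseconds) for three benchmark cases.
runtimes_us = [100.0, 400.0, 1600.0]
score = combined_score(runtimes_us)  # geometric mean is 400 us, so 3000 / 400
print(round(score, 3))
```

Because the mean is geometric, halving the runtime of any single case raises the score by the same factor regardless of how long that case takes in absolute terms.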

`vecadd` uses a different combined formula (`0.3 * correctness + speedup`); see its README for details.

## Evaluation Pipeline

The shared evaluator (`shared_eval.py`) handles both local and Modal paths:

1. **Correctness:** runs all `TEST_CASES` from `reference.py` and checks each output against the reference within tolerance
2. **Warmup:** runs one benchmark case briefly to trigger Triton JIT compilation
3. **Benchmark:** times `BENCHMARK_CASES` using CUDA events, repeating until error < 0.1% or the time budget is exhausted
4. **Score:** geometric mean of benchmark runtimes → `SCORE_SCALE / geom_mean_us`
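
The stopping rule in step 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in `shared_eval.py`: it takes "error" to mean the standard error of the mean divided by the mean, and it uses `time.perf_counter()` as a stand-in for CUDA events so that it runs without a GPU. The function name `benchmark` and its parameters are hypothetical.

```python
import math
import statistics
import time


def benchmark(run_once, rel_error=0.001, budget_s=1.0, min_runs=3, max_runs=10_000):
    """Time run_once() repeatedly; return (mean_us, number_of_runs).

    Assumptions: "error" is std_error_of_mean / mean, and wall-clock time
    stands in for the CUDA events the real evaluator uses.
    """
    durations_us = []
    deadline = time.perf_counter() + budget_s
    while len(durations_us) < max_runs:
        start = time.perf_counter()
        run_once()
        durations_us.append((time.perf_counter() - start) * 1e6)

        if len(durations_us) >= min_runs:
            mean = statistics.fmean(durations_us)
            sem = statistics.stdev(durations_us) / math.sqrt(len(durations_us))
            if mean > 0 and sem / mean < rel_error:
                break  # measurement is stable enough
        if time.perf_counter() >= deadline:
            break  # time budget exhausted
    return statistics.fmean(durations_us), len(durations_us)


# CPU-only stand-in workload; a real run would call custom_kernel(data).
mean_us, runs = benchmark(lambda: sum(range(1000)), budget_s=0.05)
print(f"{mean_us:.1f} us over {runs} runs")
```

On a GPU, kernel launches are asynchronous, so wall-clock timing around the call is unreliable; timing there would typically record a pair of `torch.cuda.Event(enable_timing=True)` events around the call and synchronize before reading the elapsed time.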

## Directory Structure

```
gpu_mode/
├── shared_eval.py       # Shared evaluator (correctness + benchmarking logic)
├── modal_eval.py        # Modal cloud GPU runners (H100, A100, L40S, T4, H200)
├── vecadd/              # Float16 vector addition
├── grayscale/           # RGB → grayscale conversion
├── trimul/              # Triangle multiplicative update
└── mla_decode/          # MLA decode (DeepSeek attention)

# Each benchmark contains:
#   initial_program.py   - starting kernel
#   evaluator.py         - imports shared_eval, exposes evaluate()
#   reference.py         - reference kernel, test/benchmark cases, SCORE_SCALE
#   config.yaml          - search config
#   requirements.txt     - dependencies
```