Commit af83196 by JustinTX · verified · 1 Parent(s): 7f611c5

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .github/workflows/ci.yml +51 -0
  2. benchmarks/README.md +383 -0
  3. benchmarks/__init__.py +1 -0
  4. benchmarks/gpu_mode/README.md +92 -0
  5. benchmarks/gpu_mode/grayscale/README.md +34 -0
  6. benchmarks/gpu_mode/grayscale/config.yaml +176 -0
  7. benchmarks/gpu_mode/grayscale/evaluator.py +13 -0
  8. benchmarks/gpu_mode/grayscale/initial_program.py +57 -0
  9. benchmarks/gpu_mode/grayscale/reference.py +103 -0
  10. benchmarks/gpu_mode/grayscale/requirements.txt +2 -0
  11. benchmarks/gpu_mode/mla_decode/README.md +36 -0
  12. benchmarks/gpu_mode/mla_decode/evaluator.py +13 -0
  13. benchmarks/gpu_mode/modal_eval.py +259 -0
  14. benchmarks/gpu_mode/shared_eval.py +421 -0
  15. benchmarks/gpu_mode/trimul/README.md +34 -0
  16. benchmarks/gpu_mode/trimul/config.yaml +219 -0
  17. benchmarks/gpu_mode/trimul/evaluator.py +13 -0
  18. benchmarks/gpu_mode/trimul/reference.py +286 -0
  19. benchmarks/gpu_mode/trimul/requirements.txt +2 -0
  20. benchmarks/gpu_mode/vecadd/README.md +34 -0
  21. benchmarks/gpu_mode/vecadd/config.yaml +50 -0
  22. benchmarks/gpu_mode/vecadd/evaluator.py +13 -0
  23. benchmarks/gpu_mode/vecadd/initial_program.py +39 -0
  24. benchmarks/gpu_mode/vecadd/reference.py +96 -0
  25. benchmarks/gpu_mode/vecadd/requirements.txt +2 -0
  26. benchmarks/kernelbench/README.md +211 -0
  27. benchmarks/kernelbench/__init__.py +0 -0
  28. benchmarks/kernelbench/config.yaml +86 -0
  29. benchmarks/kernelbench/evaluator/Dockerfile +25 -0
  30. benchmarks/kernelbench/evaluator/evaluate.sh +6 -0
  31. benchmarks/kernelbench/evaluator/evaluator.py +227 -0
  32. benchmarks/kernelbench/evaluator/requirements.txt +2 -0
  33. benchmarks/kernelbench/evaluator/wrapper.py +98 -0
  34. benchmarks/kernelbench/requirements.txt +4 -0
  35. benchmarks/kernelbench/resolver.py +136 -0
  36. benchmarks/math/first_autocorr_ineq/evaluator/evaluate.sh +7 -0
  37. benchmarks/math/second_autocorr_ineq/evaluator/evaluate.sh +7 -0
  38. benchmarks/math/second_autocorr_ineq/evaluator/evaluator.py +95 -0
  39. benchmarks/math/second_autocorr_ineq/evaluator/requirements.txt +15 -0
  40. benchmarks/math/second_autocorr_ineq/evaluator/wrapper.py +98 -0
  41. docs/.gitignore +3 -0
  42. docs/README.md +13 -0
  43. docs/app/api/search/route.ts +6 -0
  44. docs/app/docs/[[...slug]]/page.tsx +43 -0
  45. docs/app/docs/layout.tsx +11 -0
  46. docs/app/global.css +3 -0
  47. docs/app/layout.tsx +17 -0
  48. docs/app/page.tsx +22 -0
  49. docs/content/docs/getting-started/configuration.mdx +117 -0
  50. docs/content/docs/getting-started/index.mdx +11 -0
.github/workflows/ci.yml ADDED
@@ -0,0 +1,51 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main]
+   pull_request:
+
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: true
+
+ jobs:
+   lint:
+     runs-on: ubuntu-latest
+     timeout-minutes: 5
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v4
+         with:
+           python-version: "3.10"
+           enable-cache: true
+       - run: uv sync --frozen --extra dev
+       - run: uv run black --check skydiscover/
+       - run: uv run isort --check skydiscover/
+
+   test:
+     runs-on: ubuntu-latest
+     timeout-minutes: 10
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v4
+         with:
+           python-version: "3.10"
+           enable-cache: true
+       - run: uv sync --frozen --extra dev
+       - name: Smoke test — package imports cleanly
+         run: uv run python -c "from skydiscover import Runner, run_discovery, discover_solution, __version__; print(f'skydiscover {__version__} OK')"
+       - name: Run tests
+         run: uv run pytest tests/ -v
+
+   build:
+     runs-on: ubuntu-latest
+     timeout-minutes: 5
+     needs: [lint, test]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: astral-sh/setup-uv@v4
+         with:
+           python-version: "3.10"
+           enable-cache: true
+       - run: uv build
benchmarks/README.md ADDED
@@ -0,0 +1,383 @@
1
+ # Benchmarks
2
+
3
+ ~200 optimization tasks across math, systems, algorithms, and reasoning.
4
+
5
+ ## Quick Start
6
+
7
+ Install dependencies first:
8
+
9
+ ```bash
10
+ # Base
11
+ uv sync
12
+
13
+ # Choose extras based on what you run
14
+ uv sync --extra external # openevolve/gepa/shinkaevolve
15
+ uv sync --extra math # math benchmarks
16
+ uv sync --extra adrs # ADRS benchmarks
17
+ uv sync --extra frontier-cs # frontier-cs-eval benchmark
18
+ uv sync --extra prompt-optimization # HotPotQA prompt benchmark
19
+ ```
20
+
21
+ If a benchmark directory has `requirements.txt`, also run:
22
+
23
+ ```bash
24
+ uv pip install -r benchmarks/<task>/requirements.txt
25
+ ```
26
+
27
+ Then run:
28
+
29
+ ```bash
30
+ export OPENAI_API_KEY="..."
31
+
32
+ # Containerized benchmark (recommended — evaluator runs in Docker)
33
+ uv run skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
34
+ benchmarks/math/circle_packing_rect/evaluator \
35
+ -c benchmarks/math/circle_packing_rect/config.yaml \
36
+ -s best_of_n -i 50
37
+
38
+ # Plain Python evaluator (runs on host)
39
+ uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
40
+ benchmarks/math/circle_packing/evaluator.py \
41
+ -c benchmarks/math/circle_packing/config.yaml \
42
+ -s best_of_n -i 100
43
+ ```
44
+
45
+ ## Tasks
46
+
47
+ | Benchmark | Domain | Tasks | What it tests |
48
+ |-----------|--------|-------|---------------|
49
+ | [`math/`](math/) | Math | 14 | Circle packing, Erdos problems, autocorrelation inequalities, geometric optimization |
50
+ | [`ADRS/`](ADRS/) | Systems | 5 | Cloud scheduling, MoE load balancing, model placement, column reordering, transaction scheduling |
51
+ | [`gpu_mode/`](gpu_mode/) | GPU | 4 | Triton kernel optimization (vecadd, grayscale, trimul, MLA decode) |
52
+ | [`frontier-cs-eval/`](frontier-cs-eval/) | Algorithms | 172 | Competitive programming (Frontier-CS benchmark, Docker judge) |
53
+ | [`arc_benchmark/`](arc_benchmark/) | Reasoning | — | ARC-AGI visual reasoning tasks |
54
+ | [`ale_bench/`](ale_bench/) | Algorithms | 10 | Algorithmic contest problems (C++, ALE-Bench) |
55
+ | [`image_gen/`](image_gen/) | Creative | 1 | AI image generation evolution |
56
+ | [`prompt_optimization/`](prompt_optimization/) | Prompts | 1 | Evolve natural-language prompts, not code (HotPotQA) |
57
+
58
+ Each benchmark directory has its own README with setup and run instructions.
59
+
60
+ ## Structure
61
+
62
+ There are three ways to set up a benchmark: a **containerized evaluator** (recommended for new benchmarks), a **Harbor task** (for external benchmark suites), or a **plain Python evaluator** (simplest).
63
+
64
+ ### Containerized evaluator (recommended)
65
+
66
+ ```
67
+ <task>/
68
+ ├── initial_program.py # Starting solution
69
+ ├── config.yaml # System prompt + search/evaluator settings
70
+ └── evaluator/ # Self-contained Docker benchmark
71
+ ├── Dockerfile
72
+ ├── evaluate.sh # Entrypoint (receives solution path + mode)
73
+ ├── evaluator.py # Scoring logic
74
+ ├── requirements.txt # Python dependencies
75
+ └── ... # Any other data/files the evaluator needs
76
+ ```
77
+
78
+ The `evaluator/` directory is the Docker build context. Everything inside it gets copied into the image — data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when `evaluation_file` points to a directory containing a `Dockerfile` and `evaluate.sh`.
79
+
80
+ ### Plain Python evaluator
81
+
82
+ ```
83
+ <task>/
84
+ ├── initial_program.py # Starting solution
85
+ ├── evaluator.py # Scoring function (returns combined_score)
86
+ └── config.yaml # System prompt + search/evaluator settings
87
+ ```
88
+
89
+ Simpler but runs evaluator code directly on the host. Fine for pure-Python tasks with no system dependencies.
90
+
91
+ ### Benchmark resolvers (dynamic problem loading)
92
+
93
+ Some benchmarks support **dynamic problem loading** through a resolver pattern. Instead of providing a static `initial_program.py`, the resolver fetches problems from an external dataset based on configuration parameters.
94
+
95
+ This is useful for benchmark suites with many problems (e.g., KernelBench has hundreds of GPU kernel optimization tasks). The resolver pattern allows you to:
96
+
97
+ 1. Select specific problems via config parameters (e.g., difficulty level, problem ID)
98
+ 2. Automatically generate the initial program from the benchmark dataset
99
+ 3. Configure evaluator settings based on the problem specification
100
+
101
+ #### Using a benchmark with a resolver
102
+
103
+ Benchmarks that support resolvers include a `benchmark` section in their `config.yaml`:
104
+
105
+ ```yaml
106
+ benchmark:
107
+ enabled: true # Enable benchmark loader
108
+ name: kernelbench # Benchmark name (for logging)
109
+ resolver: benchmarks.kernelbench.resolver # Python module path
110
+
111
+ # Benchmark-specific parameters
112
+ level: 2 # Example: difficulty level
113
+ problem_id: 5 # Example: specific problem ID
114
+ ```
115
+
116
+ When running such a benchmark, you don't need to provide an `initial_program` argument:
117
+
118
+ ```bash
119
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
120
+ -c benchmarks/kernelbench/config.yaml \
121
+ --search adaevolve \
122
+ --iterations 50
123
+ ```
124
+
125
+ The resolver automatically fetches the problem and generates the initial program based on the config parameters.
126
+
127
+ #### Implementing a benchmark resolver
128
+
129
+ To add resolver support to a new benchmark:
130
+
131
+ 1. **Create `benchmarks/your_benchmark/resolver.py`** implementing the `BenchmarkResolver` interface:
132
+
133
+ ```python
+ from pathlib import Path
+ from typing import Any, Dict
+
+ # Assumption: BenchmarkResolution is defined next to BenchmarkResolver in base.py
+ from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver
+
+ class YourBenchmarkResolver(BenchmarkResolver):
+     def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
+         """
+         Fetch problem and generate initial program.
+
+         Args:
+             config: The benchmark section from config.yaml
+             output_dir: Directory where generated files should be placed
+
+         Returns:
+             BenchmarkResolution containing:
+             - initial_program_path: Path to generated initial program
+             - evaluator_path: Path to evaluator
+             - evaluator_env_vars: Dict of environment variables for the evaluator
+         """
+         # 1. Fetch problem from dataset based on config parameters
+         # 2. Generate initial_program.py with EVOLVE-BLOCK markers
+         # 3. Prepare evaluator environment variables (returned, not set globally)
+         # 4. Return BenchmarkResolution with paths and env vars
+
+         initial_program_path = output_dir / "initial_program.py"
+         evaluator_path = Path(__file__).parent / "evaluator"
+         evaluator_env_vars = {
+             "BENCHMARK_PARAM": "value",
+             # Add benchmark-specific configuration here
+         }
+
+         return BenchmarkResolution(
+             initial_program_path=str(initial_program_path),
+             evaluator_path=str(evaluator_path),
+             evaluator_env_vars=evaluator_env_vars,
+         )
+
+ # Module-level resolver instance
+ resolver = YourBenchmarkResolver()
173
+ ```
174
+
175
+ 2. **Add `benchmark` section to `config.yaml`** with your resolver module path and all benchmark-specific parameters
176
+
177
+ 3. **Use the same CLI pattern** (no initial_program argument needed)
178
+
179
+ See the implementation in:
180
+ - `skydiscover/benchmarks/base.py` - Base resolver interface
181
+ - `benchmarks/kernelbench/resolver.py` - KernelBench example implementation
182
+
183
+ ## Adding a Benchmark
184
+
185
+ ### Option 1: Containerized evaluator (recommended)
186
+
187
+ Containerized evaluators run inside Docker, so they can have arbitrary dependencies, system packages, data files, etc. without polluting the host. Only two files are **required**: `Dockerfile` and `evaluate.sh`.
188
+
189
+ #### `evaluate.sh`
190
+
191
+ The entrypoint that SkyDiscover calls. It receives two arguments:
192
+
193
+ ```bash
194
+ #!/usr/bin/env bash
195
+ set -euo pipefail
196
+
197
+ PROGRAM="$1" # Path to the candidate solution inside the container
198
+ MODE="$2" # "train" (fast, iterative) or "test" (authoritative, final)
199
+
200
+ python /benchmark/evaluator.py "$PROGRAM"
201
+ ```
202
+
203
+ - **train** mode is called during the optimization loop — should be relatively fast.
204
+ - **test** mode is called once at the end for the best solution — should be the full, authoritative evaluation.
205
+
206
+ Evaluators that don't need the distinction can ignore `$MODE`.
207
+
208
+ #### `evaluate.sh` output (JSON protocol)
209
+
210
+ `evaluate.sh` must write a **single JSON object to stdout**:
211
+
212
+ ```json
213
+ {
214
+ "status": "success",
215
+ "combined_score": 0.73,
216
+ "metrics": {"combined_score": 0.73, "accuracy": 0.85, "speed": 1.2},
217
+ "artifacts": {"error": "...", "details": "..."}
218
+ }
219
+ ```
220
+
221
+ - `combined_score` (float, required): the primary optimization target.
222
+ - `metrics` (dict of string → float): all numeric scores. Must include `combined_score`.
223
+ - `artifacts` (dict of string → string, optional): non-numeric context (errors, diagnostics).
224
+ - `status`: `"success"`, `"error"`, or `"timeout"`.
225
+
226
+ Any output to **stderr** is captured for debugging but does not affect scoring. If your evaluator prints debug output, make sure it goes to stderr, not stdout.
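To make the protocol above concrete, here is a minimal sketch of a checker for the text an `evaluate.sh` run prints to stdout. The name `check_result` is hypothetical and not part of SkyDiscover, and the checks only mirror the field list above; SkyDiscover's own parsing may be more lenient or stricter.

```python
import json

ALLOWED_STATUS = {"success", "error", "timeout"}


def check_result(stdout_text):
    """Return a list of protocol problems; an empty list means the output looks valid."""
    try:
        obj = json.loads(stdout_text)
    except json.JSONDecodeError as exc:
        # Stray debug prints on stdout typically end up here.
        return [f"stdout is not a single JSON value: {exc}"]
    if not isinstance(obj, dict):
        return ["top-level JSON value must be an object"]
    problems = []
    if obj.get("status") not in ALLOWED_STATUS:
        problems.append("status must be success, error, or timeout")
    score = obj.get("combined_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("combined_score must be a number")
    metrics = obj.get("metrics")
    if not isinstance(metrics, dict) or "combined_score" not in metrics:
        problems.append("metrics must be an object that includes combined_score")
    artifacts = obj.get("artifacts", {})
    if not isinstance(artifacts, dict):
        problems.append("artifacts, when present, must be an object")
    return problems
```

Running the checker on captured stdout during development is one way to catch a debug `print` that slipped onto stdout instead of stderr.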
227
+
228
+ #### `Dockerfile`
229
+
230
+ A standard Dockerfile. The only requirement is that `evaluate.sh` is executable:
231
+
232
+ ```dockerfile
233
+ FROM python:3.12-slim
234
+ WORKDIR /benchmark
235
+
236
+ COPY requirements.txt .
237
+ RUN pip install --no-cache-dir -r requirements.txt
238
+
239
+ COPY . .
240
+ RUN chmod +x evaluate.sh
241
+
242
+ ENTRYPOINT ["./evaluate.sh"]
243
+ ```
244
+
245
+ #### Migrating an existing Python evaluator
246
+
247
+ If you have an existing `evaluate(program_path) -> dict` function, you can wrap it with the backwards-compatibility wrapper:
248
+
249
+ 1. Copy `skydiscover/evaluation/wrapper.py` into your `evaluator/` directory.
250
+ 2. Add this to the bottom of your `evaluator.py`:
251
+
252
+ ```python
253
+ if __name__ == "__main__":
254
+ from wrapper import run
255
+ run(evaluate)
256
+ ```
257
+
258
+ The wrapper handles stdout redirection (so debug prints don't corrupt JSON), error formatting, and metric/artifact separation.
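The three responsibilities listed above can be sketched roughly as follows. This is not the code in `skydiscover/evaluation/wrapper.py`; `run_wrapped` is a hypothetical stand-in, and the rule "numeric values are metrics, everything else is an artifact" is an assumption made for illustration.

```python
import contextlib
import sys


def run_wrapped(evaluate, program_path):
    """Hypothetical stand-in for the real wrapper; returns a JSON-ready dict."""
    try:
        # Send anything evaluate() prints to stderr so stdout stays clean.
        with contextlib.redirect_stdout(sys.stderr):
            raw = evaluate(program_path)
        # Assumed split: numeric values are metrics, the rest are string artifacts.
        metrics = {
            key: float(value)
            for key, value in raw.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
        artifacts = {key: str(value) for key, value in raw.items() if key not in metrics}
        metrics.setdefault("combined_score", 0.0)
        result = {
            "status": "success",
            "combined_score": metrics["combined_score"],
            "metrics": metrics,
        }
        if artifacts:
            result["artifacts"] = artifacts
    except Exception as exc:
        # Error formatting: report the failure as data instead of crashing.
        result = {
            "status": "error",
            "combined_score": 0.0,
            "metrics": {"combined_score": 0.0},
            "artifacts": {"error": f"{type(exc).__name__}: {exc}"},
        }
    return result
```

A caller would serialize the returned dict with `json.dumps` and write it to stdout as the only output on that stream.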
259
+
260
+ #### Running a containerized benchmark
261
+
262
+ Point `evaluation_file` at the `evaluator/` directory:
263
+
264
+ ```bash
265
+ skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
266
+ benchmarks/math/circle_packing_rect/evaluator \
267
+ -c benchmarks/math/circle_packing_rect/config.yaml \
268
+ -s best_of_n -i 50
269
+ ```
270
+
271
+ SkyDiscover will automatically build the Docker image, start a persistent container, and run evaluations inside it.
272
+
273
+ #### Example to copy
274
+
275
+ Simple containerized benchmark: [`math/heilbronn_triangle/`](math/heilbronn_triangle/)
276
+
277
+ ### Option 2: Harbor tasks (external benchmarks)
278
+
279
+ SkyDiscover natively supports [Harbor](https://harborframework.com/)-format tasks. This lets you run external benchmark suites like [AlgoTune](https://github.com/oripress/AlgoTune) (154 algorithm optimization tasks) without any conversion.
280
+
281
+ A Harbor task directory looks like this:
282
+
283
+ ```
284
+ task_dir/
285
+ ├── task.toml # Metadata, timeouts
286
+ ├── instruction.md # Problem description (shown to the LLM as context)
287
+ ├── environment/
288
+ │ └── Dockerfile # Container image definition
289
+ ├── tests/
290
+ │ ├── test.sh # Verification entrypoint
291
+ │ └── ... # Supporting test files (evaluator.py, test data, etc.)
292
+ └── solution/ # Reference solution (optional, not shown to LLM)
293
+ └── solve.sh
294
+ ```
295
+
296
+ SkyDiscover auto-detects Harbor tasks when the directory contains `instruction.md`, `tests/`, and `environment/Dockerfile`. The `instruction.md` is used as LLM context, solutions are injected at the path extracted from `solution/solve.sh` (or `instruction.md` as fallback), and rewards are read from `/logs/verifier/reward.txt` or `reward.json`.
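The detection rules described in this README (Harbor tasks here, containerized evaluators earlier) can be approximated with a few filesystem checks. The function below is a sketch under that reading; `detect_layout` and its return strings are hypothetical, and SkyDiscover's actual detection code may differ in order or detail.

```python
from pathlib import Path


def detect_layout(path: str) -> str:
    """Classify a benchmark path using only the file checks described in this README."""
    p = Path(path)
    if p.is_dir():
        # Harbor task: instruction.md, tests/, and environment/Dockerfile.
        if (
            (p / "instruction.md").is_file()
            and (p / "tests").is_dir()
            and (p / "environment" / "Dockerfile").is_file()
        ):
            return "harbor"
        # Containerized evaluator: Dockerfile plus evaluate.sh.
        if (p / "Dockerfile").is_file() and (p / "evaluate.sh").is_file():
            return "containerized"
        return "unknown"
    # Anything else is treated as a plain Python evaluator file.
    return "python" if p.suffix == ".py" else "unknown"
```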
297
+
298
+ #### Tested Harbor datasets
299
+
300
+ SkyDiscover has been tested with the following Harbor registry benchmarks:
301
+
302
+ | Dataset | Tasks | Domain | Language | Install |
303
+ |---------|-------|--------|----------|---------|
304
+ | [algotune](https://github.com/oripress/AlgoTune) | 154 | Algorithm optimization (speedup scoring) | Python | `harbor datasets download algotune@1.0` |
305
+ | [evoeval](https://github.com/evo-eval/evoeval) | 100 | Code generation (evolved from HumanEval) | Python | `harbor datasets download evoeval@1.0` |
306
+ | [humanevalfix](https://github.com/bigcode-project/octopack) | 164 | Code repair (fix buggy functions) | Python | `harbor datasets download humanevalfix@1.0` |
307
+ | [bigcodebench-hard-complete](https://github.com/bigcode-project/bigcodebench) | 145 | Python programming (reward-based) | Python | `harbor datasets download bigcodebench-hard-complete@1.0.0` |
308
+ | [livecodebench](https://livecodebench.github.io/) | 100 | Competitive programming (stdin/stdout) | Python | `harbor datasets download livecodebench@6.0` |
309
+ | [codepde](https://github.com/LithiumDA/CodePDE) | 5 | Scientific computing (PDE solvers) | Python | `harbor datasets download codepde@1.0` |
310
+ | [crustbench](https://github.com/AInfinity/CRUSTBench) | 100 | C-to-safe-Rust transpilation | Rust | `harbor datasets download crustbench@1.0` |
311
+ | [usaco](https://usaco.org/) | 304 | Competitive programming (USACO) | Python | `harbor datasets download usaco@2.0` |
312
+
313
+ Any Harbor-compatible dataset should work — the evaluator automatically extracts the solution path from the task's `solution/solve.sh` script.
314
+
315
+ #### Running a Harbor task
316
+
317
+ 1. **Install the Harbor CLI and download a dataset:**
318
+
319
+ ```bash
320
+ pip install harbor
321
+ harbor datasets download algotune@1.0 -o /tmp/algotune
322
+ ```
323
+
324
+ This downloads all 154 AlgoTune tasks. Each task is in a subdirectory like `/tmp/algotune/<id>/algotune-<name>/`.
325
+
326
+ 2. **Run SkyDiscover**, pointing at the task directory. The LLM uses `instruction.md` as context and generates solutions from scratch:
327
+
328
+ ```bash
329
+ # AlgoTune (algorithm optimization)
330
+ TASK=/tmp/algotune/2HHbpvzVPo2qakaoGyAVS2/algotune-set-cover
331
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 10
332
+
333
+ # EvoEval (code generation)
334
+ harbor datasets download evoeval@1.0 -o /tmp/evoeval
335
+ TASK=/tmp/evoeval/<id>/<task-name>
336
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5
337
+
338
+ # HumanEvalFix (code repair)
339
+ harbor datasets download humanevalfix@1.0 -o /tmp/humanevalfix
340
+ TASK=/tmp/humanevalfix/<id>/<task-name>
341
+ skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5
342
+ ```
343
+
344
+ SkyDiscover will build the Docker image from `environment/Dockerfile`, upload `tests/` into the container, and start optimizing.
345
+
346
+ > **Note:** Some datasets have heavy Dockerfiles. AlgoTune needs ~10GB disk and 16GB RAM (torch, jax, scipy). BigCodeBench installs R, GDAL, and many system packages. First builds are slow; subsequent runs use Docker cache.
347
+
348
+ #### Other Harbor datasets
349
+
350
+ Any Harbor-compatible dataset works the same way. Run `harbor datasets list` to see all available datasets, then `harbor datasets download <name>` to fetch them.
351
+
352
+ ### Option 3: Plain Python evaluator
353
+
354
+ For simple tasks with no system dependencies, you can use a plain Python evaluator that runs on the host.
355
+
356
+ **Evaluator** (`evaluator.py`) scores whatever the LLM produces:
357
+
358
+ ```python
359
+ def evaluate(program_path: str) -> dict:
360
+ # load and run the program, compute a score
361
+ return {"combined_score": 0.73} # combined_score is required; add other metrics as extra keys
362
+ ```
363
+
364
+ `program_path` is a `.py` file for code tasks or a `.txt` file for prompt tasks. On failure, return `{"combined_score": 0.0, "error": "..."}` instead of raising.
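A fuller sketch of that contract, including the "return instead of raise" rule, might look like the following. The `solve` entry point and the sorting check are hypothetical placeholders for a real task; only the returned keys come from the text above.

```python
import importlib.util
import traceback


def evaluate(program_path: str) -> dict:
    """Score a candidate; on any failure return a zero score instead of raising."""
    try:
        # Load the candidate file as a module (assumes it is a .py file).
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Hypothetical contract: the candidate exposes solve(input_data).
        output = module.solve([3, 1, 2])
        score = 1.0 if output == [1, 2, 3] else 0.0
        return {"combined_score": score}
    except Exception:
        # Keep the traceback so the failure is visible to the search loop.
        return {"combined_score": 0.0, "error": traceback.format_exc()}
```

Catching the broad `Exception` is deliberate here: a candidate that fails to import, crashes, or lacks `solve` should all score zero rather than abort the run.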
365
+
366
+ ### Seed program
367
+
368
+ **Seed** (`initial_program.py` or `initial_prompt.txt`) is the starting solution. Mark the region for the LLM to evolve:
369
+
370
+ ```python
371
+ # EVOLVE-BLOCK-START
372
+ def solve(input_data):
373
+ return input_data # LLM will improve this
374
+ # EVOLVE-BLOCK-END
375
+ ```
376
+
377
+ For prompt optimization, use a plain `.txt` file with no markers.
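For code seeds, the marked region can be located with a small regular expression. This is an illustrative sketch only: `extract_evolve_blocks` is a hypothetical helper, and the way SkyDiscover itself parses the markers is not shown in this README.

```python
import re

# Marker strings as shown in the seed example above.
BLOCK_RE = re.compile(
    r"^[ \t]*# EVOLVE-BLOCK-START[ \t]*\n(.*?)^[ \t]*# EVOLVE-BLOCK-END",
    re.DOTALL | re.MULTILINE,
)


def extract_evolve_blocks(source: str) -> list:
    """Return the text between each START/END marker pair (hypothetical helper)."""
    return [match.group(1) for match in BLOCK_RE.finditer(source)]
```

Code outside the markers (imports, fixed harness functions) is left out of the returned list, which matches the idea that only the marked region is meant to change.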
378
+
379
+ ### Config
380
+
381
+ **Config** (`config.yaml`) sets the system prompt and search settings. For prompt optimization, set `language: text` and `diff_based_generation: false`.
382
+
383
+ Simple prompt example to copy: [`prompt_optimization/hotpot_qa/`](prompt_optimization/hotpot_qa/)
benchmarks/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """SkyDiscover benchmarks package. This is used for importing benchmark resolver modules."""
benchmarks/gpu_mode/README.md ADDED
@@ -0,0 +1,92 @@
1
+ # GPU Mode: Triton Kernel Optimization
2
+
3
+ Evolve high-performance GPU kernels using SkyDiscover. Each benchmark provides a reference PyTorch implementation and scores submissions by runtime — faster is better. Pure PyTorch submissions are accepted; Triton is not required.
4
+
5
+ ## Benchmarks
6
+
7
+ | Benchmark | Operation | Tolerance | GPU |
8
+ |-----------|-----------|-----------|-----|
9
+ | [`vecadd`](vecadd/) | Float16 element-wise `C = A + B` | rtol/atol=1e-3 | H100 |
10
+ | [`grayscale`](grayscale/) | RGB → Grayscale (`0.2989R + 0.5870G + 0.1140B`) | rtol/atol=1e-4 | H100 |
11
+ | [`trimul`](trimul/) | Triangle multiplicative update (AlphaFold3/Chai/Protenix) | rtol/atol=0.02 | H100 |
12
+ | [`mla_decode`](mla_decode/) | Multi-head latent attention decode (DeepSeek-V2/V3) | rtol/atol=0.06 (bfloat16) | **H200** |
13
+
14
+ ## Quick Start
15
+
16
+ ```bash
17
+ # Run on local GPU
18
+ uv run skydiscover-run \
19
+ benchmarks/gpu_mode/trimul/initial_program.py \
20
+ benchmarks/gpu_mode/trimul/evaluator.py \
21
+ -c benchmarks/gpu_mode/trimul/config.yaml \
22
+ -s [your_algorithm] \
23
+ -i 50
24
+
25
+ # Run on Modal cloud GPU (set GPU type per benchmark)
26
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
27
+ uv run skydiscover-run \
28
+ benchmarks/gpu_mode/trimul/initial_program.py \
29
+ benchmarks/gpu_mode/trimul/evaluator.py \
30
+ -c benchmarks/gpu_mode/trimul/config.yaml \
31
+ -s [your_algorithm] \
32
+ -i 50
33
+ ```
34
+
35
+ > **Note:** `mla_decode` requires `GPUMODE_MODAL_GPU=H200` — H100 (80GB) does not have enough VRAM.
36
+
37
+ ## Writing a Submission
38
+
39
+ Your program must define a `custom_kernel(data)` function. The `data` argument is problem-specific (see each benchmark's `reference.py` for the exact type). Return the computed result.
40
+
41
+ ```python
42
+ # EVOLVE-BLOCK-START
43
+ import torch
44
+ import triton
45
+ import triton.language as tl
46
+
47
+ def custom_kernel(data):
48
+ # data is a problem-specific input (tensor, dataclass, etc.)
49
+ # return the computed result
50
+ ...
51
+ # EVOLVE-BLOCK-END
52
+ ```
53
+
54
+ ## Scoring
55
+
56
+ All benchmarks except `vecadd` use the same formula:
57
+
58
+ ```
59
+ combined_score = SCORE_SCALE / geom_mean_us
60
+ ```
61
+
62
+ `geom_mean_us` is the geometric mean of kernel runtimes in microseconds across all benchmark cases. Higher score = faster kernel. `SCORE_SCALE` is `3000.0` for all current benchmarks.
63
+
64
+ `vecadd` uses a different combined formula (`0.3 * correctness + speedup`) — see its README for details.
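As a worked illustration of the `SCORE_SCALE / geom_mean_us` formula, the sketch below computes the geometric mean through logarithms. The function names are hypothetical; the real computation lives in `shared_eval.py` and may be written differently.

```python
import math

SCORE_SCALE = 3000.0  # value quoted above for the current benchmarks


def geometric_mean(values):
    """Geometric mean via logs, which avoids overflow for long lists."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("runtimes must be positive and non-empty")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combined_score(runtimes_us):
    """Higher is better: halving every runtime doubles the score."""
    return SCORE_SCALE / geometric_mean(runtimes_us)
```

For example, runtimes of 10 µs and 1000 µs have a geometric mean of 100 µs, so the score is 3000 / 100 = 30. The geometric mean keeps one very slow case from dominating the way an arithmetic mean would.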
65
+
66
+ ## Evaluation Pipeline
67
+
68
+ The shared evaluator (`shared_eval.py`) handles both local and Modal paths:
69
+
70
+ 1. **Correctness** — runs all `TEST_CASES` from `reference.py`, checks output against reference within tolerance
71
+ 2. **Warmup** — runs one benchmark case briefly to trigger Triton JIT compilation
72
+ 3. **Benchmark** — times `BENCHMARK_CASES` using CUDA events, repeats until error < 0.1% or time budget is exhausted
73
+ 4. **Score** — geometric mean of benchmark runtimes → `SCORE_SCALE / geom_mean_us`
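The stopping rule in step 3 can be sketched in plain Python. Two assumptions are made here and are not confirmed by this README: "error" is read as the standard error of the mean divided by the mean, and wall-clock timing with `time.perf_counter` stands in for the CUDA events the real evaluator uses. The `benchmark` function and its parameters are hypothetical.

```python
import math
import statistics
import time


def benchmark(fn, rel_err_target=0.001, time_budget_s=1.0, min_runs=3, max_runs=10_000):
    """Time fn() repeatedly; stop once the relative standard error is small enough.

    Assumption: "error" means standard error of the mean divided by the mean.
    """
    samples_us = []
    start = time.perf_counter()
    while len(samples_us) < max_runs:
        t0 = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - t0) * 1e6)
        # stdev needs at least two samples, so never test before that.
        if len(samples_us) >= max(2, min_runs):
            mean = statistics.fmean(samples_us)
            sem = statistics.stdev(samples_us) / math.sqrt(len(samples_us))
            if mean > 0 and sem / mean < rel_err_target:
                break
        # Give up once the time budget is spent, even if still noisy.
        if time.perf_counter() - start > time_budget_s:
            break
    return statistics.fmean(samples_us), len(samples_us)
```

The returned mean (in microseconds) for each benchmark case would then feed the geometric mean in step 4.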
74
+
75
+ ## Directory Structure
76
+
77
+ ```
78
+ gpu_mode/
79
+ ├── shared_eval.py # Shared evaluator (correctness + benchmarking logic)
80
+ ├── modal_eval.py # Modal cloud GPU runners (H100, A100, L40S, T4, H200)
81
+ ├── vecadd/ # Float16 vector addition
82
+ ├── grayscale/ # RGB → grayscale conversion
83
+ ├── trimul/ # Triangle multiplicative update
84
+ └── mla_decode/ # MLA decode (DeepSeek attention)
85
+
86
+ # Each benchmark contains:
87
+ # initial_program.py — starting kernel
88
+ # evaluator.py — imports shared_eval, exposes evaluate()
89
+ # reference.py — reference kernel, test/benchmark cases, SCORE_SCALE
90
+ # config.yaml — search config
91
+ # requirements.txt — dependencies
92
+ ```
benchmarks/gpu_mode/grayscale/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: RGB to Grayscale
2
+
3
+ Evolve a Triton kernel for RGB to Grayscale conversion using SkyDiscover.
4
+
5
+ **Formula:** `Y = 0.2989 * R + 0.5870 * G + 0.1140 * B`
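As a quick sanity check on the formula, here is a pure-Python version for a single pixel. It is only an arithmetic illustration; the benchmark itself operates on GPU tensors, and `luminance` is a hypothetical name.

```python
def luminance(r: float, g: float, b: float) -> float:
    """Weighted sum of the three channels, using the coefficients above."""
    return 0.2989 * r + 0.5870 * g + 0.1140 * b
```

The three weights sum to 0.9999 rather than 1, so a pure white pixel `(1.0, 1.0, 1.0)` maps to roughly 0.9999. A kernel that normalizes the weights to sum to exactly 1 would differ from the reference by about 1e-4 on bright pixels, which is the same order as the benchmark's stated tolerance.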
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/grayscale/initial_program.py \
14
+ benchmarks/gpu_mode/grayscale/evaluator.py \
15
+ -c benchmarks/gpu_mode/grayscale/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Must pass all test cases (rtol/atol=1e-4 vs PyTorch reference)
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/grayscale/initial_program.py \
31
+ benchmarks/gpu_mode/grayscale/evaluator.py \
32
+ -c benchmarks/gpu_mode/grayscale/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
benchmarks/gpu_mode/grayscale/config.yaml ADDED
@@ -0,0 +1,176 @@
1
+ # GPU Mode: Grayscale Triton Kernel Optimization
2
+
3
+ max_iterations: 100
4
+ checkpoint_interval: 1
5
+ log_level: "INFO"
6
+
7
+ llm:
8
+ models:
9
+ - name: "gpt-5"
10
+ weight: 1.0
11
+ api_base: https://api.openai.com/v1
12
+ temperature: 1.0
13
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject requests that set both temperature and top_p
14
+ max_tokens: 32000
15
+ timeout: 600
16
+
17
+ prompt:
18
+ system_message: |
19
+ You are an expert Triton engineer tasked with translating PyTorch code into highly optimized Triton kernel code.
20
+
21
+ You will be implementing a Grayscale conversion kernel that converts RGB images to grayscale using the luminance formula:
22
+ Y = 0.2989 * R + 0.5870 * G + 0.1140 * B
23
+
24
+ Your task:
25
+ - Implement the grayscale conversion as a highly optimized Triton kernel.
26
+ - The input is an (H, W, 3) float32 tensor and a pre-allocated (H, W) float32 output tensor.
27
+ - Your function receives `data = (rgb, output)` and should write the result into `output` and return it.
28
+
29
+ Your function should be defined as 'custom_kernel' with the following signature:
30
+ Input:
31
+ - `data`: Tuple of (rgb: torch.Tensor, output: torch.Tensor)
32
+ - rgb: Input tensor of shape [H, W, 3] (float32, contiguous)
33
+ - output: Pre-allocated output tensor of shape [H, W] (float32, contiguous)
34
+
35
+ Output:
36
+ - output: Grayscale tensor [H, W] (write in-place to the provided output tensor and return it)
37
+
38
+ Here is the reference PyTorch implementation:
39
+ ```python
40
+ import torch
41
+
42
+ # Reference code in PyTorch
43
+ def ref_kernel(data):
44
+ rgb, output = data
45
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
46
+ output[...] = torch.sum(rgb * weights, dim=-1)
47
+ return output
48
+ ```
49
+
50
+ Here is an example of a basic Triton implementation:
51
+ ```python
52
+ import torch
53
+ import triton
54
+ import triton.language as tl
55
+
56
+ @triton.jit
57
+ def grayscale_kernel(
58
+ rgb_ptr, out_ptr,
59
+ H, W,
60
+ stride_h, stride_w, stride_c,
61
+ BLOCK_SIZE: tl.constexpr,
62
+ ):
63
+ pid = tl.program_id(0)
64
+ n_pixels = H * W
65
+ block_start = pid * BLOCK_SIZE
66
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
67
+ mask = offsets < n_pixels
68
+
69
+ h_idx = offsets // W
70
+ w_idx = offsets % W
71
+
72
+ r = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 0 * stride_c, mask=mask)
73
+ g = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 1 * stride_c, mask=mask)
74
+ b = tl.load(rgb_ptr + h_idx * stride_h + w_idx * stride_w + 2 * stride_c, mask=mask)
75
+
76
+ gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
77
+
78
+ out_offsets = h_idx * W + w_idx
79
+ tl.store(out_ptr + out_offsets, gray, mask=mask)
80
+
81
+ def custom_kernel(data):
82
+ rgb, output = data
83
+ H, W, C = rgb.shape
84
+ assert C == 3
85
+ rgb = rgb.contiguous()
86
+ stride_h, stride_w, stride_c = rgb.stride()
87
+ n_pixels = H * W
88
+ BLOCK_SIZE = 1024
89
+ grid = (triton.cdiv(n_pixels, BLOCK_SIZE),)
90
+ grayscale_kernel[grid](
91
+ rgb, output, H, W,
92
+ stride_h, stride_w, stride_c,
93
+ BLOCK_SIZE=BLOCK_SIZE,
94
+ )
95
+ return output
96
+ ```
97
+
98
+ To help you understand which triton version we are using, here is some example triton code for an unrelated task:
99
+ ```python
100
+ import triton
101
+ import triton.language as tl
102
+
103
+ @triton.jit
104
+ def matmul_persistent_ws_kernel(
105
+ a_ptr, b_ptr, c_ptr, M, N, K,
106
+ stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
107
+ BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
108
+ ):
109
+ pid = tl.program_id(axis=0) # async_task 0, 1, 2
110
+ num_pid_m = tl.cdiv(M, BLOCK_M) # async_task 0, 1, 2
111
+ num_pid_n = tl.cdiv(N, BLOCK_N) # async_task 0, 1, 2
112
+ pid_m = pid // num_pid_n # async_task 0, 1, 2
113
+ pid_n = pid % num_pid_n # async_task 0, 1, 2
114
+ offs_m_1 = pid_m * BLOCK_M + tl.arange(0, BLOCK_M // 2) # async_task 0, 1, 2
115
+ offs_m_2 = pid_m * BLOCK_M + tl.arange(BLOCK_M // 2, BLOCK_M) # async_task 0, 1, 2
116
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) # async_task 0, 1, 2
117
+ offs_k = tl.arange(0, BLOCK_K) # async_task 0
118
+ a_ptrs_1 = a_ptr + (offs_m_1[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
119
+ a_ptrs_2 = a_ptr + (offs_m_2[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
120
+ b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) # async_task 0
121
+ acc_1 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 1
122
+ acc_2 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 2
123
+ for k in range(0, tl.cdiv(K, BLOCK_K)): # async_task 0, 1, 2
124
+ a_1 = tl.load(a_ptrs_1) # async_task 0
125
+ a_2 = tl.load(a_ptrs_2) # async_task 0
126
+ b = tl.load(b_ptrs) # async_task 0
127
+ acc_1 += tl.dot(a_1, b) # async_task 1
128
+ acc_2 += tl.dot(a_2, b) # async_task 2
129
+ a_ptrs_1 += BLOCK_K * stride_ak # async_task 0
130
+ a_ptrs_2 += BLOCK_K * stride_ak # async_task 0
131
+ b_ptrs += BLOCK_K * stride_bk # async_task 0
132
+ c_1 = acc_1.to(tl.float16) # async_task 1
133
+ c_2 = acc_2.to(tl.float16) # async_task 2
134
+ c_ptrs_1 = c_ptr + stride_cm * offs_m_1[:, None] + stride_cn * offs_n[None, :] # async_task 1
135
+ c_ptrs_2 = c_ptr + stride_cm * offs_m_2[:, None] + stride_cn * offs_n[None, :] # async_task 2
136
+ tl.store(c_ptrs_1, c_1) # async_task 1
137
+ tl.store(c_ptrs_2, c_2) # async_task 2
138
+ ```
139
+
140
+ A few general triton tips:
141
+ - tl.arange only takes in constexpr arguments (static or tl.constexpr)
142
+ - You cannot use continue in your kernel code
143
+ - tl.dot can only take in two input tensors
144
+ - There is no tl.mean
145
+
146
+ Here are the different configs that your kernel will be benchmarked on (optimize runtime for these):
147
+
148
+ Benchmark Cases:
149
+ - {"size": 1024} (1024x1024 RGB image)
150
+ - {"size": 2048} (2048x2048 RGB image)
151
+ - {"size": 4096} (4096x4096 RGB image)
152
+ - {"size": 8192} (8192x8192 RGB image)
153
+
154
+ Key optimization strategies to consider:
155
+ - Memory coalescing: the RGB data is (H, W, 3), so adjacent pixels in a row are stride-3 apart. Consider vectorized loads or layout transformations.
156
+ - Block size tuning: larger blocks amortize launch overhead but may reduce occupancy.
157
+ - Use of shared memory or register-level optimizations for the weighted sum.
158
+ - Vectorized loads (e.g., loading 3 floats at once per pixel).
159
+
160
+ Rules:
161
+ - The tensor arguments passed in will already be on your CUDA device.
162
+ - Define all of your code in one final ```python ``` block.
163
+ - We will test the correctness of your kernel on multiple input shapes, so make sure it supports different potential test cases.
164
+ - Your final output must be in float32.
165
+ - You must use triton 3.3.1 and these kernels will be run on an H100.
166
+ - You do not have to implement everything in triton; you may choose to have some of the operations done in pytorch. However, you must implement at least part of the operations in a kernel.
167
+ - Include a short docstring at the top summarizing your algorithm.
168
+ evaluator:
169
+ timeout: 600
170
+ max_retries: 3
171
+ cascade_evaluation: true
172
+ cascade_thresholds: [0.4, 0.3]
173
+
174
+ diff_based_generation: true
175
+ max_solution_length: 60000
176
+ random_seed: 42
benchmarks/gpu_mode/grayscale/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for Grayscale — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/grayscale/initial_program.py ADDED
@@ -0,0 +1,57 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial Grayscale submission with Triton kernel.
4
+ Y = 0.2989 R + 0.5870 G + 0.1140 B
5
+ """
6
+
7
+ import torch
8
+ import triton
9
+ import triton.language as tl
10
+
11
+
12
+ @triton.jit
13
+ def grayscale_kernel(
14
+ rgb_ptr, out_ptr,
15
+ H, W,
16
+ stride_h, stride_w, stride_c,
17
+ BLOCK_SIZE: tl.constexpr,
18
+ ):
19
+ pid = tl.program_id(0)
20
+ n_pixels = H * W
21
+ block_start = pid * BLOCK_SIZE
22
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
23
+ mask = offsets < n_pixels
24
+
25
+ h_idx = offsets // W
26
+ w_idx = offsets % W
27
+
28
+ r_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 0 * stride_c
29
+ g_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 1 * stride_c
30
+ b_ptr = rgb_ptr + h_idx * stride_h + w_idx * stride_w + 2 * stride_c
31
+
32
+ r = tl.load(r_ptr, mask=mask)
33
+ g = tl.load(g_ptr, mask=mask)
34
+ b = tl.load(b_ptr, mask=mask)
35
+
36
+ gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
37
+
38
+ out_offsets = h_idx * W + w_idx
39
+ tl.store(out_ptr + out_offsets, gray, mask=mask)
40
+
41
+
42
+ def custom_kernel(data):
43
+ rgb, output = data
44
+ H, W, C = rgb.shape
45
+ assert C == 3
46
+ rgb = rgb.contiguous()
47
+ stride_h, stride_w, stride_c = rgb.stride()
48
+ n_pixels = H * W
49
+ BLOCK_SIZE = 1024
50
+ grid = (triton.cdiv(n_pixels, BLOCK_SIZE),)
51
+ grayscale_kernel[grid](
52
+ rgb, output, H, W,
53
+ stride_h, stride_w, stride_c,
54
+ BLOCK_SIZE=BLOCK_SIZE,
55
+ )
56
+ return output
57
+ # EVOLVE-BLOCK-END
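The index arithmetic in `initial_program.py` above can be checked on the CPU. The following is a minimal pure-Python sketch, not part of the commit: `cdiv` and `program_offsets` are hypothetical helper names, and the strides are assumed to be those of a contiguous `(H, W, 3)` tensor, which is what `rgb.contiguous()` guarantees in `custom_kernel`.

```python
def cdiv(a, b):
    """Ceiling division, the same rounding used to size the launch grid."""
    return (a + b - 1) // b


def program_offsets(pid, block_size, H, W):
    """Model one program instance of the kernel on the CPU.

    Returns a list of (output_offset, rgb_base_offset) pairs, one per
    unmasked lane. Hypothetical helper for illustration only.
    """
    n_pixels = H * W
    # Element strides of a contiguous (H, W, 3) tensor (assumption).
    stride_h, stride_w = W * 3, 3
    pairs = []
    for lane in range(block_size):
        offset = pid * block_size + lane
        if offset >= n_pixels:  # mirrors mask = offsets < n_pixels
            continue
        h_idx, w_idx = divmod(offset, W)  # offsets // W and offsets % W
        pairs.append((h_idx * W + w_idx, h_idx * stride_h + w_idx * stride_w))
    return pairs


# A 2x3 image with 4 lanes per program needs 2 programs.
grid = cdiv(2 * 3, 4)
last_block = program_offsets(1, 4, 2, 3)
```

For contiguous input every base offset equals `3 * output_offset`, so the `//` and `%` split is only needed when the strides are not the contiguous ones. This is also where the stride-3 access pattern mentioned in the prompt's coalescing note comes from.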
benchmarks/gpu_mode/grayscale/reference.py ADDED
@@ -0,0 +1,103 @@
1
+ """
2
+ Reference implementation for Grayscale Triton kernel.
3
+ Y = 0.2989 R + 0.5870 G + 0.1140 B
4
+
5
+ Input: (H, W, 3) float32 RGB tensor
6
+ Output: (H, W) float32 grayscale tensor
7
+ """
8
+
9
+ import torch
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Scoring and benchmark configuration (read by shared_eval.py)
13
+ # ---------------------------------------------------------------------------
14
+
15
+ SCORE_SCALE = 3000.0
16
+
17
+ # grayscale uses CUDA events timing, 0.1% rel error, 120s wall clock timeout
18
+ BENCH_USE_CUDA_EVENTS = True
19
+ BENCH_REL_ERROR = 0.001
20
+ BENCH_WALL_TIMEOUT_NS = 120e9
21
+ BENCH_NO_GRAD = False
22
+ BENCH_MAX_REPEATS = 100
23
+ BENCH_MAX_TIME_NS = 10e9
24
+ BENCH_WARMUP_STYLE = 'tiny_benchmark'
25
+
26
+ # ---------------------------------------------------------------------------
27
+ # Test / benchmark cases
28
+ # ---------------------------------------------------------------------------
29
+
30
+ TEST_CASES = [
31
+ {"size": 256, "seed": 42},
32
+ {"size": 512, "seed": 123},
33
+ {"size": 1024, "seed": 456},
34
+ {"size": 2048, "seed": 789},
35
+ ]
36
+
37
+ BENCHMARK_CASES = [
38
+ {"size": 1024, "seed": 1001},
39
+ {"size": 2048, "seed": 1002},
40
+ {"size": 4096, "seed": 1003},
41
+ {"size": 8192, "seed": 1004},
42
+ ]
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Reference kernel
46
+ # ---------------------------------------------------------------------------
47
+
48
+
49
+ def ref_kernel(data):
50
+ """Reference: Y = 0.2989 R + 0.5870 G + 0.1140 B"""
51
+ rgb, output = data
52
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
53
+ output[...] = torch.sum(rgb * weights, dim=-1)
54
+ return output
55
+
56
+
57
+ def generate_input(size, seed):
58
+ gen = torch.Generator(device="cuda")
59
+ gen.manual_seed(seed)
60
+ x = torch.rand(size, size, 3, device="cuda", dtype=torch.float32, generator=gen).contiguous()
61
+ y = torch.empty(size, size, device="cuda", dtype=torch.float32).contiguous()
62
+ return x, y
63
+
64
+
65
+ def check_implementation(data, submission_output, rtol=1e-4, atol=1e-4):
66
+ ref_output = ref_kernel(data)
67
+ if submission_output.shape != ref_output.shape:
68
+ return False, f"Shape mismatch: expected {ref_output.shape}, got {submission_output.shape}"
69
+ if torch.allclose(submission_output, ref_output, rtol=rtol, atol=atol):
70
+ return True, "Match"
71
+ diff = torch.abs(submission_output.float() - ref_output.float())
72
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
73
+
74
+
75
+ # ---------------------------------------------------------------------------
76
+ # Self-contained reference code for Modal remote execution
77
+ # ---------------------------------------------------------------------------
78
+
79
+ MODAL_REFERENCE_CODE = r'''
80
+ import torch
81
+
82
+ def ref_kernel(data):
83
+ rgb, output = data
84
+ weights = torch.tensor([0.2989, 0.5870, 0.1140], device=rgb.device, dtype=rgb.dtype)
85
+ output[...] = torch.sum(rgb * weights, dim=-1)
86
+ return output
87
+
88
+ def generate_input(size, seed):
89
+ gen = torch.Generator(device="cuda")
90
+ gen.manual_seed(seed)
91
+ x = torch.rand(size, size, 3, device="cuda", dtype=torch.float32, generator=gen).contiguous()
92
+ y = torch.empty(size, size, device="cuda", dtype=torch.float32).contiguous()
93
+ return x, y
94
+
95
+ def check_implementation(data, submission_output, rtol=1e-4, atol=1e-4):
96
+ ref_output = ref_kernel(data)
97
+ if submission_output.shape != ref_output.shape:
98
+ return False, f"Shape mismatch: expected {ref_output.shape}, got {submission_output.shape}"
99
+ if torch.allclose(submission_output, ref_output, rtol=rtol, atol=atol):
100
+ return True, "Match"
101
+ diff = torch.abs(submission_output.float() - ref_output.float())
102
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
103
+ '''
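The tolerance test in `check_implementation` relies on `torch.allclose`, which accepts an element when `|submitted - reference| <= atol + rtol * |reference|`. The sketch below restates that rule in pure Python over flat lists so it can be run without a GPU. `is_close` and `check_flat` are hypothetical names introduced here, and the length comparison only stands in for the tensor shape check.

```python
def is_close(submitted, reference, rtol=1e-4, atol=1e-4):
    """Element-wise rule used by torch.allclose(submitted, reference)."""
    return abs(submitted - reference) <= atol + rtol * abs(reference)


def check_flat(submitted, reference, rtol=1e-4, atol=1e-4):
    """Pure-Python analogue of check_implementation for flat lists."""
    if len(submitted) != len(reference):
        return False, f"Shape mismatch: expected {len(reference)}, got {len(submitted)}"
    if all(is_close(s, r, rtol, atol) for s, r in zip(submitted, reference)):
        return True, "Match"
    max_diff = max(abs(s - r) for s, r in zip(submitted, reference))
    return False, f"Output mismatch: max_diff={max_diff:.6f}"
```

Because the relative term scales with the reference value, the effective tolerance for grayscale outputs in `[0, 1]` lies between `1e-4` and `2e-4`.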
benchmarks/gpu_mode/grayscale/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/gpu_mode/mla_decode/README.md ADDED
@@ -0,0 +1,36 @@
1
+ # GPU Mode: Multi-Head Latent Attention (MLA) Decode
2
+
3
+ Evolve a Triton kernel for the MLA decode operator using SkyDiscover.
4
+
5
+ MLA is the core attention mechanism of DeepSeek-V2/V3, used for efficient inference with a compressed KV cache via LoRA projections and RoPE.
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/mla_decode/initial_program.py \
14
+ benchmarks/gpu_mode/mla_decode/evaluator.py \
15
+ -c benchmarks/gpu_mode/mla_decode/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Must match reference MLA output (rtol=0.06, atol=0.06 in bfloat16)
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ **Note:** This benchmark requires an H200 GPU (141GB VRAM). The H100 (80GB) does not have enough memory.
28
+
29
+ ```bash
30
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H200 \
31
+ uv run skydiscover-run \
32
+ benchmarks/gpu_mode/mla_decode/initial_program.py \
33
+ benchmarks/gpu_mode/mla_decode/evaluator.py \
34
+ -c benchmarks/gpu_mode/mla_decode/config.yaml \
35
+ -s [your_algorithm] -i 50
36
+ ```
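The scoring rule in the README above, `SCORE_SCALE / geom_mean_us`, can be illustrated with a short sketch. `combined_score` is a hypothetical helper, and the runtimes are made-up values for illustration, not measurements. The evaluators in this commit compute the geometric mean with `math.pow(math.prod(...))`; this sketch swaps in the equivalent log-average form.

```python
import math

SCORE_SCALE = 3000.0  # value stated in the README


def combined_score(bench_means_us, score_scale=SCORE_SCALE):
    """Return (geom_mean_us, score) for per-case mean runtimes in microseconds."""
    if not bench_means_us or any(t <= 0 for t in bench_means_us):
        raise ValueError("need at least one positive runtime")
    # Mean of logs, then exponentiate: equal to the n-th root of the product.
    log_mean = sum(math.log(t) for t in bench_means_us) / len(bench_means_us)
    geom_mean_us = math.exp(log_mean)
    return geom_mean_us, score_scale / geom_mean_us


# Hypothetical per-case runtimes in microseconds (not measured results).
geom_mean_us, score = combined_score([10.0, 40.0, 160.0, 640.0])
```

With a geometric mean, halving the runtime of any single benchmark case raises the score by the same factor, so large cases do not dominate small ones.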
benchmarks/gpu_mode/mla_decode/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for MLA Decode — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/modal_eval.py ADDED
@@ -0,0 +1,259 @@
1
+ """
2
+ Shared Modal app for evaluating Triton kernels on cloud GPUs.
3
+ Scoring: score = score_scale / geom_mean_runtime_us.
4
+
5
+ Usage:
6
+ Set GPUMODE_USE_MODAL=true and GPUMODE_MODAL_GPU=H100 (or A100, L40S, T4, H200)
7
+ in environment variables, then call eval functions from evaluators.
8
+ """
9
+
10
+ import modal
11
+
12
+ app = modal.App("gpu-mode-triton-eval")
13
+
14
+ cuda_image = (
15
+ modal.Image.debian_slim(python_version="3.11")
16
+ .pip_install(
17
+ "torch>=2.2.0",
18
+ "triton>=3.0.0",
19
+ "numpy",
20
+ )
21
+ )
22
+
23
+
24
+ def _eval_triton_impl(
25
+ submission_code: str,
26
+ reference_code: str,
27
+ test_cases: list,
28
+ benchmark_cases: list,
29
+ score_scale: float = 3000.0,
30
+ bench_use_cuda_events: bool = True,
31
+ bench_rel_error: float = 0.001,
32
+ bench_wall_timeout_ns: float = 120e9,
33
+ bench_no_grad: bool = False,
34
+ bench_max_repeats: int = 100,
35
+ bench_max_time_ns: float = 10e9,
36
+ bench_warmup_style: str = 'tiny_benchmark',
37
+ ) -> dict:
38
+ """
39
+ Core evaluation logic that runs inside a Modal GPU container.
40
+
41
+ Returns dict with: combined_score, correctness, geom_mean_us, error
42
+ """
43
+ import os
44
+ import sys
45
+ import gc
46
+ import copy
47
+ import math
48
+ import time
49
+ import contextlib
50
+ import dataclasses
51
+ import tempfile
52
+
53
+ # Help with memory fragmentation for large models (MLA bs=128)
54
+ os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
55
+ import importlib.util
56
+ import traceback
57
+
58
+ import torch
59
+ import torch.cuda
60
+
61
+ def clone_data(data):
62
+ if isinstance(data, tuple):
63
+ return tuple(clone_data(x) for x in data)
64
+ elif isinstance(data, list):
65
+ return [clone_data(x) for x in data]
66
+ elif isinstance(data, dict):
67
+ return {k: clone_data(v) for k, v in data.items()}
68
+ elif isinstance(data, torch.Tensor):
69
+ return data.clone()
70
+ elif dataclasses.is_dataclass(data) and not isinstance(data, type):
71
+ fields = {f.name: clone_data(getattr(data, f.name)) for f in dataclasses.fields(data)}
72
+ return type(data)(**fields)
73
+ elif isinstance(data, torch.nn.Module):
74
+ cloned = copy.deepcopy(data)
75
+ if hasattr(data, 'seq_len'):
76
+ cloned.seq_len = data.seq_len
77
+ return cloned
78
+ return data
79
+
80
+ def stats(durations):
81
+ n = len(durations)
82
+ avg = sum(durations) / n
83
+ if n > 1:
84
+ var = sum((x - avg) ** 2 for x in durations) / (n - 1)
85
+ std = math.sqrt(var)
86
+ err = std / math.sqrt(n)
87
+ else:
88
+ std, err = 0.0, 0.0
89
+ return {"runs": n, "mean": avg, "std": std, "err": err}
90
+
91
+ tmpdir = tempfile.mkdtemp()
92
+
93
+ try:
94
+ ref_path = os.path.join(tmpdir, "reference.py")
95
+ sub_path = os.path.join(tmpdir, "submission.py")
96
+
97
+ with open(ref_path, "w") as f:
98
+ f.write(reference_code)
99
+ with open(sub_path, "w") as f:
100
+ f.write(submission_code)
101
+
102
+ sys.path.insert(0, tmpdir)
103
+
104
+ spec = importlib.util.spec_from_file_location("reference", ref_path)
105
+ reference = importlib.util.module_from_spec(spec)
106
+ spec.loader.exec_module(reference)
107
+
108
+ generate_input = reference.generate_input
109
+ check_implementation = reference.check_implementation
110
+
111
+ spec = importlib.util.spec_from_file_location("submission", sub_path)
112
+ submission = importlib.util.module_from_spec(spec)
113
+ spec.loader.exec_module(submission)
114
+ custom_kernel = submission.custom_kernel
115
+
116
+ # Correctness tests (use no_grad to reduce memory from autograd)
117
+ for i, test_args in enumerate(test_cases):
118
+ data = generate_input(**test_args)
119
+ data_copy = clone_data(data)
120
+ torch.cuda.synchronize()
121
+ with torch.no_grad():
122
+ output = custom_kernel(data)
123
+ torch.cuda.synchronize()
124
+ # Aggressively free GPU memory before ref kernel runs
125
+ del data
126
+ gc.collect()
127
+ torch.cuda.empty_cache()
128
+ passed, msg = check_implementation(data_copy, output)
129
+ del data_copy, output
130
+ gc.collect()
131
+ torch.cuda.empty_cache()
132
+ if not passed:
133
+ return {"combined_score": 0.0, "correctness": 0.0,
134
+ "error": f"Test {i} failed: {msg}"}
135
+
136
+ # Warmup
137
+ wb = benchmark_cases[0]
138
+ if bench_warmup_style == 'timed_calls':
139
+ wdata = generate_input(**wb)
140
+ start = time.perf_counter()
141
+ while time.perf_counter() - start < 0.2:
142
+ custom_kernel(wdata)
143
+ torch.cuda.synchronize()
144
+ else:
145
+ # tiny_benchmark: quick run to trigger compilation
146
+ wdata = generate_input(**wb)
147
+ for _ in range(3):
148
+ custom_kernel(wdata)
149
+ torch.cuda.synchronize()
150
+
151
+ # Benchmarks — collect mean runtimes in nanoseconds
152
+ ctx = torch.no_grad() if bench_no_grad else contextlib.nullcontext()
153
+ bench_means_ns = []
154
+
155
+ for bench_args in benchmark_cases:
156
+ data = generate_input(**bench_args)
157
+ data_copy = clone_data(data)
158
+
159
+ # Correctness check
160
+ with ctx:
161
+ output = custom_kernel(data)
162
+ torch.cuda.synchronize()
163
+ # Aggressively free GPU memory before ref kernel runs
164
+ del data
165
+ gc.collect()
166
+ torch.cuda.empty_cache()
167
+ passed, msg = check_implementation(data_copy, output)
168
+ del data_copy, output
169
+ gc.collect()
170
+ torch.cuda.empty_cache()
171
+ if not passed:
172
+ return {"combined_score": 0.0, "correctness": 1.0,
173
+ "error": f"Benchmark correctness: {msg}"}
174
+
175
+ # Regenerate data for timed runs (was freed during correctness check)
176
+ data = generate_input(**bench_args)
177
+
178
+ # Timed runs
179
+ durations_ns = []
180
+ bm_start = time.perf_counter_ns()
181
+
182
+ with ctx:
183
+ for t in range(bench_max_repeats):
184
+ torch.cuda.synchronize()
185
+
186
+ if bench_use_cuda_events:
187
+ s = torch.cuda.Event(enable_timing=True)
188
+ e = torch.cuda.Event(enable_timing=True)
189
+ s.record()
190
+ output = custom_kernel(data)
191
+ e.record()
192
+ torch.cuda.synchronize()
193
+ duration_ns = s.elapsed_time(e) * 1e6 # ms -> ns
194
+ else:
195
+ start_ns = time.perf_counter_ns()
196
+ output = custom_kernel(data)
197
+ torch.cuda.synchronize()
198
+ duration_ns = time.perf_counter_ns() - start_ns
199
+
200
+ del output
201
+ durations_ns.append(duration_ns)
202
+
203
+ if t > 1:
204
+ st = stats(durations_ns)
205
+ if st["mean"] > 0 and st["err"] / st["mean"] < bench_rel_error:
206
+ break
207
+ if st["mean"] * st["runs"] > bench_max_time_ns:
208
+ break
209
+ if bench_wall_timeout_ns is not None and \
210
+ (time.perf_counter_ns() - bm_start) > bench_wall_timeout_ns:
211
+ break
212
+
213
+ bench_means_ns.append(stats(durations_ns)["mean"])
214
+
215
+ # Scoring: geometric mean → microseconds → score
216
+ means_seconds = [ns / 1e9 for ns in bench_means_ns]
217
+ geom_mean_s = math.pow(math.prod(means_seconds), 1.0 / len(means_seconds))
218
+ geom_mean_us = geom_mean_s * 1e6
219
+ score = score_scale / geom_mean_us
220
+
221
+ bench_means_us = [ns / 1e3 for ns in bench_means_ns]
222
+ return {
223
+ "combined_score": score,
224
+ "correctness": 1.0,
225
+ "geom_mean_us": geom_mean_us,
226
+ "bench_means_us": bench_means_us,
227
+ }
228
+ except Exception as e:
229
+ return {"combined_score": 0.0, "correctness": 0.0,
230
+ "error": f"{e}\n{traceback.format_exc()}"}
231
+ finally:
232
+ if tmpdir in sys.path: sys.path.remove(tmpdir)
233
+ import shutil
234
+ shutil.rmtree(tmpdir, ignore_errors=True)
235
+
236
+
237
+ @app.function(image=cuda_image, gpu="H100", timeout=600)
238
+ def eval_triton_h100(**kwargs) -> dict:
239
+ return _eval_triton_impl(**kwargs)
240
+
241
+
242
+ @app.function(image=cuda_image, gpu="A100", timeout=600)
243
+ def eval_triton_a100(**kwargs) -> dict:
244
+ return _eval_triton_impl(**kwargs)
245
+
246
+
247
+ @app.function(image=cuda_image, gpu="L40S", timeout=600)
248
+ def eval_triton_l40s(**kwargs) -> dict:
249
+ return _eval_triton_impl(**kwargs)
250
+
251
+
252
+ @app.function(image=cuda_image, gpu="T4", timeout=600)
253
+ def eval_triton_t4(**kwargs) -> dict:
254
+ return _eval_triton_impl(**kwargs)
255
+
256
+
257
+ @app.function(image=cuda_image, gpu="H200", timeout=600)
258
+ def eval_triton_h200(**kwargs) -> dict:
259
+ return _eval_triton_impl(**kwargs)
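The timed loop in `_eval_triton_impl` stops early once the standard error of the mean falls below `bench_rel_error` of the mean, or once the estimated time spent exceeds `bench_max_time_ns`. The sketch below replays that stopping rule on a pre-recorded list of durations so it can be run without a GPU. `runs_until_stable` is a hypothetical name, the sample values are invented, and the wall-clock timeout check is left out for brevity.

```python
import math


def stats(durations):
    """Mean, sample standard deviation and standard error, as in modal_eval.py."""
    n = len(durations)
    avg = sum(durations) / n
    if n > 1:
        var = sum((x - avg) ** 2 for x in durations) / (n - 1)
        std = math.sqrt(var)
        err = std / math.sqrt(n)
    else:
        std, err = 0.0, 0.0
    return {"runs": n, "mean": avg, "std": std, "err": err}


def runs_until_stable(samples_ns, rel_error=0.001, max_time_ns=10e9):
    """Consume samples until the early-stop rule fires; return the final stats."""
    durations = []
    for t, duration in enumerate(samples_ns):
        durations.append(duration)
        if t > 1:  # the check starts once three samples exist
            st = stats(durations)
            if st["mean"] > 0 and st["err"] / st["mean"] < rel_error:
                break
            if st["mean"] * st["runs"] > max_time_ns:
                break
    return stats(durations)


steady = runs_until_stable([1000.0] * 100)       # identical samples
noisy = runs_until_stable([900.0, 1100.0] * 50)  # +/-10% jitter
```

A perfectly steady kernel stops after three samples, while the jittery series never reaches the 0.1% threshold and uses all 100 repeats.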
benchmarks/gpu_mode/shared_eval.py ADDED
@@ -0,0 +1,421 @@
1
+ """
2
+ Shared evaluator for GPU Mode Triton kernel optimization.
3
+
4
+ No @triton.jit requirement — pure PyTorch submissions are allowed.
5
+ Supports local GPU and Modal cloud GPU evaluation.
6
+ Set GPUMODE_USE_MODAL=true and GPUMODE_MODAL_GPU=H100 for Modal.
7
+
8
+ Scoring: combined_score = SCORE_SCALE / geom_mean_us (higher is better).
9
+ The geom_mean_us metric is also reported for absolute runtime tracking.
10
+
11
+ Each problem provides a reference.py module with:
12
+ - ref_kernel(data)
13
+ - generate_input(**kwargs)
14
+ - check_implementation(data, output) -> (bool, str)
15
+ - TEST_CASES: list of dicts
16
+ - BENCHMARK_CASES: list of dicts
17
+ - SCORE_SCALE: float
18
+
19
+ Optional benchmark configuration in reference.py:
20
+ - BENCH_USE_CUDA_EVENTS: bool (default True)
21
+ - BENCH_REL_ERROR: float (default 0.001)
22
+ - BENCH_WALL_TIMEOUT_NS: float or None (default 120e9)
23
+ - BENCH_NO_GRAD: bool (default False)
24
+ - BENCH_MAX_REPEATS: int (default 100)
25
+ - BENCH_MAX_TIME_NS: float (default 10e9)
26
+ - BENCH_WARMUP_STYLE: str ('tiny_benchmark' or 'timed_calls', default 'tiny_benchmark')
27
+ """
28
+
29
+ import os
30
+ import sys
31
+ import copy
32
+ import time
33
+ import math
34
+ import contextlib
35
+ import dataclasses
36
+ import traceback
37
+ import importlib.util
38
+
39
+ import torch
40
+ import torch.cuda
41
+
42
+ from skydiscover.evaluation.evaluation_result import EvaluationResult
43
+
44
+ # Import problem-specific reference (the problem dir is already on sys.path
45
+ # because SkyDiscover adds the evaluator file's directory before loading it).
46
+ import reference
47
+
48
+ # ---------------------------------------------------------------------------
49
+ # Environment configuration
50
+ # ---------------------------------------------------------------------------
51
+
52
+ USE_MODAL = os.environ.get("GPUMODE_USE_MODAL", "false").lower() == "true"
53
+ MODAL_GPU = os.environ.get("GPUMODE_MODAL_GPU", "H100")
54
+
55
+ # Read benchmark configuration from reference module with defaults
56
+ SCORE_SCALE = getattr(reference, 'SCORE_SCALE', 3000.0)
57
+ BENCH_USE_CUDA_EVENTS = getattr(reference, 'BENCH_USE_CUDA_EVENTS', True)
58
+ BENCH_REL_ERROR = getattr(reference, 'BENCH_REL_ERROR', 0.001)
59
+ BENCH_WALL_TIMEOUT_NS = getattr(reference, 'BENCH_WALL_TIMEOUT_NS', 120e9)
60
+ BENCH_NO_GRAD = getattr(reference, 'BENCH_NO_GRAD', False)
61
+ BENCH_MAX_REPEATS = getattr(reference, 'BENCH_MAX_REPEATS', 100)
62
+ BENCH_MAX_TIME_NS = getattr(reference, 'BENCH_MAX_TIME_NS', 10e9)
63
+ BENCH_WARMUP_STYLE = getattr(reference, 'BENCH_WARMUP_STYLE', 'tiny_benchmark')
64
+
65
+ # ---------------------------------------------------------------------------
66
+ # Helpers
67
+ # ---------------------------------------------------------------------------
68
+
69
+
70
+ def _clone(data):
71
+ """Recursively clone data, handling tensors, dataclasses, and nn.Modules."""
72
+ if isinstance(data, tuple):
73
+ return tuple(_clone(x) for x in data)
74
+ if isinstance(data, list):
75
+ return [_clone(x) for x in data]
76
+ if isinstance(data, dict):
77
+ return {k: _clone(v) for k, v in data.items()}
78
+ if isinstance(data, torch.Tensor):
79
+ return data.clone()
80
+ if dataclasses.is_dataclass(data) and not isinstance(data, type):
81
+ fields = {f.name: _clone(getattr(data, f.name)) for f in dataclasses.fields(data)}
82
+ return type(data)(**fields)
83
+ if isinstance(data, torch.nn.Module):
84
+ cloned = copy.deepcopy(data)
85
+ if hasattr(data, 'seq_len'):
86
+ cloned.seq_len = data.seq_len
87
+ return cloned
88
+ return data
89
+
90
+
91
+ def _stats(durations):
92
+ """Compute statistics from a list of durations (in nanoseconds)."""
93
+ n = len(durations)
94
+ avg = sum(durations) / n
95
+ if n > 1:
96
+ var = sum((x - avg) ** 2 for x in durations) / (n - 1)
97
+ std = math.sqrt(var)
98
+ err = std / math.sqrt(n)
99
+ else:
100
+ std, err = 0.0, 0.0
101
+ return {"runs": n, "mean": avg, "std": std, "err": err}
102
+
103
+
104
+ def _warmup(kernel_fn, bench_args):
105
+ """Warmup the kernel to trigger Triton compilation."""
106
+ if BENCH_WARMUP_STYLE == 'timed_calls':
107
+ # MLA-style: run repeatedly for 200ms
108
+ data = reference.generate_input(**bench_args)
109
+ start = time.perf_counter()
110
+ while time.perf_counter() - start < 0.2:
111
+ kernel_fn(data)
112
+ torch.cuda.synchronize()
113
+ else:
114
+ # trimul-style: run first benchmark with tiny time budget (10e7 ns = 100 ms)
115
+ _bench_single(kernel_fn, bench_args, max_time_ns=10e7)
116
+
117
+
118
+ def _bench_single(kernel_fn, bench_args, max_time_ns=None):
119
+ """Benchmark a kernel on a single case.
120
+
121
+ Returns (stats_dict_or_None, error_str_or_None).
122
+ Stats dict has durations in nanoseconds.
123
+ """
124
+ if max_time_ns is None:
125
+ max_time_ns = BENCH_MAX_TIME_NS
126
+
127
+ data = reference.generate_input(**bench_args)
128
+ data_copy = _clone(data)
129
+
130
+ # Correctness check first
131
+ ctx = torch.no_grad() if BENCH_NO_GRAD else contextlib.nullcontext()
132
+ with ctx:
133
+ output = kernel_fn(data)
134
+ torch.cuda.synchronize()
135
+ passed, msg = reference.check_implementation(data_copy, output)
136
+ if not passed:
137
+ return None, f"Benchmark correctness: {msg}"
138
+ del output
139
+
140
+ # Timed runs — durations in nanoseconds
141
+ durations_ns = []
142
+ bm_start = time.perf_counter_ns()
143
+
144
+ with ctx:
145
+ for i in range(BENCH_MAX_REPEATS):
146
+ torch.cuda.synchronize()
147
+
148
+ if BENCH_USE_CUDA_EVENTS:
149
+ s = torch.cuda.Event(enable_timing=True)
150
+ e = torch.cuda.Event(enable_timing=True)
151
+ s.record()
152
+ output = kernel_fn(data)
153
+ e.record()
154
+ torch.cuda.synchronize()
155
+ duration_ns = s.elapsed_time(e) * 1e6 # ms -> ns
156
+ else:
157
+ start_ns = time.perf_counter_ns()
158
+ output = kernel_fn(data)
159
+ torch.cuda.synchronize()
160
+ duration_ns = time.perf_counter_ns() - start_ns
161
+
162
+ del output
163
+ durations_ns.append(duration_ns)
164
+
165
+ if i > 1:
166
+ st = _stats(durations_ns)
167
+ if st["mean"] > 0 and st["err"] / st["mean"] < BENCH_REL_ERROR:
168
+ break
169
+ if st["mean"] * st["runs"] > max_time_ns:
170
+ break
171
+ if BENCH_WALL_TIMEOUT_NS is not None and \
172
+ (time.perf_counter_ns() - bm_start) > BENCH_WALL_TIMEOUT_NS:
173
+ break
174
+
175
+ return _stats(durations_ns), None
176
+
177
+
178
+ # ---------------------------------------------------------------------------
179
+ # Modal path
180
+ # ---------------------------------------------------------------------------
181
+
182
+
183
+ def _evaluate_modal(submission_code):
184
+ parent_dir = os.path.dirname(os.path.abspath(__file__))
185
+ if parent_dir not in sys.path:
186
+ sys.path.insert(0, parent_dir)
187
+ from modal_eval import (
188
+ eval_triton_h100, eval_triton_a100, eval_triton_l40s, eval_triton_t4,
189
+ eval_triton_h200, app as modal_app,
190
+ )
191
+
192
+ gpu_fns = {
193
+ "H100": eval_triton_h100,
194
+ "A100": eval_triton_a100,
195
+ "L40S": eval_triton_l40s,
196
+ "T4": eval_triton_t4,
197
+ "H200": eval_triton_h200,
198
+ }
199
+ eval_fn = gpu_fns.get(MODAL_GPU, eval_triton_h100)
200
+
201
+ ref_code = getattr(reference, 'MODAL_REFERENCE_CODE', None)
202
+ if ref_code is None:
203
+ return EvaluationResult(
204
+ metrics={"combined_score": 0.0, "correctness": 0.0},
205
+ artifacts={"error": "MODAL_REFERENCE_CODE not defined in reference.py",
206
+ "failure_stage": "modal_setup"},
207
+ )
208
+
209
+ with modal_app.run():
210
+ result = eval_fn.remote(
211
+ submission_code=submission_code,
212
+ reference_code=ref_code,
213
+ test_cases=reference.TEST_CASES,
214
+ benchmark_cases=reference.BENCHMARK_CASES,
215
+ score_scale=SCORE_SCALE,
216
+ bench_use_cuda_events=BENCH_USE_CUDA_EVENTS,
217
+ bench_rel_error=BENCH_REL_ERROR,
218
+ bench_wall_timeout_ns=BENCH_WALL_TIMEOUT_NS,
219
+ bench_no_grad=BENCH_NO_GRAD,
220
+ bench_max_repeats=BENCH_MAX_REPEATS,
221
+ bench_max_time_ns=BENCH_MAX_TIME_NS,
222
+ bench_warmup_style=BENCH_WARMUP_STYLE,
223
+ )
224
+
225
+ if isinstance(result, dict):
226
+ error = result.get("error")
227
+ score = float(result.get("combined_score", 0.0))
228
+ metrics = {"combined_score": score, "correctness": float(result.get("correctness", 0.0))}
229
+ if "geom_mean_us" in result:
230
+ metrics["geom_mean_us"] = float(result["geom_mean_us"])
231
+ artifacts = {}
232
+ if error:
233
+ artifacts["error"] = str(error)
234
+ artifacts["failure_stage"] = "modal_eval"
235
+ if "bench_means_us" in result:
236
+ for i, us in enumerate(result["bench_means_us"]):
237
+ artifacts[f"bench_{i}_mean_us"] = f"{us:.2f}"
238
+ artifacts["hardware"] = MODAL_GPU
239
+ return EvaluationResult(metrics=metrics, artifacts=artifacts)
240
+
241
+ return EvaluationResult(
242
+ metrics={"combined_score": 0.0, "correctness": 0.0},
243
+ artifacts={"error": "Modal returned unexpected type", "failure_stage": "modal_eval"},
244
+ )
245
+
246
+
247
+ # ---------------------------------------------------------------------------
248
+ # Local path
249
+ # ---------------------------------------------------------------------------
250
+
251
+
252
+ def _evaluate_local(program_path):
253
+ try:
254
+ spec = importlib.util.spec_from_file_location("submission", program_path)
255
+ mod = importlib.util.module_from_spec(spec)
256
+ sys.modules["submission"] = mod
257
+ spec.loader.exec_module(mod)
258
+ custom_kernel = mod.custom_kernel
259
+ except Exception as exc:
260
+ return EvaluationResult(
261
+ metrics={"combined_score": 0.0, "correctness": 0.0},
262
+ artifacts={
263
+ "error": f"Failed to load submission: {exc}",
264
+ "traceback": traceback.format_exc(),
265
+ "failure_stage": "import",
266
+ },
267
+ )
268
+
269
+ # Correctness
270
+ for i, tc in enumerate(reference.TEST_CASES):
271
+ try:
272
+ data = reference.generate_input(**tc)
273
+ data_copy = _clone(data)
274
+ torch.cuda.synchronize()
275
+ output = custom_kernel(data)
276
+ torch.cuda.synchronize()
277
+ passed, msg = reference.check_implementation(data_copy, output)
278
+ if not passed:
279
+ return EvaluationResult(
280
+ metrics={"combined_score": 0.0, "correctness": 0.0},
281
+ artifacts={
282
+ "error": f"Test {i} failed: {msg}",
283
+ "failure_stage": "correctness",
284
+ "test_index": str(i),
285
+ },
286
+ )
287
+ except Exception as exc:
288
+ return EvaluationResult(
289
+ metrics={"combined_score": 0.0, "correctness": 0.0},
290
+ artifacts={
291
+ "error": f"Test {i} error: {exc}",
292
+ "traceback": traceback.format_exc(),
293
+ "failure_stage": "correctness",
294
+ "test_index": str(i),
295
+ },
296
+ )
297
+
298
+ # Warmup
299
+ _warmup(custom_kernel, reference.BENCHMARK_CASES[0])
300
+
301
+ # Benchmarks — collect mean runtimes in nanoseconds
302
+ bench_means_ns = []
303
+ for bench_args in reference.BENCHMARK_CASES:
304
+ st, err = _bench_single(custom_kernel, bench_args)
305
+ if err:
306
+ return EvaluationResult(
307
+ metrics={"combined_score": 0.0, "correctness": 1.0},
308
+ artifacts={"error": err, "failure_stage": "benchmark"},
309
+ )
310
+ bench_means_ns.append(st["mean"])
311
+
312
+ # Scoring: geometric mean of benchmark means → microseconds → score
313
+ means_seconds = [ns / 1e9 for ns in bench_means_ns]
314
+ geom_mean_s = math.pow(math.prod(means_seconds), 1.0 / len(means_seconds))
315
+ geom_mean_us = geom_mean_s * 1e6
316
+ score = SCORE_SCALE / geom_mean_us
317
+
318
+ metrics = {
319
+ "combined_score": score,
320
+ "correctness": 1.0,
321
+ "geom_mean_us": geom_mean_us,
322
+ }
323
+ artifacts = {
324
+ "hardware": "local",
325
+ }
326
+ for i, ns in enumerate(bench_means_ns):
327
+ artifacts[f"bench_{i}_mean_us"] = f"{ns / 1e3:.2f}"
328
+
329
+ return EvaluationResult(
330
+ metrics=metrics,
331
+ artifacts=artifacts,
332
+ )
333
+
334
+
335
+ # ---------------------------------------------------------------------------
336
+ # Public API (used by SkyDiscover)
337
+ # ---------------------------------------------------------------------------
338
+
339
+
340
+ def evaluate(program_path):
341
+ try:
342
+ with open(program_path, "r") as f:
343
+ code = f.read()
344
+ except Exception as exc:
345
+ return EvaluationResult(
346
+ metrics={"combined_score": 0.0, "correctness": 0.0},
347
+ artifacts={"error": f"Failed to read file: {exc}", "failure_stage": "file_read"},
348
+ )
349
+
350
+ if USE_MODAL:
351
+ try:
352
+ return _evaluate_modal(code)
353
+ except Exception as exc:
354
+ return EvaluationResult(
355
+ metrics={"combined_score": 0.0, "correctness": 0.0},
356
+ artifacts={
357
+ "error": f"Modal evaluation failed: {exc}",
358
+ "traceback": traceback.format_exc(),
359
+ "failure_stage": "modal_eval",
360
+ },
361
+ )
362
+
363
+ return _evaluate_local(program_path)
364
+
365
+
366
+ def evaluate_stage1(program_path):
367
+ try:
368
+ with open(program_path, "r") as f:
369
+ code = f.read()
370
+ except Exception as exc:
371
+ return EvaluationResult(
372
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
373
+ artifacts={"error": f"Failed to read file: {exc}", "failure_stage": "file_read"},
374
+ )
375
+
376
+ if "custom_kernel" not in code:
377
+ return EvaluationResult(
378
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
379
+ artifacts={"error": "Missing custom_kernel function", "failure_stage": "validation"},
380
+ )
381
+
382
+ try:
383
+ compile(code, program_path, "exec")
384
+ except SyntaxError as exc:
385
+ return EvaluationResult(
386
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
387
+ artifacts={
388
+ "error": f"Syntax error at line {exc.lineno}: {exc.msg}",
389
+ "failure_stage": "syntax_check",
390
+ },
391
+ )
392
+
393
+ # When using Modal, skip local import check (triton may not be installed locally).
394
+ if not USE_MODAL:
395
+ try:
396
+ spec = importlib.util.spec_from_file_location("submission_check", program_path)
397
+ mod = importlib.util.module_from_spec(spec)
398
+ spec.loader.exec_module(mod)
399
+ if not hasattr(mod, "custom_kernel"):
400
+ return EvaluationResult(
401
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
402
+ artifacts={"error": "custom_kernel not found after import", "failure_stage": "import"},
403
+ )
404
+ except Exception as exc:
405
+ return EvaluationResult(
406
+ metrics={"combined_score": 0.0, "stage1_passed": 0.0},
407
+ artifacts={
408
+ "error": f"Import failed: {exc}",
409
+ "traceback": traceback.format_exc(),
410
+ "failure_stage": "import",
411
+ },
412
+ )
413
+
414
+ return EvaluationResult(
415
+ metrics={"combined_score": 0.5, "stage1_passed": 1.0},
416
+ artifacts={},
417
+ )
418
+
419
+
420
+ def evaluate_stage2(program_path):
421
+ return evaluate(program_path)
benchmarks/gpu_mode/trimul/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: Triangle Multiplicative Update (TriMul)
2
+
3
+ Evolve a Triton kernel for the TriMul operator using SkyDiscover.
4
+
5
+ TriMul is a core operation in protein structure prediction models such as AlphaFold3, Chai, and Protenix.
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/trimul/initial_program.py \
14
+ benchmarks/gpu_mode/trimul/evaluator.py \
15
+ -c benchmarks/gpu_mode/trimul/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness:** Output must match the PyTorch reference within `rtol=0.02`, `atol=0.02`
22
+ - **Score:** `SCORE_SCALE / geom_mean_us` where `SCORE_SCALE = 3000.0`
23
+ - Higher is better (faster runtime = higher score)
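As a rough illustration of the scoring formula above, the sketch below computes the geometric mean of a few per-benchmark mean runtimes and the resulting score. The runtimes are made up and the helper name `trimul_score` is hypothetical. It averages logarithms instead of multiplying the raw values as the evaluator does; the two forms are mathematically equivalent.

```python
import math

SCORE_SCALE = 3000.0  # value stated in this README


def trimul_score(bench_means_us):
    """Return (geom_mean_us, score) for per-benchmark mean runtimes in microseconds.

    Hypothetical helper for illustration; not part of the benchmark code.
    """
    if not bench_means_us or any(t <= 0 for t in bench_means_us):
        raise ValueError("runtimes must be a non-empty list of positive numbers")
    # Geometric mean via the mean of logs (equivalent to the n-th root of the product).
    mean_log = sum(math.log(t) for t in bench_means_us) / len(bench_means_us)
    geom_mean_us = math.exp(mean_log)
    return geom_mean_us, SCORE_SCALE / geom_mean_us


# Made-up runtimes, for illustration only.
geom_mean_us, score = trimul_score([1000.0, 4000.0])
print(f"geom_mean_us={geom_mean_us:.1f} score={score:.3f}")
```

With these made-up numbers the geometric mean is about 2000 µs, so the score is about 1.5. Halving every runtime would double the score.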
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/trimul/initial_program.py \
31
+ benchmarks/gpu_mode/trimul/evaluator.py \
32
+ -c benchmarks/gpu_mode/trimul/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
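The two environment variables in the command above select the Modal path and the GPU type. A minimal sketch of how such variables could be read is shown below. The helper name `modal_settings` and the set of accepted truthy strings are assumptions for illustration; the actual parsing in `shared_eval.py` may differ.

```python
import os


def modal_settings(env=None):
    """Return (use_modal, gpu) from a mapping of environment variables.

    Hypothetical helper: the accepted truthy spellings are an assumption,
    not necessarily what the evaluator accepts.
    """
    env = os.environ if env is None else env
    raw = env.get("GPUMODE_USE_MODAL", "")
    use_modal = raw.strip().lower() in {"1", "true", "yes"}
    # No default GPU is assumed here; None means the variable was not set.
    gpu = env.get("GPUMODE_MODAL_GPU")
    return use_modal, gpu


use_modal, gpu = modal_settings()
print(f"use_modal={use_modal} gpu={gpu}")
```

Passing a plain dict instead of `os.environ` makes it easy to check the parsing without touching the real environment.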
benchmarks/gpu_mode/trimul/config.yaml ADDED
@@ -0,0 +1,219 @@
1
+ # GPU Mode: Triangle Multiplicative Update (TriMul) Triton Kernel
2
+
3
+ max_iterations: 100
4
+ checkpoint_interval: 1
5
+ log_level: "INFO"
6
+
7
+ llm:
8
+ models:
9
+ - name: "gpt-5"
10
+ weight: 1.0
11
+ api_base: https://api.openai.com/v1
12
+ temperature: 1.0
13
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject both temperature and top_p
14
+ max_tokens: 32000
15
+ timeout: 600
16
+
17
+ prompt:
18
+ system_message: |
19
+ You are an expert Triton engineer tasked with translating PyTorch code into highly optimized Triton kernel code.
20
+
21
+ You will be implementing a Triangle Multiplicative Update (TriMul) module that is a core operation
22
+ for AlphaFold3, Chai, Protenix, and other protein structure prediction models in BioML.
23
+
24
+ The TriMul operator operates over a 4D tensor of shape [B, N, N, C].
25
+
26
+ Your task:
27
+ - Implement the "outgoing" version of the TriMul operator from the AlphaFold3 paper.
28
+ - You will not have to compute or store gradients for this version. You will only need to implement the forward pass.
29
+
30
+ Your function should be defined as 'custom_kernel' with the following signature:
31
+ Input:
32
+ - `data`: Tuple of (input: torch.Tensor, mask: torch.Tensor, weights: Dict[str, torch.Tensor], config: Dict)
33
+ - input: Input tensor of shape [bs, seq_len, seq_len, dim]
34
+ - mask: Mask tensor of shape [bs, seq_len, seq_len]
35
+ - weights: Dictionary containing model weights
36
+ - config: Dictionary containing model configuration parameters
37
+
38
+ Output:
39
+ - output: Processed tensor [bs, seq_len, seq_len, dim]
40
+
41
+ **Problem Constraints:**
42
+ - B ∈ {1,2}, N ∈ {128,256,512,1024}, c ∈ {128}, c_z ∈ {128,384,768}
43
+ - The input distribution will be sampled from a standard Normal distribution, or a heavy-tailed Cauchy distribution (gamma = 2).
44
+ - There will either be no mask, or a randomly sampled mask over the inputs.
45
+
46
+ **Remarks.** So why is this problem so annoying? Because you have to choose whether to load / deal with either the channel dimensions c,c_z that the LayerNorms require (otherwise you have to do a synchronize to compute the statistics like mean / variance) or the sequence dimension N.
47
+ The sequence dimension is particularly annoying because it's quite large, but also because we compute pair-wise operations at the last operation that sum over another sequence dimension (this is N^3!).
48
+ However, I really like this kernel because it only consists of "simple" operations, and is really easy to understand. It is a true test of "fusions" that torch.compile() doesn't do that well.
49
+
50
+ Here is a pytorch implementation of the TriMul module. You will want to implement a kernel for the operations in the forward call:
51
+
52
+ ```python
53
+ import torch
54
+ from torch import nn, einsum
55
+ import math
56
+
57
+ # Reference code in PyTorch
58
+ class TriMul(nn.Module):
59
+ def __init__(
60
+ self,
61
+ dim: int,
62
+ hidden_dim: int,
63
+ ):
64
+ super().__init__()
65
+
66
+ self.norm = nn.LayerNorm(dim)
67
+
68
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False)
69
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False)
70
+
71
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False)
72
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False)
73
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False)
74
+
75
+ self.to_out_norm = nn.LayerNorm(hidden_dim)
76
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False)
77
+
78
+ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
79
+ """
80
+ x: [bs, seq_len, seq_len, dim]
81
+ mask: [bs, seq_len, seq_len]
82
+
83
+ Returns:
84
+ output: [bs, seq_len, seq_len, dim]
85
+ """
86
+ batch_size, seq_len, _, dim = x.shape
87
+
88
+ x = self.norm(x)
89
+
90
+ left = self.left_proj(x)
91
+ right = self.right_proj(x)
92
+
93
+ mask = mask.unsqueeze(-1)
94
+ left = left * mask
95
+ right = right * mask
96
+
97
+ left_gate = self.left_gate(x).sigmoid()
98
+ right_gate = self.right_gate(x).sigmoid()
99
+ out_gate = self.out_gate(x).sigmoid()
100
+
101
+ left = left * left_gate
102
+ right = right * right_gate
103
+
104
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
105
+ # This einsum is the same as the following:
106
+ # out = torch.zeros(batch_size, seq_len, seq_len, left.shape[-1], device=x.device)
107
+
108
+ # # Compute using nested loops
109
+ # for b in range(batch_size):
110
+ # for i in range(seq_len):
111
+ # for j in range(seq_len):
112
+ # # Compute each output element
113
+ # for k in range(seq_len):
114
+ # out[b, i, j] += left[b, i, k, :] * right[b, j, k, :]
115
+
116
+ out = self.to_out_norm(out)
117
+ out = out * out_gate
118
+ return self.to_out(out)
119
+ ```
120
+
121
+ Here is some example skeleton code of the entrypoint function you will create:
122
+ ```python
123
+ def custom_kernel(data):
124
+ input_tensor, mask, weights, config = data
125
+ dim, hidden_dim = config["dim"], config["hidden_dim"]
126
+
127
+ # Access the given weights of the model
128
+ norm_weight = weights["norm.weight"]
129
+ norm_bias = weights["norm.bias"]
130
+ left_proj_weight = weights["left_proj.weight"]
131
+ right_proj_weight = weights["right_proj.weight"]
132
+ left_gate_weight = weights["left_gate.weight"]
133
+ right_gate_weight = weights["right_gate.weight"]
134
+ out_gate_weight = weights["out_gate.weight"]
135
+ to_out_norm_weight = weights["to_out_norm.weight"]
136
+ to_out_norm_bias = weights["to_out_norm.bias"]
137
+ to_out_weight = weights["to_out.weight"]
138
+
139
+ # Perform TriMul
140
+
141
+ return out
142
+ ```
143
+
144
+ To help you understand which triton version we are using, here is some example triton code for an unrelated task:
145
+ ```python
146
+ import triton
147
+ import triton.language as tl
148
+
149
+ @triton.jit
150
+ def matmul_persistent_ws_kernel(
151
+ a_ptr, b_ptr, c_ptr, M, N, K,
152
+ stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
153
+ BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
154
+ ):
155
+ pid = tl.program_id(axis=0) # async_task 0, 1, 2
156
+ num_pid_m = tl.cdiv(M, BLOCK_M) # async_task 0, 1, 2
157
+ num_pid_n = tl.cdiv(N, BLOCK_N) # async_task 0, 1, 2
158
+ pid_m = pid // num_pid_n # async_task 0, 1, 2
159
+ pid_n = pid % num_pid_n # async_task 0, 1, 2
160
+ offs_m_1 = pid_m * BLOCK_M + tl.arange(0, BLOCK_M // 2) # async_task 0, 1, 2
161
+ offs_m_2 = pid_m * BLOCK_M + tl.arange(BLOCK_M // 2, BLOCK_M) # async_task 0, 1, 2
162
+ offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) # async_task 0, 1, 2
163
+ offs_k = tl.arange(0, BLOCK_K) # async_task 0
164
+ a_ptrs_1 = a_ptr + (offs_m_1[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
165
+ a_ptrs_2 = a_ptr + (offs_m_2[:, None] * stride_am + offs_k[None, :] * stride_ak) # async_task 0
166
+ b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) # async_task 0
167
+ acc_1 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 1
168
+ acc_2 = tl.zeros((BLOCK_M // 2, BLOCK_N), dtype=tl.float32) # async_task 2
169
+ for k in range(0, tl.cdiv(K, BLOCK_K)): # async_task 0, 1, 2
170
+ a_1 = tl.load(a_ptrs_1) # async_task 0
171
+ a_2 = tl.load(a_ptrs_2) # async_task 0
172
+ b = tl.load(b_ptrs) # async_task 0
173
+ acc_1 += tl.dot(a_1, b) # async_task 1
174
+ acc_2 += tl.dot(a_2, b) # async_task 2
175
+ a_ptrs_1 += BLOCK_K * stride_ak # async_task 0
176
+ a_ptrs_2 += BLOCK_K * stride_ak # async_task 0
177
+ b_ptrs += BLOCK_K * stride_bk # async_task 0
178
+ c_1 = acc_1.to(tl.float16) # async_task 1
179
+ c_2 = acc_2.to(tl.float16) # async_task 2
180
+ c_ptrs_1 = c_ptr + stride_cm * offs_m_1[:, None] + stride_cn * offs_n[None, :] # async_task 1
181
+ c_ptrs_2 = c_ptr + stride_cm * offs_m_2[:, None] + stride_cn * offs_n[None, :] # async_task 2
182
+ tl.store(c_ptrs_1, c_1) # async_task 1
183
+ tl.store(c_ptrs_2, c_2) # async_task 2
184
+ ```
185
+
186
+ A few general triton tips:
187
+ - tl.arange only takes in constexpr arguments (static or tl.constexpr)
188
+ - You cannot use continue in your kernel code
189
+ - tl.dot can only take in two input tensors
190
+ - There is no tl.mean
191
+
192
+ Here are the different configs that your kernel will be tested on ("nomask" sets whether there will be no mask, or a randomly sampled mask over the inputs):
193
+
194
+ Test Cases for correctness and runtime (optimize runtime for these):
195
+ - {"seqlen": 256, "bs": 2, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
196
+ - {"seqlen": 768, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "cauchy"}
197
+ - {"seqlen": 256, "bs": 2, "dim": 384, "hidden_dim": 128, "nomask": False, "distribution": "normal"}
198
+ - {"seqlen": 512, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
199
+ - {"seqlen": 1024, "bs": 1, "dim": 128, "hidden_dim": 128, "nomask": True, "distribution": "cauchy"}
200
+ - {"seqlen": 768, "bs": 1, "dim": 384, "hidden_dim": 128, "nomask": False, "distribution": "normal"}
201
+ - {"seqlen": 1024, "bs": 1, "dim": 384, "hidden_dim": 128, "nomask": True, "distribution": "normal"}
202
+
203
+ Rules:
204
+ - The tensor arguments passed in will already be on your CUDA device.
205
+ - Define all of your code in one final ```python ``` block.
206
+ - We will test the correctness of your kernel on multiple input shapes, so make sure it supports different potential test cases.
207
+ - You are allowed to use mixed precision computations, but make sure your final output is in float32.
208
+ - You must use Triton 3.3.1, and these kernels will be run on an H100.
209
+ - You do not have to implement everything in triton, you may choose to have some of the operations done in pytorch. However, you must implement at least part of the operations in a kernel.
210
+ - Include a short docstring at the top summarizing your algorithm.
211
+ evaluator:
212
+ timeout: 600
213
+ max_retries: 3
214
+ cascade_evaluation: true
215
+ cascade_thresholds: [0.4, 0.3]
216
+
217
+ diff_based_generation: true
218
+ max_solution_length: 60000
219
+ random_seed: 42
benchmarks/gpu_mode/trimul/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for TriMul — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/trimul/reference.py ADDED
@@ -0,0 +1,286 @@
1
+ """
2
+ Reference implementation for Triangle Multiplicative Update (TriMul) Triton kernel.
3
+ Core operation for AlphaFold3, Chai, Protenix protein structure models.
4
+ Same test cases, benchmarks, generate_input, ref_kernel, and check_implementation.
5
+ """
6
+
7
+ import math
8
+ import torch
9
+ from torch import nn, einsum
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Scoring and benchmark configuration (read by shared_eval.py)
13
+ # ---------------------------------------------------------------------------
14
+
15
+ SCORE_SCALE = 3000.0
16
+
17
+ # trimul uses CUDA events timing, 0.1% rel error, 120s wall clock timeout
18
+ BENCH_USE_CUDA_EVENTS = True
19
+ BENCH_REL_ERROR = 0.001
20
+ BENCH_WALL_TIMEOUT_NS = 120e9
21
+ BENCH_NO_GRAD = False
22
+ BENCH_MAX_REPEATS = 100
23
+ BENCH_MAX_TIME_NS = 10e9
24
+ BENCH_WARMUP_STYLE = 'tiny_benchmark'
25
+
26
+ # ---------------------------------------------------------------------------
27
+ # Test / benchmark cases — full set from discover task.yml
28
+ # ---------------------------------------------------------------------------
29
+
30
+ TEST_CASES = [
31
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 9371, "nomask": True, "distribution": "normal"},
32
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 1092, "nomask": False, "distribution": "normal"},
33
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 2291, "nomask": True, "distribution": "normal"},
34
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 210284, "nomask": False, "distribution": "normal"},
35
+ {"seqlen": 128, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 81934, "nomask": True, "distribution": "normal"},
36
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 1932, "nomask": True, "distribution": "normal"},
37
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 10432, "nomask": False, "distribution": "normal"},
38
+ {"seqlen": 768, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 731, "nomask": True, "distribution": "normal"},
39
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 53121, "nomask": False, "distribution": "normal"},
40
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 31, "nomask": True, "distribution": "normal"},
41
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 4921, "nomask": False, "distribution": "normal"},
42
+ {"seqlen": 32, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 937321, "nomask": True, "distribution": "cauchy"},
43
+ {"seqlen": 64, "bs": 2, "dim": 256, "hiddendim": 128, "seed": 2291, "nomask": True, "distribution": "cauchy"},
44
+ {"seqlen": 128, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 8134, "nomask": True, "distribution": "cauchy"},
45
+ {"seqlen": 256, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 932, "nomask": True, "distribution": "cauchy"},
46
+ {"seqlen": 768, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 31, "nomask": True, "distribution": "cauchy"},
47
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 5321, "nomask": False, "distribution": "cauchy"},
48
+ {"seqlen": 1024, "bs": 1, "dim": 768, "hiddendim": 128, "seed": 491, "nomask": False, "distribution": "cauchy"},
49
+ ]
50
+
51
+ BENCHMARK_CASES = [
52
+ {"seqlen": 256, "bs": 2, "dim": 128, "hiddendim": 128, "seed": 9371, "nomask": True, "distribution": "normal"},
53
+ {"seqlen": 768, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 381, "nomask": True, "distribution": "cauchy"},
54
+ {"seqlen": 256, "bs": 2, "dim": 384, "hiddendim": 128, "seed": 2301, "nomask": False, "distribution": "normal"},
55
+ {"seqlen": 512, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 12819, "nomask": True, "distribution": "normal"},
56
+ {"seqlen": 1024, "bs": 1, "dim": 128, "hiddendim": 128, "seed": 381, "nomask": True, "distribution": "cauchy"},
57
+ {"seqlen": 768, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 481, "nomask": False, "distribution": "normal"},
58
+ {"seqlen": 1024, "bs": 1, "dim": 384, "hiddendim": 128, "seed": 23291, "nomask": True, "distribution": "normal"},
59
+ ]
60
+
61
+ # ---------------------------------------------------------------------------
62
+ # Reference kernel
63
+ # ---------------------------------------------------------------------------
64
+
65
+
66
+ class _TriMul(nn.Module):
67
+ def __init__(self, dim, hidden_dim, device="cuda"):
68
+ super().__init__()
69
+ self.norm = nn.LayerNorm(dim, device=device)
70
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
71
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
72
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
73
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
74
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
75
+ self.to_out_norm = nn.LayerNorm(hidden_dim, device=device)
76
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False, device=device)
77
+
78
+ def forward(self, x, mask):
79
+ x = self.norm(x)
80
+ left = self.left_proj(x)
81
+ right = self.right_proj(x)
82
+ mask = mask.unsqueeze(-1)
83
+ left = left * mask
84
+ right = right * mask
85
+ left = left * self.left_gate(x).sigmoid()
86
+ right = right * self.right_gate(x).sigmoid()
87
+ out_gate = self.out_gate(x).sigmoid()
88
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
89
+ out = self.to_out_norm(out)
90
+ out = out * out_gate
91
+ return self.to_out(out)
92
+
93
+
94
+ def ref_kernel(data):
95
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
96
+ old_cudnn = torch.backends.cudnn.allow_tf32
97
+ torch.backends.cuda.matmul.allow_tf32 = False
98
+ torch.backends.cudnn.allow_tf32 = False
99
+ try:
100
+ input_tensor, mask, weights, config = data
101
+ trimul = _TriMul(dim=config["dim"], hidden_dim=config["hidden_dim"],
102
+ device=input_tensor.device)
103
+ trimul.norm.weight = nn.Parameter(weights['norm.weight'])
104
+ trimul.norm.bias = nn.Parameter(weights['norm.bias'])
105
+ trimul.left_proj.weight = nn.Parameter(weights['left_proj.weight'])
106
+ trimul.right_proj.weight = nn.Parameter(weights['right_proj.weight'])
107
+ trimul.left_gate.weight = nn.Parameter(weights['left_gate.weight'])
108
+ trimul.right_gate.weight = nn.Parameter(weights['right_gate.weight'])
109
+ trimul.out_gate.weight = nn.Parameter(weights['out_gate.weight'])
110
+ trimul.to_out_norm.weight = nn.Parameter(weights['to_out_norm.weight'])
111
+ trimul.to_out_norm.bias = nn.Parameter(weights['to_out_norm.bias'])
112
+ trimul.to_out.weight = nn.Parameter(weights['to_out.weight'])
113
+ return trimul(input_tensor, mask)
114
+ finally:
115
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
116
+ torch.backends.cudnn.allow_tf32 = old_cudnn
117
+
118
+
119
+ def generate_input(seqlen, bs, dim, hiddendim, seed, nomask, distribution="normal"):
120
+ hidden_dim = hiddendim
121
+ config = {"hidden_dim": hidden_dim, "dim": dim}
122
+ gen = torch.Generator(device='cuda')
123
+ gen.manual_seed(seed)
124
+
125
+ if distribution == "cauchy":
126
+ u = torch.empty((bs, seqlen, seqlen, dim), device="cuda", dtype=torch.float32)
127
+ u.uniform_(0.0, 1.0, generator=gen)
128
+ input_tensor = 2.0 * torch.tan(math.pi * (u - 0.5))
129
+ else:
130
+ input_tensor = torch.randn(
131
+ (bs, seqlen, seqlen, dim), device='cuda', dtype=torch.float32, generator=gen
132
+ ).contiguous()
133
+
134
+ if nomask:
135
+ mask = torch.ones(bs, seqlen, seqlen, device="cuda")
136
+ else:
137
+ mask = torch.randint(0, 2, (bs, seqlen, seqlen), device="cuda", generator=gen).float()
138
+
139
+ weights = {
140
+ "norm.weight": torch.randn(dim, device="cuda"),
141
+ "norm.bias": torch.randn(dim, device="cuda"),
142
+ "left_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
143
+ "right_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
144
+ "left_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
145
+ "right_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
146
+ "out_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
147
+ "to_out_norm.weight": torch.randn(hidden_dim, device="cuda"),
148
+ "to_out_norm.bias": torch.randn(hidden_dim, device="cuda"),
149
+ "to_out.weight": torch.randn(dim, hidden_dim, device="cuda") / math.sqrt(dim),
150
+ }
151
+ return (input_tensor, mask, weights, config)
152
+
153
+
154
+ def check_implementation(data, submission_output, rtol=2e-2, atol=2e-2):
155
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
156
+ old_cudnn = torch.backends.cudnn.allow_tf32
157
+ torch.backends.cuda.matmul.allow_tf32 = False
158
+ torch.backends.cudnn.allow_tf32 = False
159
+ try:
160
+ ref_output = ref_kernel(data)
161
+ if ref_output.shape != submission_output.shape:
162
+ return False, f"Shape mismatch: {ref_output.shape} vs {submission_output.shape}"
163
+ if torch.allclose(ref_output.float(), submission_output.float(), rtol=rtol, atol=atol):
164
+ return True, "Match"
165
+ diff = torch.abs(ref_output.float() - submission_output.float())
166
+ return False, f"max_diff={diff.max().item():.6f}, avg_diff={diff.mean().item():.6f}"
167
+ finally:
168
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
169
+ torch.backends.cudnn.allow_tf32 = old_cudnn
170
+
171
+
172
+ # ---------------------------------------------------------------------------
173
+ # Self-contained reference code for Modal remote execution
174
+ # ---------------------------------------------------------------------------
175
+
176
+ MODAL_REFERENCE_CODE = r'''
177
+ import math
178
+ import torch
179
+ from torch import nn, einsum
180
+
181
+
182
+ class _TriMul(nn.Module):
183
+ def __init__(self, dim, hidden_dim, device="cuda"):
184
+ super().__init__()
185
+ self.norm = nn.LayerNorm(dim, device=device)
186
+ self.left_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
187
+ self.right_proj = nn.Linear(dim, hidden_dim, bias=False, device=device)
188
+ self.left_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
189
+ self.right_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
190
+ self.out_gate = nn.Linear(dim, hidden_dim, bias=False, device=device)
191
+ self.to_out_norm = nn.LayerNorm(hidden_dim, device=device)
192
+ self.to_out = nn.Linear(hidden_dim, dim, bias=False, device=device)
193
+
194
+ def forward(self, x, mask):
195
+ x = self.norm(x)
196
+ left = self.left_proj(x)
197
+ right = self.right_proj(x)
198
+ mask = mask.unsqueeze(-1)
199
+ left = left * mask
200
+ right = right * mask
201
+ left = left * self.left_gate(x).sigmoid()
202
+ right = right * self.right_gate(x).sigmoid()
203
+ out_gate = self.out_gate(x).sigmoid()
204
+ out = einsum('... i k d, ... j k d -> ... i j d', left, right)
205
+ out = self.to_out_norm(out)
206
+ out = out * out_gate
207
+ return self.to_out(out)
208
+
209
+
210
+ def ref_kernel(data):
211
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
212
+ old_cudnn = torch.backends.cudnn.allow_tf32
213
+ torch.backends.cuda.matmul.allow_tf32 = False
214
+ torch.backends.cudnn.allow_tf32 = False
215
+ try:
216
+ input_tensor, mask, weights, config = data
217
+ trimul = _TriMul(dim=config["dim"], hidden_dim=config["hidden_dim"],
218
+ device=input_tensor.device)
219
+ trimul.norm.weight = nn.Parameter(weights['norm.weight'])
220
+ trimul.norm.bias = nn.Parameter(weights['norm.bias'])
221
+ trimul.left_proj.weight = nn.Parameter(weights['left_proj.weight'])
222
+ trimul.right_proj.weight = nn.Parameter(weights['right_proj.weight'])
223
+ trimul.left_gate.weight = nn.Parameter(weights['left_gate.weight'])
224
+ trimul.right_gate.weight = nn.Parameter(weights['right_gate.weight'])
225
+ trimul.out_gate.weight = nn.Parameter(weights['out_gate.weight'])
226
+ trimul.to_out_norm.weight = nn.Parameter(weights['to_out_norm.weight'])
227
+ trimul.to_out_norm.bias = nn.Parameter(weights['to_out_norm.bias'])
228
+ trimul.to_out.weight = nn.Parameter(weights['to_out.weight'])
229
+ return trimul(input_tensor, mask)
230
+ finally:
231
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
232
+ torch.backends.cudnn.allow_tf32 = old_cudnn
233
+
234
+
235
+ def generate_input(seqlen, bs, dim, hiddendim, seed, nomask, distribution="normal"):
236
+ hidden_dim = hiddendim
237
+ config = {"hidden_dim": hidden_dim, "dim": dim}
238
+ gen = torch.Generator(device='cuda')
239
+ gen.manual_seed(seed)
240
+
241
+ if distribution == "cauchy":
242
+ u = torch.empty((bs, seqlen, seqlen, dim), device="cuda", dtype=torch.float32)
243
+ u.uniform_(0.0, 1.0, generator=gen)
244
+ input_tensor = 2.0 * torch.tan(math.pi * (u - 0.5))
245
+ else:
246
+ input_tensor = torch.randn(
247
+ (bs, seqlen, seqlen, dim), device='cuda', dtype=torch.float32, generator=gen
248
+ ).contiguous()
249
+
250
+ if nomask:
251
+ mask = torch.ones(bs, seqlen, seqlen, device="cuda")
252
+ else:
253
+ mask = torch.randint(0, 2, (bs, seqlen, seqlen), device="cuda", generator=gen).float()
254
+
255
+ weights = {
256
+ "norm.weight": torch.randn(dim, device="cuda"),
257
+ "norm.bias": torch.randn(dim, device="cuda"),
258
+ "left_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
259
+ "right_proj.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
260
+ "left_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
261
+ "right_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
262
+ "out_gate.weight": torch.randn(hidden_dim, dim, device="cuda") / math.sqrt(hidden_dim),
263
+ "to_out_norm.weight": torch.randn(hidden_dim, device="cuda"),
264
+ "to_out_norm.bias": torch.randn(hidden_dim, device="cuda"),
265
+ "to_out.weight": torch.randn(dim, hidden_dim, device="cuda") / math.sqrt(dim),
266
+ }
267
+ return (input_tensor, mask, weights, config)
268
+
269
+
270
+ def check_implementation(data, submission_output, rtol=2e-2, atol=2e-2):
271
+ old_matmul = torch.backends.cuda.matmul.allow_tf32
272
+ old_cudnn = torch.backends.cudnn.allow_tf32
273
+ torch.backends.cuda.matmul.allow_tf32 = False
274
+ torch.backends.cudnn.allow_tf32 = False
275
+ try:
276
+ ref_output = ref_kernel(data)
277
+ if ref_output.shape != submission_output.shape:
278
+ return False, f"Shape mismatch: {ref_output.shape} vs {submission_output.shape}"
279
+ if torch.allclose(ref_output.float(), submission_output.float(), rtol=rtol, atol=atol):
280
+ return True, "Match"
281
+ diff = torch.abs(ref_output.float() - submission_output.float())
282
+ return False, f"max_diff={diff.max().item():.6f}, avg_diff={diff.mean().item():.6f}"
283
+ finally:
284
+ torch.backends.cuda.matmul.allow_tf32 = old_matmul
285
+ torch.backends.cudnn.allow_tf32 = old_cudnn
286
+ '''
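The heavy-tailed branch of `generate_input` above draws `u` uniformly and maps it through `2.0 * torch.tan(math.pi * (u - 0.5))`, which is the inverse CDF of a Cauchy distribution with scale 2. Below is a minimal CPU-only sketch of that transform using plain `math` instead of `torch`; the name `cauchy_from_uniform` is illustrative and not part of the benchmark.

```python
import math


def cauchy_from_uniform(u: float, scale: float = 2.0) -> float:
    """Map a uniform sample u in (0, 1) to a Cauchy sample via the inverse CDF."""
    return scale * math.tan(math.pi * (u - 0.5))


# The median (u = 0.5) maps to 0; the quartiles map to -scale and +scale.
samples = [cauchy_from_uniform(u) for u in (0.25, 0.5, 0.75)]
```

Values of `u` close to 0 or 1 produce very large magnitudes, which is presumably why this branch is used to stress numerical robustness.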
benchmarks/gpu_mode/trimul/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/gpu_mode/vecadd/README.md ADDED
@@ -0,0 +1,34 @@
1
+ # GPU Mode: Float16 Vector Addition
2
+
3
+ Evolve a Triton kernel for float16 vector addition using SkyDiscover.
4
+
5
+ **Operation:** `C = A + B` (element-wise, float16)
6
+
7
+ ## Quick Start
8
+
9
+ From the repo root:
10
+
11
+ ```bash
12
+ uv run skydiscover-run \
13
+ benchmarks/gpu_mode/vecadd/initial_program.py \
14
+ benchmarks/gpu_mode/vecadd/evaluator.py \
15
+ -c benchmarks/gpu_mode/vecadd/config.yaml \
16
+ -s [your_algorithm] -i 50
17
+ ```
18
+
19
+ ## Scoring
20
+
21
+ - **Correctness weight:** 0.3 (must return float16, rtol/atol=1e-3)
22
+ - **Speedup weight:** 1.0 (geometric mean vs PyTorch reference, capped at 10x)
23
+ - **Combined:** `0.3 * correctness + speedup`
24
+
25
+ ## Modal Cloud GPU Support
26
+
27
+ ```bash
28
+ GPUMODE_USE_MODAL=true GPUMODE_MODAL_GPU=H100 \
29
+ uv run skydiscover-run \
30
+ benchmarks/gpu_mode/vecadd/initial_program.py \
31
+ benchmarks/gpu_mode/vecadd/evaluator.py \
32
+ -c benchmarks/gpu_mode/vecadd/config.yaml \
33
+ -s [your_algorithm] -i 50
34
+ ```
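The Scoring section of the vecadd README above can be sketched as a small function. This is a hedged illustration only: the real logic lives in `shared_eval.py`, which is not shown in this part of the diff, so the function name `combined_score`, the treatment of incorrect kernels, and the point at which the 10x cap is applied are all assumptions.

```python
import math
from typing import Sequence

# Constants mirror the reward parameters in vecadd/reference.py.
CORRECTNESS_WEIGHT = 0.3
SPEED_WEIGHT = 1.0
SPEED_MAX_REWARD = 10.0


def combined_score(correct: bool, speedups: Sequence[float]) -> float:
    """Hypothetical scoring: weighted correctness plus capped geometric-mean speedup."""
    if not correct or not speedups:
        return 0.0  # assumption: an incorrect kernel earns no reward
    geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    # Assumption: the cap applies to the geometric mean, not to each case.
    return CORRECTNESS_WEIGHT * 1.0 + SPEED_WEIGHT * min(geo_mean, SPEED_MAX_REWARD)
```

Under these assumptions, a correct kernel with per-case speedups of 2x and 8x has a geometric mean of 4x and scores about 4.3.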
benchmarks/gpu_mode/vecadd/config.yaml ADDED
@@ -0,0 +1,50 @@
1
+ # GPU Mode: Float16 Vector Addition
2
+ # Usage: uv run skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy> -i 50
3
+
4
+ max_iterations: 100
5
+ checkpoint_interval: 10
6
+ log_level: INFO
7
+
8
+ llm:
9
+ models:
10
+ - name: "gpt-5"
11
+ weight: 1.0
12
+ api_base: https://api.openai.com/v1
13
+ temperature: 0.7
14
+ # top_p: 0.95 # omitted by default; some providers (e.g. Anthropic) reject requests that set both temperature and top_p
15
+ max_tokens: 32000
16
+ timeout: 600
17
+
18
+ prompt:
19
+ system_message: |
20
+ You are an expert Triton kernel engineer. Output ONLY Python code - no explanations.
21
+
22
+ REQUIRED OUTPUT STRUCTURE:
23
+ 1. Imports: torch, triton, triton.language as tl
24
+ 2. @triton.jit kernel function(s)
25
+ 3. def custom_kernel(data) wrapper - REQUIRED entry point
26
+
27
+ Task: Optimize float16 vector addition kernel. C = A + B
28
+ Input: Tuple of (A, B) tensors of shape (N, N) and dtype torch.float16
29
+ Output: Tensor of shape (N, N) and dtype torch.float16
30
+ N can be: 256, 512, 1024, 2048, 4096, 8192
31
+
32
+ Optimization tips:
33
+ - Block size tuning (512, 1024, 2048, 4096)
34
+ - Use @triton.autotune for automatic parameter tuning
35
+ - Vectorized loads for memory operations
36
+ - Grid configuration for occupancy
37
+ - Memory coalescing for sequential access patterns
38
+
39
+ MUST use @triton.jit decorator. MUST return float16 tensor.
40
+ Output complete, working code in a single ```python``` block.
41
+
42
+ evaluator:
43
+ timeout: 600
44
+ max_retries: 3
45
+ cascade_evaluation: true
46
+ cascade_thresholds: [0.4, 0.3]
47
+
48
+ diff_based_generation: true
49
+ max_solution_length: 60000
50
+ random_seed: 42
benchmarks/gpu_mode/vecadd/evaluator.py ADDED
@@ -0,0 +1,13 @@
1
+ """Evaluator for float16 Vector Addition — delegates to shared evaluator."""
2
+ import os
3
+ import sys
4
+
5
+ _problem_dir = os.path.dirname(os.path.abspath(__file__))
6
+ _parent_dir = os.path.dirname(_problem_dir)
7
+
8
+ if _problem_dir not in sys.path:
9
+ sys.path.insert(0, _problem_dir)
10
+ if _parent_dir not in sys.path:
11
+ sys.path.insert(0, _parent_dir)
12
+
13
+ from shared_eval import evaluate, evaluate_stage1, evaluate_stage2
benchmarks/gpu_mode/vecadd/initial_program.py ADDED
@@ -0,0 +1,39 @@
1
+ # EVOLVE-BLOCK-START
2
+ """
3
+ Initial float16 vector addition with Triton kernel.
4
+ """
5
+
6
+ import torch
7
+ import triton
8
+ import triton.language as tl
9
+
10
+
11
+ @triton.jit
12
+ def vecadd_kernel(
13
+ a_ptr, b_ptr, c_ptr,
14
+ n_elements,
15
+ BLOCK_SIZE: tl.constexpr,
16
+ ):
17
+ pid = tl.program_id(0)
18
+ block_start = pid * BLOCK_SIZE
19
+ offsets = block_start + tl.arange(0, BLOCK_SIZE)
20
+ mask = offsets < n_elements
21
+
22
+ a = tl.load(a_ptr + offsets, mask=mask)
23
+ b = tl.load(b_ptr + offsets, mask=mask)
24
+ c = a + b
25
+
26
+ tl.store(c_ptr + offsets, c, mask=mask)
27
+
28
+
29
+ def custom_kernel(data):
30
+ a, b = data
31
+ a = a.contiguous()
32
+ b = b.contiguous()
33
+ c = torch.empty_like(a)
34
+ n_elements = a.numel()
35
+ BLOCK_SIZE = 1024
36
+ grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
37
+ vecadd_kernel[grid](a, b, c, n_elements, BLOCK_SIZE=BLOCK_SIZE)
38
+ return c
39
+ # EVOLVE-BLOCK-END
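The launch arithmetic in `custom_kernel` above (a grid of `triton.cdiv(n_elements, BLOCK_SIZE)` programs, each masking `offsets < n_elements`) can be checked on the CPU without Triton. The sketch below is a plain-Python model of that arithmetic; `cdiv` and `active_lanes` are illustrative helpers, not Triton APIs.

```python
def cdiv(n: int, block: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (n + block - 1) // block


def active_lanes(pid: int, n_elements: int, block_size: int) -> int:
    """Count the offsets in program `pid` that pass the `offsets < n_elements` mask."""
    block_start = pid * block_size
    return sum(1 for i in range(block_size) if block_start + i < n_elements)


# A size that is not a multiple of the block size leaves a partial final block.
n_elements, block_size = 2500, 1024
grid = cdiv(n_elements, block_size)
lanes = [active_lanes(pid, n_elements, block_size) for pid in range(grid)]
```

Here the last program handles only the tail elements, which is why the kernel passes `mask=mask` to both the loads and the store.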
benchmarks/gpu_mode/vecadd/reference.py ADDED
@@ -0,0 +1,96 @@
1
+ """
2
+ Reference implementation for float16 vector addition Triton kernel.
3
+ C = A + B
4
+ """
5
+
6
+ import math
7
+ try:
8
+ import torch
9
+ except ImportError:
10
+ torch = None # Modal-only mode — functions below won't be called locally
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Reward parameters
14
+ # ---------------------------------------------------------------------------
15
+
16
+ CORRECTNESS_WEIGHT = 0.3
17
+ SPEED_WEIGHT = 1.0
18
+ SPEED_MAX_REWARD = 10.0
19
+
20
+ # ---------------------------------------------------------------------------
21
+ # Test / benchmark cases
22
+ # ---------------------------------------------------------------------------
23
+
24
+ TEST_CASES = [
25
+ {"N": 256, "seed": 42},
26
+ {"N": 512, "seed": 123},
27
+ {"N": 1024, "seed": 456},
28
+ {"N": 2048, "seed": 789},
29
+ ]
30
+
31
+ BENCHMARK_CASES = [
32
+ {"N": 1024, "seed": 1001},
33
+ {"N": 2048, "seed": 1002},
34
+ {"N": 4096, "seed": 1003},
35
+ {"N": 8192, "seed": 1004},
36
+ ]
37
+
38
+ # ---------------------------------------------------------------------------
39
+ # Reference kernel
40
+ # ---------------------------------------------------------------------------
41
+
42
+
43
+ def ref_kernel(data):
44
+ a, b = data
45
+ return a + b
46
+
47
+
48
+ def generate_input(N, seed):
49
+ gen = torch.Generator(device="cuda")
50
+ gen.manual_seed(seed)
51
+ a = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
52
+ b = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
53
+ return (a, b)
54
+
55
+
56
+ def check_implementation(data, output, rtol=1e-3, atol=1e-3):
57
+ ref_out = ref_kernel(data)
58
+ if output.shape != ref_out.shape:
59
+ return False, f"Shape mismatch: expected {ref_out.shape}, got {output.shape}"
60
+ if output.dtype != torch.float16:
61
+ return False, f"Dtype mismatch: expected float16, got {output.dtype}"
62
+ if torch.allclose(output, ref_out, rtol=rtol, atol=atol):
63
+ return True, "Match"
64
+ diff = torch.abs(output.float() - ref_out.float())
65
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
66
+
67
+
68
+ # ---------------------------------------------------------------------------
69
+ # Self-contained reference code for Modal execution
70
+ # ---------------------------------------------------------------------------
71
+
72
+ MODAL_REFERENCE_CODE = '''
73
+ import torch
74
+
75
+ def ref_kernel(data):
76
+ a, b = data
77
+ return a + b
78
+
79
+ def generate_input(N, seed):
80
+ gen = torch.Generator(device="cuda")
81
+ gen.manual_seed(seed)
82
+ a = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
83
+ b = torch.randn(N, N, device="cuda", dtype=torch.float16, generator=gen)
84
+ return (a, b)
85
+
86
+ def check_implementation(data, output, rtol=1e-3, atol=1e-3):
87
+ ref_out = ref_kernel(data)
88
+ if output.shape != ref_out.shape:
89
+ return False, f"Shape mismatch: expected {ref_out.shape}, got {output.shape}"
90
+ if output.dtype != torch.float16:
91
+ return False, f"Dtype mismatch: expected float16, got {output.dtype}"
92
+ if torch.allclose(output, ref_out, rtol=rtol, atol=atol):
93
+ return True, "Match"
94
+ diff = torch.abs(output.float() - ref_out.float())
95
+ return False, f"Output mismatch: max_diff={diff.max().item():.6f}"
96
+ '''
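`check_implementation` above relies on `torch.allclose(output, ref_out, rtol=rtol, atol=atol)`, which accepts an element when `|output - ref| <= atol + rtol * |ref|`. The sketch below restates that criterion in plain Python so it can be tried without a GPU; `within_tolerance` is an illustrative name, and it ignores NaN handling and dtype checks.

```python
from typing import Sequence


def within_tolerance(
    out: Sequence[float], ref: Sequence[float], rtol: float = 1e-3, atol: float = 1e-3
) -> bool:
    """Element-wise |out - ref| <= atol + rtol * |ref|, as in torch.allclose."""
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))


small_ok = within_tolerance([1.0005], [1.0])    # allowed error: 0.002
small_bad = within_tolerance([1.01], [1.0])     # error 0.01 exceeds 0.002
large_ok = within_tolerance([100.05], [100.0])  # allowed error: 0.101
large_bad = within_tolerance([100.15], [100.0])
```

Because the relative term scales with the reference magnitude, larger sums tolerate proportionally larger absolute errors.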
benchmarks/gpu_mode/vecadd/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ triton
2
+ torch
benchmarks/kernelbench/README.md ADDED
@@ -0,0 +1,211 @@
1
+ # KernelBench Integration with SkyDiscover
2
+
3
+ GPU kernel optimization tasks using the [KernelBench](https://github.com/ScalingIntelligence/KernelBench) dataset and evaluation protocol.
4
+
5
+ ## Overview
6
+
7
+ The KernelBench integration allows you to run SkyDiscover on any problem from the KernelBench dataset. The framework automatically:
8
+
9
+ 1. Fetches the reference implementation of the target kernel from KernelBench
10
+ 2. Creates an initial_program.py with EVOLVE-BLOCK markers
11
+ 3. Configures the evaluator with problem-specific parameters
12
+ 4. Runs the optimization using either a containerized or native Python evaluator
13
+
14
+ The evaluator uses the KernelBench evaluation infrastructure to measure speedup over PyTorch eager execution.
15
+
16
+ ### Evaluator Modes
17
+
18
+ - **Containerized (Docker)**: Runs evaluation inside a Docker container (default)
19
+ - **Native Python**: Runs evaluation directly as Python code (for clusters without Docker/Podman)
20
+
21
+ ## Directory Structure
22
+
23
+ ```
24
+ benchmarks/kernelbench/
25
+ ├── config.yaml # System prompt + search/evaluator settings
26
+ ├── resolver.py # Benchmark loader (fetches target problems from KernelBench)
27
+ ├── requirements.txt # Resolver dependencies (kernelbench library)
28
+ └── evaluator/ # Self-contained Docker benchmark
29
+ ├── Dockerfile # Container image definition
30
+ ├── evaluate.sh # Entrypoint (receives solution path)
31
+ ├── evaluator.py # Scoring logic using KernelBench
32
+ ├── requirements.txt # Evaluator dependencies (kernelbench[gpu])
33
+ └── wrapper.py # JSON protocol wrapper
34
+ ```
35
+
36
+ **Note:** The `run_and_check.py` script is downloaded directly from the KernelBench repository during Docker build (pinned to commit `423217d` for reproducibility). To update, modify the `KERNELBENCH_COMMIT` build arg in the Dockerfile.
37
+
38
+ ## Installation
39
+
40
+ Before using the KernelBench integration, install the required dependencies:
41
+
42
+ ```bash
43
+ # Install KernelBench library (required for problem fetching)
44
+ uv pip install -r benchmarks/kernelbench/requirements.txt
45
+ ```
46
+
47
+ **Note:** The resolver (problem fetching) only needs the base `kernelbench` package. The containerized evaluator installs `kernelbench[gpu]` for GPU support.
48
+
49
+ ## Quick Start
50
+
51
+ ### Using Docker (Default)
52
+
53
+ Edit `benchmarks/kernelbench/config.yaml` to select a target kernel from the [KernelBench dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench):
54
+
55
+ ```yaml
56
+ benchmark:
57
+ # KernelBench problem specification
58
+ level: 2 # Problem difficulty level (1, 2, 3 or 4)
59
+ problem_id: 5 # Specific problem ID within the level
60
+ ```
61
+
62
+ Then, run optimization on this problem:
63
+
64
+ ```bash
65
+ # algo can be "adaevolve", "evox", "topk", "beam_search", "best_of_n", etc.
66
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
67
+ -c benchmarks/kernelbench/config.yaml \
68
+ --search <algo> \
69
+ --iterations 50
70
+ ```
71
+
72
+ ### Using Native Python (No Docker Required)
73
+
74
+ For clusters without Docker/Podman privileges, you can run the evaluator as native Python code.
75
+
76
+ #### 1. Install Dependencies
77
+
78
+ ```bash
79
+ # Install KernelBench with GPU support
80
+ pip install -r benchmarks/kernelbench/evaluator/requirements.txt
81
+ ```
82
+
83
+ #### 2. Configure Native Mode
84
+
85
+ Edit `benchmarks/kernelbench/config.yaml`:
86
+
87
+ ```yaml
88
+ benchmark:
89
+ enabled: true
90
+ name: kernelbench
91
+ resolver: benchmarks.kernelbench.resolver
92
+
93
+ # Set to false to use native Python evaluator (no Docker)
94
+ use_docker: false
95
+
96
+ level: 2
97
+ problem_id: 11
98
+ # ... rest of config
99
+ ```
100
+
101
+ #### 3. Run Optimization
102
+
103
+ ```bash
104
+ # algo can be "adaevolve", "evox", "topk", "beam_search", "best_of_n", etc.
105
+ uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
106
+ -c benchmarks/kernelbench/config.yaml \
107
+ --search <algo> \
108
+ --iterations 50
109
+ ```
110
+
111
+ **Note:** The `run_and_check.py` script from KernelBench will be automatically downloaded on first run.
112
+
113
+ **Note:** No `initial_program` argument is needed; it is fetched automatically based on the `benchmark` section in `config.yaml`.
114
+
115
+ ## Configuration Reference
116
+
117
+ ### Benchmark Section
118
+
119
+ The `benchmark` section in `config.yaml` controls problem loading:
120
+
121
+ ```yaml
122
+ benchmark:
123
+ enabled: true # Enable benchmark loader
124
+ name: kernelbench # Benchmark name (for logging)
125
+ resolver: benchmarks.kernelbench.resolver # Python module path
126
+
127
+ # Evaluator mode
128
+ use_docker: true # true: containerized (Docker), false: native Python
129
+
130
+ # Problem specification
131
+ level: 1 # Difficulty: 1 (easy), 2 (medium), 3 (hard), 4 (very hard)
132
+ problem_id: 1 # Problem ID within the level
133
+
134
+ # Dataset source
135
+ dataset_src: huggingface # 'huggingface' or 'local'
136
+ dataset_name: ScalingIntelligence/KernelBench # HF dataset name
137
+
138
+ # Evaluation settings
139
+ eval_mode: local # 'local' or 'modal'
140
+ gpu: H100 # GPU type: H100, A100, etc.
141
+ num_correct_trials: 5 # Correctness validation runs
142
+ num_perf_trials: 100 # Performance measurement runs
143
+ ```
144
+
145
+ ### Environment Variables
146
+
147
+ The resolver provides these environment variables to the evaluator:
148
+
149
+ - `KERNELBENCH_LEVEL`: Problem difficulty level (1, 2, 3, or 4)
150
+ - `KERNELBENCH_PROBLEM_ID`: Specific problem within the level
151
+ - `KERNELBENCH_EVAL_MODE`: Evaluation mode (local, modal)
152
+ - `KERNELBENCH_GPU`: GPU type (H100, A100, etc.)
153
+ - `KERNELBENCH_NUM_CORRECT_TRIALS`: Number of correctness validation runs
154
+ - `KERNELBENCH_NUM_PERF_TRIALS`: Number of performance measurement runs
155
+ - `KERNELBENCH_TIMEOUT`: Timeout per evaluation in seconds (the evaluator falls back to 300 if unset)
156
+
157
+ These variables are passed directly to the evaluator (not set globally), ensuring isolation between concurrent runs.
158
+
159
+ ### Evaluation Modes
160
+
161
+ - **local**: Run evaluation on your local machine (requires GPU)
162
+ - **modal**: Run evaluation on Modal's cloud GPUs (requires Modal setup)
163
+
164
+ ### GPU Types
165
+
166
+ The currently supported GPU types are listed in [KernelBench's `run_and_check.py`](https://github.com/ScalingIntelligence/KernelBench/blob/423217d9fda91e0c2d67e4a43bf62f96f6d104f1/scripts/run_and_check.py#L16).
167
+
168
+ ## Metrics
169
+
170
+ The evaluator returns:
171
+
172
+ - **combined_score**: Speedup over PyTorch eager execution (primary metric)
173
+ - **speedup_over_eager**: Same as combined_score
174
+ - **speedup_over_compile**: Speedup over torch.compile()
175
+ - **kernel_time_ms**: Execution time of optimized kernel
176
+ - **ref_eager_time_ms**: Reference eager execution time
177
+
178
+
179
+ ## Traditional Usage (Manual Initial Program)
180
+
181
+ You can still provide an initial program manually if needed:
182
+
183
+ ```bash
184
+ # Run with explicit initial program
185
+ uv run skydiscover-run my_kernel.py benchmarks/kernelbench/evaluator/ \
186
+ -c benchmarks/kernelbench/config.yaml \
187
+ --search <algo>
188
+ ```
189
+
190
+ ## Troubleshooting
191
+
192
+ ### Error: "kernelbench package not found"
193
+
194
+ Install KernelBench:
195
+ ```bash
196
+ pip install "kernelbench[gpu] @ git+https://github.com/ScalingIntelligence/KernelBench.git"
197
+ ```
198
+
199
+ ### Error: "Failed to resolve benchmark problem"
200
+
201
+ Check that:
202
+ 1. `benchmark.enabled` is `true` in config
203
+ 2. `level` and `problem_id` are valid
204
+ 3. KernelBench package is installed
205
+ 4. You have internet access (for HuggingFace dataset)
206
+
207
+ ### Generated Files Location
208
+
209
+ The framework creates temporary files in `/tmp/skydiscover_kernelbench_*/`:
210
+ - `initial_program.py`: Generated initial program
211
+ - Evaluator uses the existing `benchmarks/kernelbench/evaluator/` directory
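The Environment Variables section of the KernelBench README above says the variables are passed directly to the evaluator rather than set globally. One way to do that is to build a per-run copy of the environment, as in the sketch below. The resolver itself is not shown in this diff, so `build_evaluator_env` and its fallback values are hypothetical; only the variable names come from the README.

```python
import os


def build_evaluator_env(benchmark_cfg: dict, timeout: int = 300) -> dict:
    """Hypothetical helper: return a per-run environment for the evaluator."""
    env = os.environ.copy()  # copy, so the parent process environment is untouched
    env.update(
        {
            "KERNELBENCH_LEVEL": str(benchmark_cfg["level"]),
            "KERNELBENCH_PROBLEM_ID": str(benchmark_cfg["problem_id"]),
            "KERNELBENCH_EVAL_MODE": benchmark_cfg.get("eval_mode", "local"),
            "KERNELBENCH_GPU": benchmark_cfg.get("gpu", "H100"),
            "KERNELBENCH_NUM_CORRECT_TRIALS": str(benchmark_cfg.get("num_correct_trials", 5)),
            "KERNELBENCH_NUM_PERF_TRIALS": str(benchmark_cfg.get("num_perf_trials", 100)),
            "KERNELBENCH_TIMEOUT": str(timeout),
        }
    )
    return env


env = build_evaluator_env({"level": 2, "problem_id": 5})
```

A dictionary like this can be handed to `subprocess.run(..., env=env)`, so two concurrent runs with different `level` values never see each other's settings.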
benchmarks/kernelbench/__init__.py ADDED
File without changes
benchmarks/kernelbench/config.yaml ADDED
@@ -0,0 +1,86 @@
1
+ # KernelBench optimization benchmark configuration
2
+ # Usage: skydiscover-run evaluator/ -c config.yaml -s <strategy>
3
+ # Note: initial_program is automatically fetched from KernelBench dataset, based on the `level` and `problem_id` fields.
4
+
5
+ language: python
6
+
7
+ # Benchmark loader configuration
8
+ benchmark:
9
+ enabled: true
10
+ name: kernelbench
11
+ resolver: benchmarks.kernelbench.resolver
12
+
13
+ # Evaluator mode: set to false for native Python (no Docker), true for containerized
14
+ use_docker: true # Set to false when running on clusters without Docker/Podman privileges
15
+
16
+ # KernelBench problem specification
17
+ level: 1 # Problem difficulty level (1, 2, 3 or 4)
18
+ problem_id: 1 # Specific problem ID within the level
19
+
20
+ dataset_src: huggingface # 'huggingface' or 'local'
21
+ dataset_name: ScalingIntelligence/KernelBench
22
+
23
+ # Evaluation configuration
24
+ eval_mode: local # 'local' or 'modal'
25
+ gpu: H100 # GPU type for evaluation
26
+ num_correct_trials: 5 # Number of correctness validation runs
27
+ num_perf_trials: 100 # Number of performance measurement runs
28
+
29
+ diff_based_generation: true
30
+ max_iterations: 100
31
+ checkpoint_interval: 10
32
+ max_solution_length: 60000
33
+
34
+ llm:
35
+ api_base: "${BASE_URL}"
36
+ api_key: "${API_KEY}"
37
+ models:
38
+ - name: "gpt-5"
39
+ weight: 1.0
40
+ max_tokens: 32000
41
+ timeout: 600
42
+
43
+ prompt:
44
+ system_message: |-
45
+ You are an expert in GPU kernel optimization and PyTorch performance engineering with deep expertise
46
+ in writing high-performance CUDA kernels, Triton kernels, and optimized PyTorch operations.
47
+
48
+ PROBLEM SPECIFICATION:
49
+
50
+ Your task is to optimize a PyTorch neural network operation to achieve maximum speedup
51
+ over the baseline execution. The execution is evaluated on GPU hardware and compared against:
52
+ 1. PyTorch eager mode (baseline)
53
+ 2. torch.compile() optimization
54
+
55
+ PERFORMANCE METRICS:
56
+
57
+ 1. **speedup_over_eager**: Speedup compared to PyTorch eager execution (PRIMARY OBJECTIVE - maximize)
58
+ 2. **combined_score**: Same as speedup_over_eager (used for optimization)
59
+ 3. **speedup_over_compile**: Speedup compared to torch.compile() (SECONDARY - maximize)
60
+ 4. **kernel_time_ms**: Execution time of your optimized kernel in milliseconds (minimize)
61
+ 5. **ref_eager_time_ms**: Reference eager execution time in milliseconds (for comparison)
62
+
63
+ OPTIMIZATION STRATEGIES:
64
+
65
+ - Consider writing custom kernels in CUDA or Triton
66
+ - Use efficient memory access patterns (coalesced reads/writes)
67
+ - Minimize memory transfers between CPU and GPU
68
+ - Leverage tensor cores when applicable
69
+ - Use fused operations to reduce kernel launches
70
+ - Optimize for the specific GPU architecture (H100, A100, etc.)
71
+ - Use appropriate data types (fp16, bf16, fp32)
72
+ - Minimize synchronization points
73
+
74
+ TECHNICAL REQUIREMENTS:
75
+
76
+ - **Correctness**: Your implementation must produce numerically correct results
77
+ - **Determinism**: Use fixed random seeds if employing stochastic methods
78
+ - **Error handling**: Graceful handling of edge cases and invalid inputs
79
+ - **GPU compatibility**: Code must run on the specified GPU hardware
80
+
81
+ # Raise the SkyDiscover default of 500, which otherwise makes the model focus only on simplification
82
+ suggest_simplification_after_chars: 5000
83
+
84
+ evaluator:
85
+ timeout: 600
86
+ max_retries: 3
benchmarks/kernelbench/evaluator/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM python:3.10-slim
2
+ WORKDIR /benchmark
3
+
4
+ # Install system dependencies
5
+ RUN apt-get update && apt-get install -y \
6
+ git \
7
+ curl \
8
+ && rm -rf /var/lib/apt/lists/*
9
+
10
+ COPY requirements.txt .
11
+ RUN pip install --no-cache-dir -r requirements.txt
12
+
13
+ # wrapper.py provides backwards compatibility for old Python-based evaluators
14
+ # that define evaluate(program_path) -> dict. Bridges them to the container
15
+ # JSON protocol. Source of truth: skydiscover/evaluation/wrapper.py
16
+ COPY . .
17
+
18
+ # Download run_and_check.py from KernelBench repository (pinned to specific commit)
19
+ ARG KERNELBENCH_COMMIT=423217d
20
+ RUN curl -fsSL -o run_and_check.py \
21
+ "https://raw.githubusercontent.com/ScalingIntelligence/KernelBench/${KERNELBENCH_COMMIT}/scripts/run_and_check.py"
22
+
23
+ RUN chmod +x evaluate.sh
24
+
25
+ ENTRYPOINT ["./evaluate.sh"]
benchmarks/kernelbench/evaluator/evaluate.sh ADDED
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ PROGRAM="$1"
5
+
6
+ python /benchmark/evaluator.py "$PROGRAM"
benchmarks/kernelbench/evaluator/evaluator.py ADDED
@@ -0,0 +1,227 @@
1
+ """
2
+ Evaluator for KernelBench problems using kernelbench evaluation logic.
3
+
4
+ This evaluator can run inside a Docker container or as a native Python script,
5
+ and evaluates candidate kernel programs against KernelBench reference implementations.
6
+ """
7
+
8
+ import os
9
+ import re
10
+ import subprocess
11
+ import sys
12
+ import tempfile
13
+ import traceback
14
+ from pathlib import Path
15
+
16
+
17
+ def ensure_run_and_check(evaluator_dir: Path):
18
+ """Download run_and_check.py if not present.
19
+
20
+ This allows the evaluator to work in native Python mode without Docker,
21
+ automatically fetching the KernelBench evaluation script on first use.
22
+
23
+ Args:
24
+ evaluator_dir: Directory where the evaluator is located
25
+
26
+ Returns:
27
+ Path to run_and_check.py
28
+ """
29
+ run_and_check_path = evaluator_dir / "run_and_check.py"
30
+
31
+ if not run_and_check_path.exists():
32
+ import urllib.request
33
+
34
+ commit = "423217d"
35
+ url = f"https://raw.githubusercontent.com/ScalingIntelligence/KernelBench/{commit}/scripts/run_and_check.py"
36
+
37
+ print(
38
+ f"[INFO] Downloading run_and_check.py from KernelBench (commit {commit})...",
39
+ file=sys.stderr,
40
+ )
41
+ try:
42
+ urllib.request.urlretrieve(url, run_and_check_path)
43
+ print(f"[INFO] Downloaded to {run_and_check_path}", file=sys.stderr)
44
+ except Exception as e:
45
+ raise RuntimeError(f"Failed to download run_and_check.py: {e}") from e
46
+
47
+ return run_and_check_path
48
+
49
+
50
+ def evaluate(program_path: str):
51
+ """
52
+ Evaluate a candidate kernel program against the reference using run_and_check.py.
53
+
54
+ Args:
55
+ program_path: Path to the candidate program file
56
+
57
+ Returns:
58
+ Dictionary with combined_score (higher is better) and optional artifacts
59
+ """
60
+ try:
61
+ # Read configuration from environment variables
62
+ # These are injected by the benchmark setup
63
+ level = int(os.environ.get("KERNELBENCH_LEVEL", "1"))
64
+ problem_id = int(os.environ.get("KERNELBENCH_PROBLEM_ID", "1"))
65
+ eval_mode = os.environ.get("KERNELBENCH_EVAL_MODE", "local")
66
+ gpu = os.environ.get("KERNELBENCH_GPU", "H100")
67
+ num_correct_trials = int(os.environ.get("KERNELBENCH_NUM_CORRECT_TRIALS", "5"))
68
+ num_perf_trials = int(os.environ.get("KERNELBENCH_NUM_PERF_TRIALS", "100"))
69
+ timeout = int(os.environ.get("KERNELBENCH_TIMEOUT", "300"))
70
+
71
+ # Read the program; `class Model` is renamed to `ModelNew` below to match the KernelBench format
72
+ with open(program_path, "r") as f:
73
+ program_content = f.read()
74
+
75
+ is_triton = bool(
76
+ re.search(r"^(import triton|from triton)", program_content, flags=re.MULTILINE)
77
+ )
78
+
79
+ # Create a temporary file with ModelNew wrapper
80
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as tmp_file:
81
+ # Replace class Model with class ModelNew (if not already ModelNew)
82
+ converted_content = program_content
83
+ if "class ModelNew" not in converted_content:
84
+ converted_content = re.sub(
85
+ r"^class Model(?=[(:])", "class ModelNew", converted_content, flags=re.MULTILINE
86
+ )
87
+ # Fix super() calls - use modern Python 3 super() without arguments
88
+ converted_content = re.sub(r"super\(Model,\s*self\)", "super()", converted_content)
89
+ converted_content = re.sub(r"super\(Model,\s*cls\)", "super()", converted_content)
90
+
91
+ tmp_file.write(converted_content)
92
+ kernel_src_path = tmp_file.name
93
+
94
+ try:
95
+ # Ensure run_and_check.py is available (downloads if needed)
96
+ evaluator_dir = Path(__file__).parent
97
+ run_and_check_path = ensure_run_and_check(evaluator_dir)
98
+
99
+ # Build command to run run_and_check.py
100
+ cmd = [
101
+ sys.executable,
102
+ str(run_and_check_path),
103
+ "ref_origin=kernelbench",
104
+ f"level={level}",
105
+ f"problem_id={problem_id}",
106
+ f"kernel_src_path={kernel_src_path}",
107
+ f"eval_mode={eval_mode}",
108
+ f"gpu={gpu}",
109
+ f"num_correct_trials={num_correct_trials}",
110
+ f"num_perf_trials={num_perf_trials}",
111
+ f"timeout={timeout}",
112
+ "check_kernel=False", # Disable static checker to allow reference code
113
+ ]
114
+
115
+ # Setting the backend is important for KernelBench triton evaluation to work
116
+ if is_triton:
117
+ cmd.append("backend=triton")
118
+
119
+ # Set up environment
120
+ env = os.environ.copy()
121
+
122
+ # Run the evaluation from the evaluator directory
123
+ print(f"[INFO] Running evaluation command: {' '.join(cmd)}", file=sys.stderr)
124
+ result = subprocess.run(
125
+ cmd,
126
+ capture_output=True,
127
+ text=True,
128
+ timeout=timeout,
129
+ cwd=str(evaluator_dir),
130
+ env=env,
131
+ )
132
+ finally:
133
+ # Clean up temporary file
134
+ try:
135
+ os.unlink(kernel_src_path)
136
+ except Exception:
137
+ pass
138
+
139
+ # Parse the output to extract speedup
140
+ stdout = result.stdout
141
+ stderr = result.stderr
142
+
143
+ if result.returncode != 0:
144
+ print(
145
+ f"[ERROR] Evaluation failed with return code {result.returncode}", file=sys.stderr
146
+ )
147
+ print(f"[ERROR] stdout: {stdout}", file=sys.stderr)
148
+ print(f"[ERROR] stderr: {stderr}", file=sys.stderr)
149
+ return {
150
+ "combined_score": -100.0,
151
+ "error": f"Evaluation subprocess failed: {stderr[:500]}",
152
+ "return_code": result.returncode,
153
+ }
154
+
155
+ # Extract speedup from output
156
+ speedup_eager = None
157
+ speedup_compile = None
158
+ kernel_time = None
159
+ ref_eager_time = None
160
+
161
+ for line in stdout.split("\n"):
162
+ if "Speedup over eager:" in line:
163
+ match = re.search(r"([0-9.]+)x", line)
164
+ if match:
165
+ speedup_eager = float(match.group(1))
166
+ elif "Speedup over torch.compile:" in line:
167
+ match = re.search(r"([0-9.]+)x", line)
168
+ if match:
169
+ speedup_compile = float(match.group(1))
170
+ elif "Custom Kernel exec time:" in line:
171
+ match = re.search(r"([0-9.]+) ms", line)
172
+ if match:
173
+ kernel_time = float(match.group(1))
174
+ elif "PyTorch Reference Eager exec time:" in line:
175
+ match = re.search(r"([0-9.]+) ms", line)
176
+ if match:
177
+ ref_eager_time = float(match.group(1))
178
+
179
+ # If we found speedup, use it as the score
180
+ if speedup_eager is not None and speedup_eager > 0:
181
+ return {
182
+ "combined_score": float(speedup_eager),
183
+ "speedup_over_eager": speedup_eager,
184
+ "speedup_over_compile": speedup_compile,
185
+ "kernel_time_ms": kernel_time,
186
+ "ref_eager_time_ms": ref_eager_time,
187
+ "eval_mode": eval_mode,
188
+ "gpu": gpu,
189
+ }
190
+ else:
191
+ # Kernel failed correctness or didn't compile
192
+ # Extract only relevant output starting from [Eval]
193
+ stdout_excerpt = stdout
194
+ if "[Eval]" in stdout:
195
+ eval_start = stdout.find("[Eval]")
196
+ stdout_excerpt = stdout[eval_start:]
197
+
198
+ # Take last 5000 chars if too long
199
+ if len(stdout_excerpt) > 5000:
200
+ stdout_excerpt = stdout_excerpt[-5000:]
201
+
202
+ return {
203
+ "combined_score": -100.0,
204
+ "error": "Kernel failed correctness check or did not compile",
205
+ "stdout_excerpt": stdout_excerpt,
206
+ }
207
+
208
+ except subprocess.TimeoutExpired:
209
+ return {
210
+ "combined_score": -1.0,
211
+ "error": f"Evaluation timed out after {timeout} seconds",
212
+ }
213
+ except Exception as e:
214
+ traceback.print_exc()
215
+ return {
216
+ "combined_score": -100.0,
217
+ "error": f"Error during evaluation: {str(e)}",
218
+ "error_type": type(e).__name__,
219
+ }
220
+
221
+
222
+ if __name__ == "__main__":
223
+ # Backwards-compat: bridges old evaluate() -> dict to the container JSON
224
+ # protocol. wrapper.py is copied from skydiscover/evaluation/wrapper.py.
225
+ from wrapper import run
226
+
227
+ run(evaluate)
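The metric extraction in `evaluate()` above scans stdout line by line for four labels. The sketch below isolates that step into a standalone function so it can be tried without a GPU. The labels and regular expressions are taken from the evaluator; `parse_metrics`, the `FIELDS` table, and the sample log are illustrative, and the real `run_and_check.py` output may be formatted differently.

```python
import re

# Label text -> (metric name, value pattern), mirroring the checks in evaluate().
FIELDS = {
    "Speedup over eager:": ("speedup_over_eager", r"([0-9.]+)x"),
    "Speedup over torch.compile:": ("speedup_over_compile", r"([0-9.]+)x"),
    "Custom Kernel exec time:": ("kernel_time_ms", r"([0-9.]+) ms"),
    "PyTorch Reference Eager exec time:": ("ref_eager_time_ms", r"([0-9.]+) ms"),
}


def parse_metrics(stdout: str) -> dict:
    """Extract the numeric metrics that evaluate() looks for in the log."""
    metrics = {}
    for line in stdout.split("\n"):
        for label, (name, pattern) in FIELDS.items():
            if label in line:
                match = re.search(pattern, line)
                if match:
                    metrics[name] = float(match.group(1))
                break  # first matching label wins, like the if/elif chain
    return metrics


# Hypothetical log excerpt, written only to exercise the parser.
sample = """[Eval] Custom Kernel exec time: 1.25 ms
[Eval] PyTorch Reference Eager exec time: 2.50 ms
[Eval] Speedup over eager: 2.00x
[Eval] Speedup over torch.compile: 1.10x"""
metrics = parse_metrics(sample)
```

A missing `speedup_over_eager` key corresponds to the branch where the evaluator reports a failed correctness check or compile error.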
benchmarks/kernelbench/evaluator/requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ # KernelBench library with GPU support
2
+ kernelbench[gpu] @ git+https://github.com/ScalingIntelligence/KernelBench.git
benchmarks/kernelbench/evaluator/wrapper.py ADDED
@@ -0,0 +1,98 @@
1
+ """Backwards-compat wrapper for old Python-based evaluators.
2
+
3
+ Old-style evaluators define ``evaluate(program_path) -> dict``. This module
4
+ bridges that interface to the container JSON protocol expected by
5
+ ContainerizedEvaluator.
6
+
7
+ Usage — add this to the bottom of your evaluator.py::
8
+
9
+ if __name__ == "__main__":
10
+ from wrapper import run
11
+ run(evaluate)
12
+ """
13
+
14
+ import json
15
+ import sys
16
+ import traceback
17
+
18
+
19
+ def run(evaluate_fn):
20
+ """Call *evaluate_fn*, format the result as container-protocol JSON on stdout.
21
+
22
+ * Reads ``sys.argv[1]`` as the program path.
23
+ * Redirects stdout → stderr while *evaluate_fn* runs so that debug prints
24
+ don't contaminate the JSON output.
25
+ * Separates numeric metrics from non-numeric artifacts.
26
+ * Guarantees ``combined_score`` is always present in metrics.
27
+ """
28
+ if len(sys.argv) < 2:
29
+ print("Usage: evaluator.py <program_path>", file=sys.stderr)
30
+ sys.exit(1)
31
+
32
+ program_path = sys.argv[1]
33
+
34
+ # Redirect stdout → stderr during evaluation so debug prints from
35
+ # the evaluator don't contaminate the JSON output on stdout.
36
+ real_stdout = sys.stdout
37
+ sys.stdout = sys.stderr
38
+ try:
39
+ result = evaluate_fn(program_path)
40
+ except Exception as e:
41
+ sys.stdout = real_stdout
42
+ print(
43
+ json.dumps(
44
+ {
45
+ "status": "error",
46
+ "combined_score": 0.0,
47
+ "metrics": {"combined_score": 0.0},
48
+ "artifacts": {
49
+ "error": str(e),
50
+ "traceback": traceback.format_exc(),
51
+ },
52
+ }
53
+ )
54
+ )
55
+ return
56
+ sys.stdout = real_stdout
57
+
58
+ if not isinstance(result, dict):
59
+ print(
60
+ json.dumps(
61
+ {
62
+ "status": "error",
63
+ "combined_score": 0.0,
64
+ "metrics": {"combined_score": 0.0},
65
+ "artifacts": {
66
+ "error": f"evaluate() returned {type(result).__name__}, expected dict"
67
+ },
68
+ }
69
+ )
70
+ )
71
+ return
72
+
73
+ # Separate numeric metrics from non-numeric artifacts.
74
+ metrics = {}
75
+ artifacts = {}
76
+ for k, v in result.items():
77
+ if isinstance(v, bool):
78
+ metrics[k] = float(v)
79
+ elif isinstance(v, (int, float)):
80
+ metrics[k] = float(v)
81
+ elif isinstance(v, str):
82
+ artifacts[k] = v
83
+ elif isinstance(v, (list, dict)):
84
+ artifacts[k] = json.dumps(v)
85
+
86
+ if "combined_score" not in metrics:
87
+ metrics["combined_score"] = 0.0
88
+
89
+ status = "error" if "error" in artifacts else "success"
90
+ output = {
91
+ "status": status,
92
+ "combined_score": metrics["combined_score"],
93
+ "metrics": metrics,
94
+ }
95
+ if artifacts:
96
+ output["artifacts"] = artifacts
97
+
98
+ print(json.dumps(output))
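The metric/artifact split that `run` performs can be exercised in isolation. The following is a minimal sketch, not part of the commit: `split_result` is a hypothetical helper name, the sample dictionary is made up, and the function returns the payload instead of printing it so that it is easy to inspect.

```python
import json


def split_result(result):
    """Mirror the post-processing in wrapper.run on an evaluate() result dict.

    Hypothetical helper for illustration only; the real wrapper prints the
    JSON payload to stdout instead of returning it.
    """
    metrics, artifacts = {}, {}
    for key, value in result.items():
        if isinstance(value, (bool, int, float)):
            metrics[key] = float(value)  # bools become 0.0 / 1.0
        elif isinstance(value, str):
            artifacts[key] = value
        elif isinstance(value, (list, dict)):
            artifacts[key] = json.dumps(value)
        # Any other type (for example None) is dropped, as in the wrapper.

    metrics.setdefault("combined_score", 0.0)
    output = {
        "status": "error" if "error" in artifacts else "success",
        "combined_score": metrics["combined_score"],
        "metrics": metrics,
    }
    if artifacts:
        output["artifacts"] = artifacts
    return output


example = split_result(
    {"combined_score": 1.5, "correct": True, "notes": "ok", "shape": [2, 3]}
)
print(json.dumps(example, indent=2))
```

One consequence of this logic: an evaluator that returns an `"error"` string is reported with `status: "error"` even when it also supplies a numeric `combined_score`.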
benchmarks/kernelbench/requirements.txt ADDED
@@ -0,0 +1,4 @@
+ # KernelBench library for problem fetching (resolver)
+ # Note: The evaluator uses kernelbench[gpu] which includes GPU support
+ # For resolver-only usage (fetching problems), the base package is sufficient
+ kernelbench @ git+https://github.com/ScalingIntelligence/KernelBench.git
benchmarks/kernelbench/resolver.py ADDED
@@ -0,0 +1,136 @@
+ """KernelBench problem resolver for SkyDiscover.
+
+ This resolver fetches GPU kernel optimization problems from the KernelBench
+ dataset and generates the necessary files for SkyDiscover to run optimization.
+ """
+
+ import logging
+ from pathlib import Path
+ from typing import Any, Dict, Tuple
+
+ from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver
+ from skydiscover.utils.prepare import prepare_program
+
+ logger = logging.getLogger(__name__)
+
+
+ class KernelBenchResolver(BenchmarkResolver):
+     """Resolves KernelBench problems by fetching from dataset and generating files.
+
+     The resolver:
+     1. Fetches the reference implementation from KernelBench dataset
+     2. Generates initial_program.py with EVOLVE-BLOCK markers
+     3. Sets environment variables for the evaluator
+     4. Returns paths to the generated initial program and existing evaluator
+
+     Required config parameters:
+     - level: Problem difficulty level (1, 2, or 3)
+     - problem_id: Specific problem ID within the level
+
+     Optional config parameters:
+     - dataset_src: 'huggingface' (default) or 'local'
+     - dataset_name: HuggingFace dataset name (default: 'ScalingIntelligence/KernelBench')
+     - eval_mode: 'local' (default) or 'modal'
+     - gpu: GPU type for evaluation (default: 'H100')
+     - num_correct_trials: Number of correctness validation runs (default: 5)
+     - num_perf_trials: Number of performance measurement runs (default: 100)
+     """
+
+     def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
+         """Fetch KernelBench problem and generate initial_program + configure evaluator.
+
+         Args:
+             config: Configuration dictionary with 'level', 'problem_id', and optional params
+             output_dir: Directory where generated files will be placed
+
+         Returns:
+             BenchmarkResolution with initial program, evaluator path, and evaluator env vars
+         """
+         # Validate required parameters
+         level = config.get("level")
+         problem_id = config.get("problem_id")
+
+         if level is None or problem_id is None:
+             raise ValueError(
+                 "KernelBench resolver requires 'level' and 'problem_id' in config. "
+                 f"Got: level={level}, problem_id={problem_id}"
+             )
+
+         # Extract optional parameters with defaults
+         dataset_src = config.get("dataset_src", "huggingface")
+         dataset_name = config.get("dataset_name", "ScalingIntelligence/KernelBench")
+         eval_mode = config.get("eval_mode", "local")
+         gpu = config.get("gpu", "H100")
+         num_correct_trials = config.get("num_correct_trials", 5)
+         num_perf_trials = config.get("num_perf_trials", 100)
+
+         logger.info(f"Resolving KernelBench problem: level={level}, problem_id={problem_id}")
+         logger.info(f"Eval mode: {eval_mode}, GPU: {gpu}")
+
+         # Import KernelBench dataset utilities
+         try:
+             from kernelbench.dataset import construct_kernelbench_dataset
+         except ImportError as e:
+             raise ImportError(
+                 "KernelBench package not found. Install with: "
+                 "uv pip install 'kernelbench @ git+https://github.com/ScalingIntelligence/KernelBench.git'"
+             ) from e
+
+         # Fetch the problem from KernelBench dataset
+         try:
+             dataset = construct_kernelbench_dataset(
+                 level=level,
+                 source=dataset_src,
+                 dataset_name=dataset_name,
+             )
+             problem = dataset.get_problem_by_id(problem_id)
+         except Exception as e:
+             raise RuntimeError(
+                 f"Failed to fetch KernelBench problem (level={level}, id={problem_id}): {e}"
+             ) from e
+
+         logger.info(f"Fetched problem: {problem.name} (ID: {problem.problem_id})")
+
+         # Generate initial_program.py with EVOLVE-BLOCK markers using prepare_program
+         output_dir.mkdir(parents=True, exist_ok=True)
+         initial_program_path = prepare_program(
+             initial_program=problem.code, temp_dir=str(output_dir), temp_files=[]
+         )
+         logger.info(f"Generated initial program: {initial_program_path}")
+
+         use_docker = config.get("use_docker", True)
+
+         # Use evaluator.py file for native mode, directory for container mode
+         if use_docker:
+             evaluator_path = Path(__file__).parent / "evaluator"
+             logger.info("Using containerized evaluator (Docker required)")
+         else:
+             evaluator_path = Path(__file__).parent / "evaluator" / "evaluator.py"
+             logger.info("Using native Python evaluator (no Docker required)")
+
+         evaluator_env_vars = {
+             "KERNELBENCH_LEVEL": str(level),
+             "KERNELBENCH_PROBLEM_ID": str(problem_id),
+             "KERNELBENCH_EVAL_MODE": eval_mode,
+             "KERNELBENCH_GPU": gpu,
+             "KERNELBENCH_NUM_CORRECT_TRIALS": str(num_correct_trials),
+             "KERNELBENCH_NUM_PERF_TRIALS": str(num_perf_trials),
+             "KERNELBENCH_TIMEOUT": str(config.get("timeout", 300)),
+         }
+
+         mode_desc = "container" if use_docker else "native evaluator"
+         logger.info(f"Prepared evaluator environment for {mode_desc}:")
+         logger.info(f"  KERNELBENCH_LEVEL={level}")
+         logger.info(f"  KERNELBENCH_PROBLEM_ID={problem_id}")
+         logger.info(f"  KERNELBENCH_EVAL_MODE={eval_mode}")
+         logger.info(f"  KERNELBENCH_GPU={gpu}")
+
+         return BenchmarkResolution(
+             initial_program_path=str(initial_program_path),
+             evaluator_path=str(evaluator_path),
+             evaluator_env_vars=evaluator_env_vars,
+         )
+
+
+ # Module-level resolver instance
+ resolver = KernelBenchResolver()
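The mapping from resolver config to evaluator environment can be read off `resolve` without installing KernelBench. The sketch below reproduces only that mapping, using the defaults shown in the file; `build_evaluator_env` is a hypothetical name introduced for illustration and does not exist in the commit.

```python
def build_evaluator_env(config):
    """Map a resolver config dict to KERNELBENCH_* variables (illustrative only).

    Defaults are copied from KernelBenchResolver.resolve; every value is
    stringified because environment variables carry strings.
    """
    level = config.get("level")
    problem_id = config.get("problem_id")
    if level is None or problem_id is None:
        raise ValueError("'level' and 'problem_id' are required")

    return {
        "KERNELBENCH_LEVEL": str(level),
        "KERNELBENCH_PROBLEM_ID": str(problem_id),
        "KERNELBENCH_EVAL_MODE": config.get("eval_mode", "local"),
        "KERNELBENCH_GPU": config.get("gpu", "H100"),
        "KERNELBENCH_NUM_CORRECT_TRIALS": str(config.get("num_correct_trials", 5)),
        "KERNELBENCH_NUM_PERF_TRIALS": str(config.get("num_perf_trials", 100)),
        "KERNELBENCH_TIMEOUT": str(config.get("timeout", 300)),
    }


env = build_evaluator_env({"level": 1, "problem_id": 3, "gpu": "A100"})
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

Only `level` and `problem_id` are mandatory; everything else falls back to the defaults documented in the class docstring, plus a `timeout` default of 300 that appears only in the code.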
benchmarks/math/first_autocorr_ineq/evaluator/evaluate.sh ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ PROGRAM="$1"
+ # MODE ($2) accepted but ignored — override this file to use train/test splits.
+
+ python /benchmark/evaluator.py "$PROGRAM"
benchmarks/math/second_autocorr_ineq/evaluator/evaluate.sh ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ PROGRAM="$1"
+ # MODE ($2) accepted but ignored — override this file to use train/test splits.
+
+ python /benchmark/evaluator.py "$PROGRAM"
benchmarks/math/second_autocorr_ineq/evaluator/evaluator.py ADDED
@@ -0,0 +1,95 @@
+ # ===--------------------------------------------------------------------------------------===#
+ #
+ # This file implements the evaluator for the second autocorrelation inequality problem.
+ #
+ # ===--------------------------------------------------------------------------------------===#
+ #
+ # Some of the code in this file is adapted from:
+ #
+ # google-deepmind/alphaevolve_results:
+ # Licensed under the Apache License v2.0.
+ #
+ # ===--------------------------------------------------------------------------------------===#
+
+ import sys
+ import os
+ from importlib import __import__
+ import time
+ import numpy as np
+
+ BENCHMARK = 0.8962799441554086
+
+
+ def verify_c2_solution(f_values: np.ndarray, c2_achieved_from_opt: float, n_points: int):
+     """
+     Verifies the C2 lower bound solution using the rigorous, unitless, piecewise linear integral method.
+     """
+     if f_values.shape != (n_points,):
+         raise ValueError(f"Expected function values shape {(n_points,)}. Got {f_values.shape}.")
+
+     if np.any(f_values < -1e-6):  # Allow for small floating point errors
+         raise ValueError("Function must be non-negative.")
+
+     f_nonneg = np.maximum(f_values, 0.0)
+     # The raw, unscaled convolution is used
+     convolution = np.convolve(f_nonneg, f_nonneg, mode="full")
+
+     # Calculate the L2-norm squared: ||f*f||_2^2 via piecewise linear integration
+     num_conv_points = len(convolution)
+     x_points = np.linspace(-0.5, 0.5, num_conv_points + 2)
+     x_intervals = np.diff(x_points)
+     y_points = np.concatenate(([0], convolution, [0]))
+     l2_norm_squared = 0.0
+     for i in range(len(convolution) + 1):
+         y1, y2, h = y_points[i], y_points[i + 1], x_intervals[i]
+         interval_l2_squared = (h / 3) * (y1**2 + y1 * y2 + y2**2)
+         l2_norm_squared += interval_l2_squared
+
+     # Calculate the L1-norm: ||f*f||_1
+     # This is an approximation of the integral of the absolute value of the autoconvolution
+     norm_1 = np.sum(np.abs(convolution)) / (len(convolution) + 1)
+
+     # Calculate the infinity-norm: ||f*f||_inf
+     norm_inf = np.max(np.abs(convolution))
+
+     computed_c2 = l2_norm_squared / (norm_1 * norm_inf)
+
+     return computed_c2
+
+
+ def evaluate(program_path: str):
+     try:
+         abs_program_path = os.path.abspath(program_path)
+         program_dir = os.path.dirname(abs_program_path)
+         module_name = os.path.splitext(os.path.basename(program_path))[0]
+
+         try:
+             sys.path.insert(0, program_dir)
+             program = __import__(module_name)
+             start_time = time.time()
+             f_values, c2_achieved_from_opt, loss, n_points = program.run()
+             end_time = time.time()
+             eval_time = end_time - start_time
+         finally:
+             if program_dir in sys.path:
+                 sys.path.remove(program_dir)
+
+         c2_verified = verify_c2_solution(f_values, c2_achieved_from_opt, n_points)
+
+         return {
+             "c2": float(c2_verified),
+             "combined_score": float(c2_verified) / BENCHMARK,
+             "loss": float(loss),
+             "n_points": int(n_points),
+             "eval_time": float(eval_time),
+         }
+     except Exception as e:
+         return {"combined_score": 0.0, "error": str(e)}
+
+
+ if __name__ == "__main__":
+     # Backwards-compat: bridges old evaluate() -> dict to the container JSON
+     # protocol. wrapper.py is copied from skydiscover/evaluation/wrapper.py.
+     from wrapper import run
+
+     run(evaluate)
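The per-interval term `(h / 3) * (y1**2 + y1 * y2 + y2**2)` in `verify_c2_solution` is the exact integral of the squared linear interpolant between two neighbouring samples: for y(t) = y1 + (y2 - y1) * t with t in [0, 1], the integral of y(t)^2 over an interval of width h equals h * (y1^2 + y1*y2 + y2^2) / 3. The sketch below checks that identity numerically in plain Python; the function names and sample values are illustrative and not taken from the commit.

```python
def interval_l2_squared(y1, y2, h):
    """Closed-form integral of the squared linear interpolant over width h."""
    return (h / 3.0) * (y1 * y1 + y1 * y2 + y2 * y2)


def midpoint_reference(y1, y2, h, steps=10_000):
    """Approximate the same integral with a composite midpoint rule."""
    dx = h / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps          # relative position inside the interval
        y = y1 + (y2 - y1) * t         # linear interpolation between samples
        total += y * y * dx
    return total


closed_form = interval_l2_squared(1.0, 3.0, 0.5)
numeric = midpoint_reference(1.0, 3.0, 0.5)
print(f"closed form: {closed_form:.9f}  midpoint rule: {numeric:.9f}")
```

Because the closed form is exact for a piecewise-linear function, the loop in the evaluator introduces no quadrature error beyond the piecewise-linear model of the autoconvolution itself.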
benchmarks/math/second_autocorr_ineq/evaluator/requirements.txt ADDED
@@ -0,0 +1,15 @@
+ numpy
+ scipy
+ sympy
+ jax
+ torch
+ optax
+ scikit-learn
+ numba
+ pandas
+ matplotlib
+ plotly
+ networkx
+ cvxpy
+ autograd
+ pymoo
benchmarks/math/second_autocorr_ineq/evaluator/wrapper.py ADDED
@@ -0,0 +1,98 @@
+ """Backwards-compat wrapper for old Python-based evaluators.
+
+ Old-style evaluators define ``evaluate(program_path) -> dict``. This module
+ bridges that interface to the container JSON protocol expected by
+ ContainerizedEvaluator.
+
+ Usage — add this to the bottom of your evaluator.py::
+
+     if __name__ == "__main__":
+         from wrapper import run
+         run(evaluate)
+ """
+
+ import json
+ import sys
+ import traceback
+
+
+ def run(evaluate_fn):
+     """Call *evaluate_fn*, format the result as container-protocol JSON on stdout.
+
+     * Reads ``sys.argv[1]`` as the program path.
+     * Redirects stdout → stderr while *evaluate_fn* runs so that debug prints
+       don't contaminate the JSON output.
+     * Separates numeric metrics from non-numeric artifacts.
+     * Guarantees ``combined_score`` is always present in metrics.
+     """
+     if len(sys.argv) < 2:
+         print("Usage: evaluator.py <program_path>", file=sys.stderr)
+         sys.exit(1)
+
+     program_path = sys.argv[1]
+
+     # Redirect stdout → stderr during evaluation so debug prints from
+     # the evaluator don't contaminate the JSON output on stdout.
+     real_stdout = sys.stdout
+     sys.stdout = sys.stderr
+     try:
+         result = evaluate_fn(program_path)
+     except Exception as e:
+         sys.stdout = real_stdout
+         print(
+             json.dumps(
+                 {
+                     "status": "error",
+                     "combined_score": 0.0,
+                     "metrics": {"combined_score": 0.0},
+                     "artifacts": {
+                         "error": str(e),
+                         "traceback": traceback.format_exc(),
+                     },
+                 }
+             )
+         )
+         return
+     sys.stdout = real_stdout
+
+     if not isinstance(result, dict):
+         print(
+             json.dumps(
+                 {
+                     "status": "error",
+                     "combined_score": 0.0,
+                     "metrics": {"combined_score": 0.0},
+                     "artifacts": {
+                         "error": f"evaluate() returned {type(result).__name__}, expected dict"
+                     },
+                 }
+             )
+         )
+         return
+
+     # Separate numeric metrics from non-numeric artifacts.
+     metrics = {}
+     artifacts = {}
+     for k, v in result.items():
+         if isinstance(v, bool):
+             metrics[k] = float(v)
+         elif isinstance(v, (int, float)):
+             metrics[k] = float(v)
+         elif isinstance(v, str):
+             artifacts[k] = v
+         elif isinstance(v, (list, dict)):
+             artifacts[k] = json.dumps(v)
+
+     if "combined_score" not in metrics:
+         metrics["combined_score"] = 0.0
+
+     status = "error" if "error" in artifacts else "success"
+     output = {
+         "status": status,
+         "combined_score": metrics["combined_score"],
+         "metrics": metrics,
+     }
+     if artifacts:
+         output["artifacts"] = artifacts
+
+     print(json.dumps(output))
docs/.gitignore ADDED
@@ -0,0 +1,3 @@
+ node_modules/
+ .next/
+ .source/
docs/README.md ADDED
@@ -0,0 +1,13 @@
+ # SkyDiscover Documentation
+
+ Built with [Next.js](https://nextjs.org/) + [Fumadocs](https://fumadocs.vercel.app/).
+
+ ## Local Development
+
+ ```bash
+ cd docs
+ npm install
+ npm run dev
+ ```
+
+ Then open [http://localhost:3000](http://localhost:3000).
docs/app/api/search/route.ts ADDED
@@ -0,0 +1,6 @@
+ import { source } from '@/lib/source';
+ import { createFromSource } from 'fumadocs-core/search/server';
+
+ export const { GET } = createFromSource(source, {
+   language: 'english',
+ });
docs/app/docs/[[...slug]]/page.tsx ADDED
@@ -0,0 +1,43 @@
+ import { source } from '@/lib/source';
+ import { DocsBody, DocsDescription, DocsPage, DocsTitle } from 'fumadocs-ui/layouts/docs/page';
+ import { notFound } from 'next/navigation';
+ import { getMDXComponents } from '@/mdx-components';
+ import type { Metadata } from 'next';
+ import { createRelativeLink } from 'fumadocs-ui/mdx';
+
+ export default async function Page(props: PageProps<'/docs/[[...slug]]'>) {
+   const params = await props.params;
+   const page = source.getPage(params.slug);
+   if (!page) notFound();
+
+   const MDX = page.data.body;
+
+   return (
+     <DocsPage toc={page.data.toc} full={page.data.full}>
+       <DocsTitle>{page.data.title}</DocsTitle>
+       <DocsDescription>{page.data.description}</DocsDescription>
+       <DocsBody>
+         <MDX
+           components={getMDXComponents({
+             a: createRelativeLink(source, page),
+           })}
+         />
+       </DocsBody>
+     </DocsPage>
+   );
+ }
+
+ export async function generateStaticParams() {
+   return source.generateParams();
+ }
+
+ export async function generateMetadata(props: PageProps<'/docs/[[...slug]]'>): Promise<Metadata> {
+   const params = await props.params;
+   const page = source.getPage(params.slug);
+   if (!page) notFound();
+
+   return {
+     title: page.data.title,
+     description: page.data.description,
+   };
+ }
docs/app/docs/layout.tsx ADDED
@@ -0,0 +1,11 @@
+ import { source } from '@/lib/source';
+ import { DocsLayout } from 'fumadocs-ui/layouts/docs';
+ import { baseOptions } from '@/lib/layout.shared';
+
+ export default function Layout({ children }: LayoutProps<'/docs'>) {
+   return (
+     <DocsLayout tree={source.getPageTree()} {...baseOptions()}>
+       {children}
+     </DocsLayout>
+   );
+ }
docs/app/global.css ADDED
@@ -0,0 +1,3 @@
+ @import 'tailwindcss';
+ @import 'fumadocs-ui/css/neutral.css';
+ @import 'fumadocs-ui/css/preset.css';
docs/app/layout.tsx ADDED
@@ -0,0 +1,17 @@
+ import { RootProvider } from 'fumadocs-ui/provider/next';
+ import './global.css';
+ import { Inter } from 'next/font/google';
+
+ const inter = Inter({
+   subsets: ['latin'],
+ });
+
+ export default function Layout({ children }: LayoutProps<'/'>) {
+   return (
+     <html lang="en" className={inter.className} suppressHydrationWarning>
+       <body className="flex flex-col min-h-screen">
+         <RootProvider>{children}</RootProvider>
+       </body>
+     </html>
+   );
+ }
docs/app/page.tsx ADDED
@@ -0,0 +1,22 @@
+ import Link from 'next/link';
+
+ export default function HomePage() {
+   return (
+     <main className="flex min-h-screen flex-col items-center justify-center p-24">
+       <div className="z-10 max-w-5xl w-full items-center justify-center font-mono text-sm">
+         <h1 className="text-4xl font-bold mb-8 text-center">SkyDiscover Documentation</h1>
+         <p className="text-xl mb-8 text-center">
+           Documentation for SkyDiscover.
+         </p>
+         <div className="flex justify-center">
+           <Link
+             href="/docs"
+             className="inline-block bg-blue-600 hover:bg-blue-700 text-white font-bold py-3 px-6 rounded text-lg"
+           >
+             View Documentation
+           </Link>
+         </div>
+       </div>
+     </main>
+   );
+ }
docs/content/docs/getting-started/configuration.mdx ADDED
@@ -0,0 +1,117 @@
+ ---
+ title: "Configuration"
+ description: "Models, algorithms, config files, and the Python API."
+ ---
+
+ ## Pick a model
+
+ SkyDiscover uses [LiteLLM](https://docs.litellm.ai/) under the hood, so any
+ compatible model works:
+
+ ```bash
+ --model gpt-5 # OpenAI (default)
+ --model gemini/gemini-3-pro-preview # Google Gemini
+ --model anthropic/claude-sonnet-4-6 # Anthropic
+ --model ollama/llama3 --api-base http://localhost:11434/v1 # Local
+ ```
+
+ ## Pick an algorithm
+
+ | Algorithm | Flag | Best for |
+ |:---|:---|:---|
+ | **EvoX** | `--search evox` | Self-evolving strategy that adapts how it searches |
+ | **AdaEvolve** | `--search adaevolve` | Multi-island adaptive search with breakthroughs |
+ | **Best-of-N** | `--search best_of_n` | Quick baseline — generates N variants, keeps the best |
+ | **Top-K** | `--search topk` | Iteratively refines the top K solutions |
+ | **Beam Search** | `--search beam_search` | Breadth-first expansion of promising solutions |
+
+ ## Config files
+
+ Pass a YAML config with `--config` for full control:
+
+ ```yaml
+ max_iterations: 100
+ llm:
+   models:
+     - name: "gpt-5"
+       weight: 0.7
+     - name: "gemini/gemini-3-pro-preview"
+       weight: 0.3
+   max_tokens: 8192
+   timeout: 120
+
+ search:
+   type: "evox"
+
+ prompt:
+   system_message: |
+     You are an expert at optimizing algorithms.
+
+ evaluator:
+   timeout: 60
+ ```
+
+ Multi-model pools with weighted sampling let you blend different LLMs —
+ SkyDiscover samples from them probabilistically each iteration.
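Weighted sampling over a model pool can be pictured with the standard library alone. The sketch below is an illustration of the idea, not SkyDiscover's implementation, which this page does not show: `MODEL_POOL` and `sample_model` are hypothetical names, and the weights are the ones from the YAML example above.

```python
import random

# Pool copied from the YAML example; the structure is assumed for illustration.
MODEL_POOL = [
    {"name": "gpt-5", "weight": 0.7},
    {"name": "gemini/gemini-3-pro-preview", "weight": 0.3},
]


def sample_model(pool, rng):
    """Pick one model name with probability proportional to its weight."""
    names = [entry["name"] for entry in pool]
    weights = [entry["weight"] for entry in pool]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same counts
counts = {entry["name"]: 0 for entry in MODEL_POOL}
for _ in range(10_000):
    counts[sample_model(MODEL_POOL, rng)] += 1
print(counts)
```

Over many iterations the share of calls going to each model approaches its weight, so a 0.7/0.3 pool sends roughly seven in ten requests to the first model.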
+
+ ## Live dashboard
+
+ Add this to your config to get a real-time monitoring dashboard:
+
+ ```yaml
+ monitor:
+   enabled: true
+ ```
+
+ The dashboard URL prints at run start — it shows a scatter plot of all
+ programs, code diffs, metrics, and lets you give human feedback to steer
+ evolution in real time.
+
+ ## Python API
+
+ You can also run SkyDiscover programmatically:
+
+ ```python
+ from skydiscover import run_discovery
+
+ result = run_discovery(
+     initial_program="initial_program.py",
+     evaluator="evaluator.py",
+     search="evox",
+     model="gpt-5",
+     iterations=50,
+ )
+
+ print(result.best_score, result.best_solution)
+ ```
+
+ Or use the inline convenience wrapper for quick experiments:
+
+ ```python
+ from skydiscover import discover_solution
+
+ result = discover_solution(
+     initial_solution="def solve(x): return x",
+     evaluator=lambda path: {"combined_score": run_tests(path)},
+     iterations=50,
+     search="evox",
+ )
+ ```
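The `evaluator="evaluator.py"` argument above points at a file. The evaluators added elsewhere in this commit expose `evaluate(program_path) -> dict` with a `combined_score` key, so a minimal file-based evaluator might look like the sketch below. The `solve` entry point and the test cases are made up for illustration, and whether `run_discovery` requires exactly this signature is assumed from those evaluators rather than confirmed by this page.

```python
import importlib.util


def evaluate(program_path):
    """Score a candidate program by the fraction of test cases it passes."""
    try:
        # Load the candidate file as a throwaway module.
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Made-up cases: the candidate is expected to return 2 * x.
        cases = [(0, 0), (2, 4), (5, 10)]
        passed = sum(1 for x, want in cases if module.solve(x) == want)
        return {"combined_score": passed / len(cases), "passed": passed}
    except Exception as exc:
        # Report failures as data instead of raising, like the evaluators above.
        return {"combined_score": 0.0, "error": str(exc)}
```

Returning an `error` string instead of raising keeps a broken candidate from aborting the run and leaves the failure reason available alongside the score.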
+
+ ## CLI reference
+
+ ```
+ uv run skydiscover-run [INITIAL_PROGRAM] EVALUATOR [options]
+ ```
+
+ | Flag | Description |
+ |:---|:---|
+ | `-c, --config FILE` | Config YAML |
+ | `-i, --iterations N` | Number of iterations |
+ | `-m, --model MODEL` | LLM model (overrides config) |
+ | `-s, --search TYPE` | Search algorithm |
+ | `-o, --output DIR` | Output directory |
+ | `--api-base URL` | Override LLM API endpoint |
+ | `--checkpoint DIR` | Resume from checkpoint |
+ | `--agentic` | Enable agentic mode (LLM can read your files) |
+ | `-l, --log-level LEVEL` | DEBUG, INFO, WARNING, or ERROR |
docs/content/docs/getting-started/index.mdx ADDED
@@ -0,0 +1,11 @@
+ ---
+ title: "Getting Started"
+ description: "Set up SkyDiscover, run your first discovery task, and learn how to configure it."
+ ---
+
+ Get up and running with SkyDiscover in a few minutes. This section covers
+ everything you need to go from zero to your first AI-driven discovery:
+
+ - **[Installation](/docs/getting-started/installation)** — install SkyDiscover and set up your API keys
+ - **[Quick Start](/docs/getting-started/quick-start)** — run your first task and understand the core workflow
+ - **[Configuration](/docs/getting-started/configuration)** — models, algorithms, config files, and the Python API