# Benchmarks
~200 optimization tasks across math, systems, algorithms, and reasoning.
## Quick Start
Install dependencies first:

```bash
# Base
uv sync

# Choose extras based on what you run
uv sync --extra external             # openevolve/gepa/shinkaevolve
uv sync --extra math                 # math benchmarks
uv sync --extra adrs                 # ADRS benchmarks
uv sync --extra frontier-cs          # frontier-cs-eval benchmark
uv sync --extra prompt-optimization  # HotPotQA prompt benchmark
```

If a benchmark directory has a `requirements.txt`, also run:

```bash
uv pip install -r benchmarks/<task>/requirements.txt
```
Then run:

```bash
export OPENAI_API_KEY="..."

# Containerized benchmark (recommended; evaluator runs in Docker)
uv run skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
  benchmarks/math/circle_packing_rect/evaluator \
  -c benchmarks/math/circle_packing_rect/config.yaml \
  -s best_of_n -i 50

# Plain Python evaluator (runs on host)
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s best_of_n -i 100
```
## Tasks

| Benchmark | Domain | Tasks | What it tests |
|---|---|---|---|
| `math/` | Math | 14 | Circle packing, Erdos problems, autocorrelation inequalities, geometric optimization |
| `ADRS/` | Systems | 5 | Cloud scheduling, MoE load balancing, model placement, column reordering, transaction scheduling |
| `gpu_mode/` | GPU | 4 | Triton kernel optimization (vecadd, grayscale, trimul, MLA decode) |
| `frontier-cs-eval/` | Algorithms | 172 | Competitive programming (Frontier-CS benchmark, Docker judge) |
| `arc_benchmark/` | Reasoning | – | ARC-AGI visual reasoning tasks |
| `ale_bench/` | Algorithms | 10 | Algorithmic contest problems (C++, ALE-Bench) |
| `image_gen/` | Creative | 1 | AI image generation evolution |
| `prompt_optimization/` | Prompts | 1 | Evolve natural-language prompts, not code (HotPotQA) |
Each benchmark directory has its own README with setup and run instructions.
## Structure
There are three ways to set up a benchmark: a containerized evaluator (recommended for new benchmarks), a Harbor task (for external benchmark suites), or a plain Python evaluator (simplest).
### Containerized evaluator (recommended)

```
<task>/
├── initial_program.py    # Starting solution
├── config.yaml           # System prompt + search/evaluator settings
└── evaluator/            # Self-contained Docker benchmark
    ├── Dockerfile
    ├── evaluate.sh       # Entrypoint (receives solution path + mode)
    ├── evaluator.py      # Scoring logic
    ├── requirements.txt  # Python dependencies
    └── ...               # Any other data/files the evaluator needs
```

The `evaluator/` directory is the Docker build context. Everything inside it gets copied into the image: data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when `evaluation_file` points to a directory containing a `Dockerfile` and `evaluate.sh`.
### Plain Python evaluator

```
<task>/
├── initial_program.py  # Starting solution
├── evaluator.py        # Scoring function (returns combined_score)
└── config.yaml         # System prompt + search/evaluator settings
```
Simpler but runs evaluator code directly on the host. Fine for pure-Python tasks with no system dependencies.
### Benchmark resolvers (dynamic problem loading)
Some benchmarks support dynamic problem loading through a resolver pattern. Instead of providing a static initial_program.py, the resolver fetches problems from an external dataset based on configuration parameters.
This is useful for benchmark suites with many problems (e.g., KernelBench has hundreds of GPU kernel optimization tasks). The resolver pattern allows you to:
- Select specific problems via config parameters (e.g., difficulty level, problem ID)
- Automatically generate the initial program from the benchmark dataset
- Configure evaluator settings based on the problem specification
#### Using a benchmark with a resolver

Benchmarks that support resolvers include a `benchmark` section in their `config.yaml`:

```yaml
benchmark:
  enabled: true                              # Enable benchmark loader
  name: kernelbench                          # Benchmark name (for logging)
  resolver: benchmarks.kernelbench.resolver  # Python module path

  # Benchmark-specific parameters
  level: 2                                   # Example: difficulty level
  problem_id: 5                              # Example: specific problem ID
```
When running such a benchmark, you don't need to provide an `initial_program` argument:

```bash
uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
  -c benchmarks/kernelbench/config.yaml \
  --search adaevolve \
  --iterations 50
```
The resolver automatically fetches the problem and generates the initial program based on the config parameters.
#### Implementing a benchmark resolver

To add resolver support to a new benchmark:

1. Create `benchmarks/your_benchmark/resolver.py` implementing the `BenchmarkResolver` interface:
```python
from pathlib import Path
from typing import Any, Dict

from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver


class YourBenchmarkResolver(BenchmarkResolver):
    def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
        """
        Fetch the problem and generate the initial program.

        Args:
            config: The benchmark section from config.yaml
            output_dir: Directory where generated files should be placed

        Returns:
            BenchmarkResolution containing:
              - initial_program_path: Path to the generated initial program
              - evaluator_path: Path to the evaluator
              - evaluator_env_vars: Dict of environment variables for the evaluator
        """
        # 1. Fetch the problem from the dataset based on config parameters
        # 2. Generate initial_program.py with EVOLVE-BLOCK markers
        # 3. Prepare evaluator environment variables (returned, not set globally)
        # 4. Return a BenchmarkResolution with paths and env vars
        initial_program_path = output_dir / "initial_program.py"
        evaluator_path = Path(__file__).parent / "evaluator"
        evaluator_env_vars = {
            "BENCHMARK_PARAM": "value",
            # Add benchmark-specific configuration here
        }
        return BenchmarkResolution(
            initial_program_path=str(initial_program_path),
            evaluator_path=str(evaluator_path),
            evaluator_env_vars=evaluator_env_vars,
        )


# Module-level resolver instance
resolver = YourBenchmarkResolver()
```
2. Add a `benchmark` section to `config.yaml` with your resolver module path and all benchmark-specific parameters.
3. Use the same CLI pattern as above (no `initial_program` argument needed).
See the implementation in:

- `skydiscover/benchmarks/base.py` – base resolver interface
- `benchmarks/kernelbench/resolver.py` – KernelBench example implementation
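For orientation, here is a minimal sketch of what that base interface could look like. The shapes are inferred from the resolver example above; the actual definitions in `skydiscover/benchmarks/base.py` may differ.

```python
# Hypothetical sketch only -- see skydiscover/benchmarks/base.py for the real interface.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class BenchmarkResolution:
    initial_program_path: str  # Generated seed program
    evaluator_path: str        # Evaluator file or directory
    evaluator_env_vars: Dict[str, str] = field(default_factory=dict)


class BenchmarkResolver(ABC):
    @abstractmethod
    def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
        """Fetch a problem and return paths to the generated program and evaluator."""
```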
## Adding a Benchmark
### Option 1: Containerized evaluator (recommended)

Containerized evaluators run inside Docker, so they can have arbitrary dependencies, system packages, data files, etc. without polluting the host. Only two files are required: `Dockerfile` and `evaluate.sh`.
#### `evaluate.sh`

The entrypoint that SkyDiscover calls. It receives two arguments:

```bash
#!/usr/bin/env bash
set -euo pipefail

PROGRAM="$1"  # Path to the candidate solution inside the container
MODE="$2"     # "train" (fast, iterative) or "test" (authoritative, final)

python /benchmark/evaluator.py "$PROGRAM"
```
- **train** mode is called during the optimization loop, so it should be relatively fast.
- **test** mode is called once at the end for the best solution, and should be the full, authoritative evaluation.

Evaluators that don't need the distinction can ignore `$MODE`.
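If you do want mode-aware scoring, one option is to forward the mode into the evaluator. The sketch below assumes your `evaluate.sh` passes `"$MODE"` as a second argument (the example above passes only the program path), and the `run_case` helper is a placeholder:

```python
# evaluator.py -- illustrative sketch only; assumes evaluate.sh runs:
#   python /benchmark/evaluator.py "$PROGRAM" "$MODE"
import sys


def run_case(program_path: str, case_id: int) -> bool:
    """Placeholder for a single test case -- replace with your benchmark's real logic."""
    return True


def evaluate(program_path: str, mode: str = "train") -> dict:
    # Fast, partial evaluation while searching; full suite for the final check.
    num_cases = 5 if mode == "train" else 100
    passed = sum(run_case(program_path, i) for i in range(num_cases))
    score = passed / num_cases
    return {"combined_score": score, "metrics": {"combined_score": score}}


if __name__ == "__main__":
    mode = sys.argv[2] if len(sys.argv) > 2 else "train"
    evaluate(sys.argv[1], mode)
```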
#### `evaluate.sh` output (JSON protocol)

`evaluate.sh` must write a single JSON object to stdout:

```json
{
  "status": "success",
  "combined_score": 0.73,
  "metrics": {"combined_score": 0.73, "accuracy": 0.85, "speed": 1.2},
  "artifacts": {"error": "...", "details": "..."}
}
```

- `combined_score` (float, required): the primary optimization target.
- `metrics` (dict of string → float): all numeric scores. Must include `combined_score`.
- `artifacts` (dict of string → string, optional): non-numeric context (errors, diagnostics).
- `status`: `"success"`, `"error"`, or `"timeout"`.

Any output to stderr is captured for debugging but does not affect scoring. If your evaluator prints debug output, make sure it goes to stderr, not stdout.
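In a Python evaluator this usually means routing debug prints to stderr and reserving stdout for the single JSON result. A minimal sketch of that pattern (the `compute_score` helper is a hypothetical stand-in for your scoring logic):

```python
import json
import sys


def compute_score(program_path: str) -> float:
    """Hypothetical scoring helper -- replace with your benchmark's logic."""
    return 0.73


if __name__ == "__main__":
    program_path = sys.argv[1]
    print("starting evaluation...", file=sys.stderr)  # debug output goes to stderr
    try:
        score = compute_score(program_path)
        result = {
            "status": "success",
            "combined_score": score,
            "metrics": {"combined_score": score},
        }
    except Exception as exc:
        result = {
            "status": "error",
            "combined_score": 0.0,
            "metrics": {"combined_score": 0.0},
            "artifacts": {"error": str(exc)},
        }
    print(json.dumps(result))  # the only thing written to stdout
```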
#### Dockerfile

A standard Dockerfile. The only requirement is that `evaluate.sh` is executable:

```dockerfile
FROM python:3.12-slim
WORKDIR /benchmark

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh

ENTRYPOINT ["./evaluate.sh"]
```
#### Migrating an existing Python evaluator

If you have an existing `evaluate(program_path) -> dict` function, you can wrap it with the backwards-compatibility wrapper:

1. Copy `skydiscover/evaluation/wrapper.py` into your `evaluator/` directory.
2. Add this to the bottom of your `evaluator.py`:

```python
if __name__ == "__main__":
    from wrapper import run
    run(evaluate)
```
The wrapper handles stdout redirection (so debug prints don't corrupt JSON), error formatting, and metric/artifact separation.
#### Running a containerized benchmark

Point `evaluation_file` at the `evaluator/` directory:

```bash
skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
  benchmarks/math/circle_packing_rect/evaluator \
  -c benchmarks/math/circle_packing_rect/config.yaml \
  -s best_of_n -i 50
```
SkyDiscover will automatically build the Docker image, start a persistent container, and run evaluations inside it.
#### Example to copy

Simple containerized benchmark: `math/heilbronn_triangle/`
### Option 2: Harbor tasks (external benchmarks)
SkyDiscover natively supports Harbor-format tasks. This lets you run external benchmark suites like AlgoTune (154 algorithm optimization tasks) without any conversion.
A Harbor task directory looks like this:

```
task_dir/
├── task.toml          # Metadata, timeouts
├── instruction.md     # Problem description (shown to the LLM as context)
├── environment/
│   └── Dockerfile     # Container image definition
├── tests/
│   ├── test.sh        # Verification entrypoint
│   └── ...            # Supporting test files (evaluator.py, test data, etc.)
└── solution/          # Reference solution (optional, not shown to LLM)
    └── solve.sh
```
SkyDiscover auto-detects Harbor tasks when the directory contains `instruction.md`, `tests/`, and `environment/Dockerfile`. The `instruction.md` is used as LLM context, solutions are injected at the path extracted from `solution/solve.sh` (or `instruction.md` as a fallback), and rewards are read from `/logs/verifier/reward.txt` or `reward.json`.
#### Tested Harbor datasets
SkyDiscover has been tested with the following Harbor registry benchmarks:
| Dataset | Tasks | Domain | Language | Install |
|---|---|---|---|---|
| algotune | 154 | Algorithm optimization (speedup scoring) | Python | harbor datasets download algotune@1.0 |
| evoeval | 100 | Code generation (evolved from HumanEval) | Python | harbor datasets download evoeval@1.0 |
| humanevalfix | 164 | Code repair (fix buggy functions) | Python | harbor datasets download humanevalfix@1.0 |
| bigcodebench-hard-complete | 145 | Python programming (reward-based) | Python | harbor datasets download bigcodebench-hard-complete@1.0.0 |
| livecodebench | 100 | Competitive programming (stdin/stdout) | Python | harbor datasets download livecodebench@6.0 |
| codepde | 5 | Scientific computing (PDE solvers) | Python | harbor datasets download codepde@1.0 |
| crustbench | 100 | C-to-safe-Rust transpilation | Rust | harbor datasets download crustbench@1.0 |
| usaco | 304 | Competition programming (USACO) | Python | harbor datasets download usaco@2.0 |
Any Harbor-compatible dataset should work; the evaluator automatically extracts the solution path from the task's `solution/solve.sh` script.
#### Running a Harbor task
1. Install the Harbor CLI and download a dataset:

```bash
pip install harbor
harbor datasets download algotune@1.0 -o /tmp/algotune
```

This downloads all 154 AlgoTune tasks. Each task is in a subdirectory like `/tmp/algotune/<id>/algotune-<name>/`.
2. Run SkyDiscover, pointing at the task directory. The LLM uses `instruction.md` as context and generates solutions from scratch:

```bash
# AlgoTune (algorithm optimization)
TASK=/tmp/algotune/2HHbpvzVPo2qakaoGyAVS2/algotune-set-cover
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 10

# EvoEval (code generation)
harbor datasets download evoeval@1.0 -o /tmp/evoeval
TASK=/tmp/evoeval/<id>/<task-name>
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5

# HumanEvalFix (code repair)
harbor datasets download humanevalfix@1.0 -o /tmp/humanevalfix
TASK=/tmp/humanevalfix/<id>/<task-name>
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5
```
SkyDiscover will build the Docker image from environment/Dockerfile, upload tests/ into the container, and start optimizing.
Note: Some datasets have heavy Dockerfiles. AlgoTune needs ~10GB disk and 16GB RAM (torch, jax, scipy). BigCodeBench installs R, GDAL, and many system packages. First builds are slow; subsequent runs use Docker cache.
#### Other Harbor datasets

Any Harbor-compatible dataset works the same way. Run `harbor datasets list` to see all available datasets, then `harbor datasets download <name>` to fetch them.
### Option 3: Plain Python evaluator
For simple tasks with no system dependencies, you can use a plain Python evaluator that runs on the host.
The evaluator (`evaluator.py`) scores whatever the LLM produces:

```python
def evaluate(program_path: str) -> dict:
    # load and run the program, compute a score
    return {"combined_score": 0.73}  # combined_score is required; other metrics are optional
```

`program_path` is a `.py` file for code tasks or a `.txt` file for prompt tasks. On failure, return `{"combined_score": 0.0, "error": "..."}` instead of raising.
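For a code task, a common pattern is to import the candidate program dynamically and score one of its functions. The sketch below assumes the seed program defines `solve(input_data)` as in the example that follows; the test cases and scoring rule are placeholders:

```python
import importlib.util
import traceback


def evaluate(program_path: str) -> dict:
    """Host-side evaluator sketch: load the candidate program and score its solve()."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Placeholder test cases -- replace with your task's real inputs/expected outputs.
        cases = [(1, 1), (2, 2), (3, 3)]
        passed = sum(1 for x, expected in cases if module.solve(x) == expected)
        score = passed / len(cases)
        return {"combined_score": score, "num_passed": passed}
    except Exception:
        # Never raise: report failures as a zero score with an error message.
        return {"combined_score": 0.0, "error": traceback.format_exc()}
```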
#### Seed program

The seed (`initial_program.py` or `initial_prompt.txt`) is the starting solution. Mark the region for the LLM to evolve:

```python
# EVOLVE-BLOCK-START
def solve(input_data):
    return input_data  # LLM will improve this
# EVOLVE-BLOCK-END
```
For prompt optimization, use a plain .txt file with no markers.
#### Config

The config (`config.yaml`) sets the system prompt and search settings. For prompt optimization, set `language: text` and `diff_based_generation: false`.

Simple prompt example to copy: `prompt_optimization/hotpot_qa/`