
Benchmarks

~200 optimization tasks across math, systems, algorithms, and reasoning.

Quick Start

Install dependencies first:

# Base
uv sync

# Choose extras based on what you run
uv sync --extra external            # openevolve/gepa/shinkaevolve
uv sync --extra math                # math benchmarks
uv sync --extra adrs                # ADRS benchmarks
uv sync --extra frontier-cs         # frontier-cs-eval benchmark
uv sync --extra prompt-optimization # HotPotQA prompt benchmark

If a benchmark directory has requirements.txt, also run:

uv pip install -r benchmarks/<task>/requirements.txt

Then run:

export OPENAI_API_KEY="..."

# Containerized benchmark (recommended; evaluator runs in Docker)
uv run skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
  benchmarks/math/circle_packing_rect/evaluator \
  -c benchmarks/math/circle_packing_rect/config.yaml \
  -s best_of_n -i 50

# Plain Python evaluator (runs on host)
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
  benchmarks/math/circle_packing/evaluator.py \
  -c benchmarks/math/circle_packing/config.yaml \
  -s best_of_n -i 100

Tasks

| Benchmark | Domain | Tasks | What it tests |
|---|---|---|---|
| math/ | Math | 14 | Circle packing, Erdos problems, autocorrelation inequalities, geometric optimization |
| ADRS/ | Systems | 5 | Cloud scheduling, MoE load balancing, model placement, column reordering, transaction scheduling |
| gpu_mode/ | GPU | 4 | Triton kernel optimization (vecadd, grayscale, trimul, MLA decode) |
| frontier-cs-eval/ | Algorithms | 172 | Competitive programming (Frontier-CS benchmark, Docker judge) |
| arc_benchmark/ | Reasoning | — | ARC-AGI visual reasoning tasks |
| ale_bench/ | Algorithms | 10 | Algorithmic contest problems (C++, ALE-Bench) |
| image_gen/ | Creative | 1 | AI image generation evolution |
| prompt_optimization/ | Prompts | 1 | Evolve natural-language prompts, not code (HotPotQA) |

Each benchmark directory has its own README with setup and run instructions.

Structure

There are three ways to set up a benchmark: a containerized evaluator (recommended for new benchmarks), a Harbor task (for external benchmark suites), or a plain Python evaluator (simplest).

Containerized evaluator (recommended)

<task>/
├── initial_program.py       # Starting solution
├── config.yaml              # System prompt + search/evaluator settings
└── evaluator/               # Self-contained Docker benchmark
    ├── Dockerfile
    ├── evaluate.sh          # Entrypoint (receives solution path + mode)
    ├── evaluator.py         # Scoring logic
    ├── requirements.txt     # Python dependencies
    └── ...                  # Any other data/files the evaluator needs

The evaluator/ directory is the Docker build context. Everything inside it gets copied into the image: data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when evaluation_file points to a directory containing a Dockerfile and evaluate.sh.
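In code terms, the detection rule amounts to a check like the one below. This is an illustrative sketch of the stated rule only, not SkyDiscover's actual implementation:

from pathlib import Path

def is_containerized_evaluator(evaluation_file: str) -> bool:
    # A directory containing both a Dockerfile and an evaluate.sh entrypoint
    # is treated as a containerized evaluator.
    path = Path(evaluation_file)
    return (
        path.is_dir()
        and (path / "Dockerfile").is_file()
        and (path / "evaluate.sh").is_file()
    )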

Plain Python evaluator

<task>/
├── initial_program.py   # Starting solution
├── evaluator.py         # Scoring function (returns combined_score)
└── config.yaml          # System prompt + search/evaluator settings

Simpler but runs evaluator code directly on the host. Fine for pure-Python tasks with no system dependencies.

Benchmark resolvers (dynamic problem loading)

Some benchmarks support dynamic problem loading through a resolver pattern. Instead of providing a static initial_program.py, the resolver fetches problems from an external dataset based on configuration parameters.

This is useful for benchmark suites with many problems (e.g., KernelBench has hundreds of GPU kernel optimization tasks). The resolver pattern allows you to:

  1. Select specific problems via config parameters (e.g., difficulty level, problem ID)
  2. Automatically generate the initial program from the benchmark dataset
  3. Configure evaluator settings based on the problem specification

Using a benchmark with a resolver

Benchmarks that support resolvers include a benchmark section in their config.yaml:

benchmark:
  enabled: true                    # Enable benchmark loader
  name: kernelbench                # Benchmark name (for logging)
  resolver: benchmarks.kernelbench.resolver  # Python module path
  
  # Benchmark-specific parameters
  level: 2                         # Example: difficulty level
  problem_id: 5                    # Example: specific problem ID

When running such a benchmark, you don't need to provide an initial_program argument:

uv run skydiscover-run benchmarks/kernelbench/evaluator/ \
  -c benchmarks/kernelbench/config.yaml \
  --search adaevolve \
  --iterations 50

The resolver automatically fetches the problem and generates the initial program based on the config parameters.
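Conceptually, the runner imports the configured resolver module and calls its module-level resolver instance. The sketch below is illustrative (the YAML loading and output directory are assumptions); only the resolver interface matches what is documented below:

import importlib
from pathlib import Path

import yaml

# Load the config shown above and pull out its `benchmark:` section.
cfg = yaml.safe_load(Path("benchmarks/kernelbench/config.yaml").read_text())
bench_cfg = cfg["benchmark"]

# Import the module named in `resolver:` and ask its module-level instance
# to fetch the problem and generate the initial program.
out_dir = Path("/tmp/resolved")
out_dir.mkdir(parents=True, exist_ok=True)
module = importlib.import_module(bench_cfg["resolver"])   # benchmarks.kernelbench.resolver
resolution = module.resolver.resolve(bench_cfg, out_dir)

print(resolution.initial_program_path)  # generated from the dataset
print(resolution.evaluator_path)        # evaluator to run against it
print(resolution.evaluator_env_vars)    # per-problem settings for the evaluator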

Implementing a benchmark resolver

To add resolver support to a new benchmark:

  1. Create benchmarks/your_benchmark/resolver.py implementing the BenchmarkResolver interface:
from pathlib import Path
from typing import Any, Dict

from skydiscover.benchmarks.base import BenchmarkResolution, BenchmarkResolver

class YourBenchmarkResolver(BenchmarkResolver):
    def resolve(self, config: Dict[str, Any], output_dir: Path) -> BenchmarkResolution:
        """
        Fetch problem and generate initial program.
        
        Args:
            config: The benchmark section from config.yaml
            output_dir: Directory where generated files should be placed
            
        Returns:
            BenchmarkResolution containing:
                - initial_program_path: Path to generated initial program
                - evaluator_path: Path to evaluator
                - evaluator_env_vars: Dict of environment variables for the evaluator
        """
        # 1. Fetch problem from dataset based on config parameters
        # 2. Generate initial_program.py with EVOLVE-BLOCK markers
        # 3. Prepare evaluator environment variables (returned, not set globally)
        # 4. Return BenchmarkResolution with paths and env vars
        
        initial_program_path = output_dir / "initial_program.py"
        evaluator_path = Path(__file__).parent / "evaluator"
        evaluator_env_vars = {
            "BENCHMARK_PARAM": "value",
            # Add benchmark-specific configuration here
        }
        
        return BenchmarkResolution(
            initial_program_path=str(initial_program_path),
            evaluator_path=str(evaluator_path),
            evaluator_env_vars=evaluator_env_vars,
        )

# Module-level resolver instance
resolver = YourBenchmarkResolver()
  2. Add a benchmark section to config.yaml with your resolver module path and all benchmark-specific parameters

  3. Use the same CLI pattern (no initial_program argument needed)

See the implementation in:

  • skydiscover/benchmarks/base.py - Base resolver interface
  • benchmarks/kernelbench/resolver.py - KernelBench example implementation

Adding a Benchmark

Option 1: Containerized evaluator (recommended)

Containerized evaluators run inside Docker, so they can have arbitrary dependencies, system packages, data files, etc. without polluting the host. Only two files are required: Dockerfile and evaluate.sh.

evaluate.sh

The entrypoint that SkyDiscover calls. It receives two arguments:

#!/usr/bin/env bash
set -euo pipefail

PROGRAM="$1"   # Path to the candidate solution inside the container
MODE="$2"      # "train" (fast, iterative) or "test" (authoritative, final)

python /benchmark/evaluator.py "$PROGRAM"
  • train mode is called during the optimization loop and should be relatively fast.
  • test mode is called once at the end for the best solution and should be the full, authoritative evaluation.

Evaluators that don't need the distinction can ignore $MODE.
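For example, a mode-aware evaluator might run a small case set in train mode and the full suite in test mode. The sketch below assumes evaluate.sh forwards "$MODE" as a second argument and uses a made-up task (solve(x) must return x*x); none of these names are part of the SkyDiscover API:

import importlib.util
import json
import sys

def load_candidate(program_path: str):
    # Import the candidate solution as a module.
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def evaluate(program_path: str, mode: str) -> dict:
    candidate = load_candidate(program_path)
    # Small case set during the optimization loop, full suite for the final score.
    cases = range(20) if mode == "train" else range(2000)
    passed = sum(1 for x in cases if candidate.solve(x) == x * x)
    score = passed / len(cases)
    return {"status": "success", "combined_score": score,
            "metrics": {"combined_score": score}}

if __name__ == "__main__":
    mode = sys.argv[2] if len(sys.argv) > 2 else "train"
    print(json.dumps(evaluate(sys.argv[1], mode)))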

evaluate.sh output (JSON protocol)

evaluate.sh must write a single JSON object to stdout:

{
  "status": "success",
  "combined_score": 0.73,
  "metrics": {"combined_score": 0.73, "accuracy": 0.85, "speed": 1.2},
  "artifacts": {"error": "...", "details": "..."}
}
  • combined_score (float, required): the primary optimization target.
  • metrics (dict of string → float): all numeric scores. Must include combined_score.
  • artifacts (dict of string → string, optional): non-numeric context (errors, diagnostics).
  • status: "success", "error", or "timeout".

Any output to stderr is captured for debugging but does not affect scoring. If your evaluator prints debug output, make sure it goes to stderr, not stdout.
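The skeleton below illustrates that discipline, assuming a Python evaluator.py is the last step of evaluate.sh; run_benchmark here is a placeholder for real scoring logic:

import json
import sys
import traceback

def run_benchmark(program_path: str) -> float:
    return 0.73  # placeholder; replace with real scoring logic

def main(program_path: str) -> None:
    print(f"evaluating {program_path}", file=sys.stderr)  # debug output goes to stderr
    try:
        score = run_benchmark(program_path)
        result = {"status": "success", "combined_score": score,
                  "metrics": {"combined_score": score}}
    except Exception as exc:
        result = {"status": "error", "combined_score": 0.0,
                  "metrics": {"combined_score": 0.0},
                  "artifacts": {"error": str(exc), "details": traceback.format_exc()}}
    print(json.dumps(result))  # the single JSON object on stdout

if __name__ == "__main__":
    main(sys.argv[1])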

Dockerfile

A standard Dockerfile. The only requirement is that evaluate.sh is executable:

FROM python:3.12-slim
WORKDIR /benchmark

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh

ENTRYPOINT ["./evaluate.sh"]

Migrating an existing Python evaluator

If you have an existing evaluate(program_path) -> dict function, you can wrap it with the backwards-compatibility wrapper:

  1. Copy skydiscover/evaluation/wrapper.py into your evaluator/ directory.
  2. Add this to the bottom of your evaluator.py:
if __name__ == "__main__":
    from wrapper import run
    run(evaluate)

The wrapper handles stdout redirection (so debug prints don't corrupt JSON), error formatting, and metric/artifact separation.
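For intuition, the wrapper behaves roughly like the sketch below. This is an approximation; the real implementation is skydiscover/evaluation/wrapper.py, and its exact metric/artifact rules may differ:

import contextlib
import io
import json
import sys

def run(evaluate_fn):
    program_path = sys.argv[1]
    debug = io.StringIO()
    try:
        # Redirect stdout so stray prints inside evaluate() cannot corrupt the JSON.
        with contextlib.redirect_stdout(debug):
            result = evaluate_fn(program_path)
        metrics = {k: v for k, v in result.items() if isinstance(v, (int, float))}
        artifacts = {k: str(v) for k, v in result.items() if k not in metrics}
        payload = {"status": "success",
                   "combined_score": metrics.get("combined_score", 0.0),
                   "metrics": metrics, "artifacts": artifacts}
    except Exception as exc:
        payload = {"status": "error", "combined_score": 0.0,
                   "metrics": {"combined_score": 0.0},
                   "artifacts": {"error": str(exc)}}
    sys.stderr.write(debug.getvalue())   # captured debug output, for logs only
    print(json.dumps(payload))           # the only stdout output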

Running a containerized benchmark

Point evaluation_file at the evaluator/ directory:

skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
  benchmarks/math/circle_packing_rect/evaluator \
  -c benchmarks/math/circle_packing_rect/config.yaml \
  -s best_of_n -i 50

SkyDiscover will automatically build the Docker image, start a persistent container, and run evaluations inside it.

Example to copy

Simple containerized benchmark: math/heilbronn_triangle/

Option 2: Harbor tasks (external benchmarks)

SkyDiscover natively supports Harbor-format tasks. This lets you run external benchmark suites like AlgoTune (154 algorithm optimization tasks) without any conversion.

A Harbor task directory looks like this:

task_dir/
├── task.toml              # Metadata, timeouts
├── instruction.md         # Problem description (shown to the LLM as context)
├── environment/
│   └── Dockerfile         # Container image definition
├── tests/
│   ├── test.sh            # Verification entrypoint
│   └── ...                # Supporting test files (evaluator.py, test data, etc.)
└── solution/              # Reference solution (optional, not shown to LLM)
    └── solve.sh

SkyDiscover auto-detects Harbor tasks when the directory contains instruction.md, tests/, and environment/Dockerfile. The instruction.md is used as LLM context, solutions are injected at the path extracted from solution/solve.sh (or instruction.md as fallback), and rewards are read from /logs/verifier/reward.txt or reward.json.
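Reward parsing can be pictured as below. This is a hedged sketch: it assumes reward.txt holds a bare number and guesses a "reward" key for the JSON form, so check the actual verifier output before relying on it:

import json
from pathlib import Path

def read_reward(log_dir: str = "/logs/verifier") -> float:
    txt = Path(log_dir) / "reward.txt"
    if txt.exists():
        return float(txt.read_text().strip())
    data = json.loads((Path(log_dir) / "reward.json").read_text())
    return float(data["reward"])  # assumed key name; adjust to the real schema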

Tested Harbor datasets

SkyDiscover has been tested with the following Harbor registry benchmarks:

| Dataset | Tasks | Domain | Language | Install |
|---|---|---|---|---|
| algotune | 154 | Algorithm optimization (speedup scoring) | Python | harbor datasets download algotune@1.0 |
| evoeval | 100 | Code generation (evolved from HumanEval) | Python | harbor datasets download evoeval@1.0 |
| humanevalfix | 164 | Code repair (fix buggy functions) | Python | harbor datasets download humanevalfix@1.0 |
| bigcodebench-hard-complete | 145 | Python programming (reward-based) | Python | harbor datasets download bigcodebench-hard-complete@1.0.0 |
| livecodebench | 100 | Competitive programming (stdin/stdout) | Python | harbor datasets download livecodebench@6.0 |
| codepde | 5 | Scientific computing (PDE solvers) | Python | harbor datasets download codepde@1.0 |
| crustbench | 100 | C-to-safe-Rust transpilation | Rust | harbor datasets download crustbench@1.0 |
| usaco | 304 | Competition programming (USACO) | Python | harbor datasets download usaco@2.0 |

Any Harbor-compatible dataset should work; the evaluator automatically extracts the solution path from the task's solution/solve.sh script.

Running a Harbor task

  1. Install the Harbor CLI and download a dataset:
pip install harbor
harbor datasets download algotune@1.0 -o /tmp/algotune

This downloads all 154 AlgoTune tasks. Each task is in a subdirectory like /tmp/algotune/<id>/algotune-<name>/.

  2. Run SkyDiscover, pointing at the task directory. The LLM uses instruction.md as context and generates solutions from scratch:
# AlgoTune (algorithm optimization)
TASK=/tmp/algotune/2HHbpvzVPo2qakaoGyAVS2/algotune-set-cover
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 10

# EvoEval (code generation)
harbor datasets download evoeval@1.0 -o /tmp/evoeval
TASK=/tmp/evoeval/<id>/<task-name>
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5

# HumanEvalFix (code repair)
harbor datasets download humanevalfix@1.0 -o /tmp/humanevalfix
TASK=/tmp/humanevalfix/<id>/<task-name>
skydiscover-run "$TASK" --model anthropic/claude-sonnet-4-6 -s best_of_n -i 5

SkyDiscover will build the Docker image from environment/Dockerfile, upload tests/ into the container, and start optimizing.

Note: Some datasets have heavy Dockerfiles. AlgoTune needs ~10GB disk and 16GB RAM (torch, jax, scipy). BigCodeBench installs R, GDAL, and many system packages. First builds are slow; subsequent runs use Docker cache.

Other Harbor datasets

Any Harbor-compatible dataset works the same way. Run harbor datasets list to see all available datasets, then harbor datasets download <name> to fetch them.

Option 3: Plain Python evaluator

For simple tasks with no system dependencies, you can use a plain Python evaluator that runs on the host.

Evaluator (evaluator.py) scores whatever the LLM produces:

def evaluate(program_path: str) -> dict:
    # load and run the program, compute a score
    return {"combined_score": 0.73, ...}  # combined_score is required

program_path is a .py file for code tasks or a .txt file for prompt tasks. On failure, return {"combined_score": 0.0, "error": "..."} instead of raising.
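A minimal end-to-end sketch for a made-up task (the candidate's solve() must sort a list); only the shape of the returned dict follows the contract above:

import importlib.util
import random

def evaluate(program_path: str) -> dict:
    try:
        # Import the candidate program and exercise its solve() function.
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        candidate = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(candidate)

        rng = random.Random(0)
        trials = [rng.sample(range(1000), 50) for _ in range(100)]
        passed = sum(1 for t in trials if candidate.solve(list(t)) == sorted(t))
        score = passed / len(trials)
        return {"combined_score": score, "accuracy": score}
    except Exception as exc:
        # Report failure as a zero score instead of raising.
        return {"combined_score": 0.0, "error": str(exc)}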

Seed program

Seed (initial_program.py or initial_prompt.txt) is the starting solution. Mark the region for the LLM to evolve:

# EVOLVE-BLOCK-START
def solve(input_data):
    return input_data  # LLM will improve this
# EVOLVE-BLOCK-END
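
A slightly fuller seed for the same made-up sorting task; everything outside the markers is a fixed harness, and only the marked region is rewritten during search (names here are illustrative):

# EVOLVE-BLOCK-START
def solve(items):
    # Naive starting point; the search is expected to replace this.
    return list(items)
# EVOLVE-BLOCK-END

if __name__ == "__main__":
    # Fixed smoke test, outside the evolve block.
    print(solve([3, 1, 2]))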

For prompt optimization, use a plain .txt file with no markers.

Config

Config (config.yaml) sets the system prompt and search settings. For prompt optimization, set language: text and diff_based_generation: false.

Simple prompt example to copy: prompt_optimization/hotpot_qa/