ALE-Bench: AtCoder Heuristic Contest Benchmark

10 problems from AtCoder Heuristic Contests (AHC), evaluated via the ale_bench package. Programs are written in C++ and scored on 50 public test cases during evolution. A separate private evaluator runs the full hidden test set for final ranking.

Problems

Problem  Description
ahc008   Pet partitioning: place walls to create pet-free areas on a 30×30 grid over 300 turns
ahc011   AtCoder Heuristic Contest 11
ahc015   AtCoder Heuristic Contest 15
ahc016   AtCoder Heuristic Contest 16
ahc024   AtCoder Heuristic Contest 24
ahc025   Balance weighing: use a balance scale to divide N items into D equal-weight sets using Q queries
ahc026   AtCoder Heuristic Contest 26
ahc027   AtCoder Heuristic Contest 27
ahc039   AtCoder Heuristic Contest 39
ahc046   AtCoder Heuristic Contest 46

Quick Start

Run evolution on a single problem:

uv run skydiscover-run \
  benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
  benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
  -c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
  --search evox \
  -i 100
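
To repeat this run for every problem in the table, the command can be generated per problem ID. The following is a minimal sketch, assuming it is launched from the repository root. The helper build_run_command and the PROBLEM_IDS list are illustrative names, not part of the benchmark; only the paths and flags from the command above are used.

```python
import shlex

# Problem IDs taken from the Problems table.
PROBLEM_IDS = [
    "ahc008", "ahc011", "ahc015", "ahc016", "ahc024",
    "ahc025", "ahc026", "ahc027", "ahc039", "ahc046",
]

# Assumed location of the per-problem folders, relative to the repository root.
PROBLEMS_DIR = "benchmarks/ale_bench/ale-bench-lite-problems"


def build_run_command(problem_id, search="evox", iterations=100):
    """Return the skydiscover-run argument list for one problem (hypothetical helper)."""
    base = f"{PROBLEMS_DIR}/{problem_id}"
    return [
        "uv", "run", "skydiscover-run",
        f"{base}/initial_program.cpp",
        f"{base}/evaluator.py",
        "-c", f"{base}/config.yaml",
        "--search", search,
        "-i", str(iterations),
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per problem for inspection.
    for pid in PROBLEM_IDS:
        print(shlex.join(build_run_command(pid)))
```

Each returned list can be passed directly to subprocess.run to execute the run instead of printing it.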

Scoring

During evolution, each iteration runs 50 public test cases:

combined_score = overall_absolute_score * optim_factor / num_public_cases

optim_factor is +1 for maximization problems and -1 for minimization problems, so combined_score is always higher-is-better.
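
The formula can be written as a small function. This is a sketch rather than the evaluator's actual code: it assumes overall_absolute_score is the total over all public cases, and the maximize flag is a hypothetical stand-in for however the evaluator determines the problem's direction.

```python
def combined_score(overall_absolute_score, maximize, num_public_cases=50):
    """Return a higher-is-better score averaged over the public cases."""
    # +1 keeps the sign for maximization problems, -1 flips it for minimization.
    optim_factor = 1 if maximize else -1
    return overall_absolute_score * optim_factor / num_public_cases


# Illustrative totals only, not real benchmark results.
print(combined_score(5_000_000, maximize=True))   # 100000.0
print(combined_score(5_000_000, maximize=False))  # -100000.0
```

Because of the sign flip, a minimization program with a smaller raw total gets a larger (less negative) combined_score, so the search can always maximize.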

Private Evaluation

After evolution, evaluate the best program on the full private test set:

python benchmarks/ale_bench/private_eval.py \
  --program-path path/to/best_program.cpp \
  --problem-id ahc025

This runs 3 independent evaluations and reports the average private rank, performance score, and per-case pass/fail counts.
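
The same command can be built for several programs at once, for example the reference solutions. The sketch below assumes the repository root as the working directory and that the reference solutions live under benchmarks/ale_bench/ale_agent_best/; build_private_eval_command is a hypothetical helper, and only the two flags shown above are used.

```python
import shlex


def build_private_eval_command(program_path, problem_id):
    """Return the private_eval.py argument list for one program (hypothetical helper)."""
    return [
        "python", "benchmarks/ale_bench/private_eval.py",
        "--program-path", program_path,
        "--problem-id", problem_id,
    ]


if __name__ == "__main__":
    # Assumed path layout for the reference solutions; adjust to your checkout.
    for pid in ("ahc008", "ahc025"):
        reference = f"benchmarks/ale_bench/ale_agent_best/{pid}.cpp"
        print(shlex.join(build_private_eval_command(reference, pid)))
```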

Directory Structure

ale_bench/
├── ale-bench-lite-problems/
│   └── ahcXXX/
│       ├── initial_program.cpp   # Starting C++ solution
│       ├── evaluator.py          # Runs 50 public cases via ale_bench
│       └── config.yaml           # Search config (cpp, diff-based, 100 iterations)
├── ale_agent_best/
│   └── ahcXXX.cpp               # Best known solutions (reference)
└── private_eval.py              # Full private set evaluation + ranking

Requirements

Requires the ale_bench and ale_bench_eval packages. These are not installed by the default uv sync; install them separately, following the ALE-Bench documentation.

Config Defaults

All problems share the same base config:

language: cpp
diff_based_evolution: true
max_iterations: 100
max_solution_length: 60000
evaluator:
  timeout: 10000