ALE-Bench: AtCoder Heuristic Contest Benchmark
10 problems from AtCoder Heuristic Contests (AHC), evaluated via the ale_bench package. Programs are written in C++ and scored on 50 public test cases during evolution. A separate private evaluator runs the full hidden test set for final ranking.
Problems
| Problem | Description |
|---|---|
| ahc008 | Pet partitioning: place walls to create pet-free areas on a 30×30 grid over 300 turns |
| ahc011 | AtCoder Heuristic Contest 11 |
| ahc015 | AtCoder Heuristic Contest 15 |
| ahc016 | AtCoder Heuristic Contest 16 |
| ahc024 | AtCoder Heuristic Contest 24 |
| ahc025 | Balance weighing: use a balance scale to divide N items into D equal-weight sets using Q queries |
| ahc026 | AtCoder Heuristic Contest 26 |
| ahc027 | AtCoder Heuristic Contest 27 |
| ahc039 | AtCoder Heuristic Contest 39 |
| ahc046 | AtCoder Heuristic Contest 46 |
Quick Start
Run evolution on a single problem:
uv run skydiscover-run \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
-c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
--search evox \
-i 100
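To run every problem in turn, the single-problem command can be generated per problem ID. The following is a minimal sketch, assuming each problem directory follows the same layout as ahc025; `build_command` and `run_all` are hypothetical helper names, not part of skydiscover or ale_bench.

```python
import subprocess
from pathlib import Path

# Problem IDs taken from the Problems table.
PROBLEMS = [
    "ahc008", "ahc011", "ahc015", "ahc016", "ahc024",
    "ahc025", "ahc026", "ahc027", "ahc039", "ahc046",
]

# Assumption: every problem lives under this directory, like ahc025 does.
ROOT = Path("benchmarks/ale_bench/ale-bench-lite-problems")


def build_command(problem_id: str, search: str = "evox", iterations: int = 100) -> list[str]:
    """Return the skydiscover-run argument list for one problem."""
    problem_dir = ROOT / problem_id
    return [
        "uv", "run", "skydiscover-run",
        (problem_dir / "initial_program.cpp").as_posix(),
        (problem_dir / "evaluator.py").as_posix(),
        "-c", (problem_dir / "config.yaml").as_posix(),
        "--search", search,
        "-i", str(iterations),
    ]


def run_all() -> None:
    """Run evolution on each problem sequentially."""
    for problem_id in PROBLEMS:
        # check=True stops the loop at the first failing run.
        subprocess.run(build_command(problem_id), check=True)
```

Calling `run_all()` from the repository root launches the ten runs one after another. Building the argument list separately from executing it makes it easy to print and inspect a command before starting a long run.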
Scoring
During evolution, each iteration runs 50 public test cases:
combined_score = overall_absolute_score * optim_factor / num_public_cases
optim_factor is +1 for maximize problems and -1 for minimize problems (so combined_score is always higher-is-better).
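The formula can be written out as a small function to make the sign convention concrete. This is an illustrative sketch, not the evaluator's actual code: the function signature is hypothetical, the score values are made up, and it assumes `overall_absolute_score` is the total over all public cases.

```python
def combined_score(overall_absolute_score: float, maximize: bool,
                   num_public_cases: int = 50) -> float:
    """Per-case score, sign-flipped for minimize problems so higher is better."""
    optim_factor = 1 if maximize else -1
    return overall_absolute_score * optim_factor / num_public_cases


# Made-up totals for illustration only.
print(combined_score(5_000_000, maximize=True))   # 100000.0
print(combined_score(5_000_000, maximize=False))  # -100000.0
print(combined_score(4_000_000, maximize=False))  # -80000.0
```

On a minimize problem, the lower raw total (4,000,000) yields the higher combined score (-80000.0 versus -100000.0), so the search can always maximize `combined_score` regardless of the problem's direction.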
Private Evaluation
After evolution, evaluate the best program on the full private test set:
python benchmarks/ale_bench/private_eval.py \
--program-path path/to/best_program.cpp \
--problem-id ahc025
This runs 3 independent evaluations and reports the average private rank, performance score, and per-case pass/fail counts.
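The averaging over runs can be pictured as follows. This is a sketch only: the dictionary keys `rank` and `performance` and the numbers are hypothetical, and the real output format of private_eval.py may differ.

```python
from statistics import mean


def average_runs(runs: list[dict]) -> dict:
    """Average rank and performance over independent evaluation runs."""
    return {
        "avg_rank": mean(run["rank"] for run in runs),
        "avg_performance": mean(run["performance"] for run in runs),
    }


# Hypothetical results from three independent runs.
runs = [
    {"rank": 10, "performance": 2100},
    {"rank": 20, "performance": 2000},
    {"rank": 30, "performance": 1900},
]
summary = average_runs(runs)
```

Averaging several independent runs smooths out run-to-run variation, which matters for heuristic solutions that rely on randomness or time limits.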
Directory Structure
ale_bench/
├── ale-bench-lite-problems/
│   └── ahcXXX/
│       ├── initial_program.cpp     # Starting C++ solution
│       ├── evaluator.py            # Runs 50 public cases via ale_bench
│       └── config.yaml             # Search config (cpp, diff-based, 100 iterations)
├── ale_agent_best/
│   └── ahcXXX.cpp                  # Best known solutions (reference)
└── private_eval.py                 # Full private set evaluation + ranking
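Before starting a run, it can help to confirm that a problem directory contains the three files listed in the tree. The helper below is a hypothetical convenience, not part of the benchmark; it only checks for the file names shown above.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_FILES = ("initial_program.cpp", "evaluator.py", "config.yaml")


def missing_files(problem_dir: Path) -> list[str]:
    """Return the expected files that are absent from problem_dir."""
    return [name for name in EXPECTED_FILES if not (problem_dir / name).is_file()]
```

For example, `missing_files(Path("benchmarks/ale_bench/ale-bench-lite-problems/ahc025"))` returns an empty list when the directory is complete.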
Requirements
Requires the ale_bench and ale_bench_eval packages. These are not included in the default uv sync; install them separately per the ALE-Bench documentation.
Config Defaults
All problems share the same base config:
language: cpp
diff_based_evolution: true
max_iterations: 100
max_solution_length: 60000
evaluator:
  timeout: 10000