ALE-Bench: AtCoder Heuristic Contest Benchmark

10 problems from AtCoder Heuristic Contests (AHC), evaluated via the ale_bench package. Programs are written in C++ and scored on 50 public test cases during evolution. A separate private evaluator runs the full hidden test set for final ranking.

Problems

Problem	Description
`ahc008`	Pet partitioning — place walls to create pet-free areas on a 30×30 grid over 300 turns
`ahc011`	AtCoder Heuristic Contest 11
`ahc015`	AtCoder Heuristic Contest 15
`ahc016`	AtCoder Heuristic Contest 16
`ahc024`	AtCoder Heuristic Contest 24
`ahc025`	Balance weighing — use a balance scale to divide N items into D equal-weight sets using Q queries
`ahc026`	AtCoder Heuristic Contest 26
`ahc027`	AtCoder Heuristic Contest 27
`ahc039`	AtCoder Heuristic Contest 39
`ahc046`	AtCoder Heuristic Contest 46

Quick Start

Run evolution on a single problem:

uv run skydiscover-run \
  benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
  benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
  -c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
  --search evox \
  -i 100

Scoring

During evolution, each iteration runs 50 public test cases:

combined_score = overall_absolute_score * optim_factor / num_public_cases

optim_factor is +1 for maximize problems and -1 for minimize problems (so combined_score is always higher-is-better).

Private Evaluation

After evolution, evaluate the best program on the full private test set:

python benchmarks/ale_bench/private_eval.py \
  --program-path path/to/best_program.cpp \
  --problem-id ahc025

This runs 3 independent evaluations and reports the average private rank, performance score, and per-case pass/fail counts.

Directory Structure

ale_bench/
├── ale-bench-lite-problems/
│   └── ahcXXX/
│       ├── initial_program.cpp   # Starting C++ solution
│       ├── evaluator.py          # Runs 50 public cases via ale_bench
│       └── config.yaml           # Search config (cpp, diff-based, 100 iterations)
├── ale_agent_best/
│   └── ahcXXX.cpp               # Best known solutions (reference)
└── private_eval.py              # Full private set evaluation + ranking

Requirements

Requires the ale_bench and ale_bench_eval packages. These are not in the default uv sync — install them separately per the ALE-Bench documentation.

Config Defaults

All problems share the same base config:

language: cpp
diff_based_evolution: true
max_iterations: 100
max_solution_length: 60000
evaluator:
  timeout: 10000