# ALE-Bench: AtCoder Heuristic Contest Benchmark
10 problems from AtCoder Heuristic Contests (AHC), evaluated via the `ale_bench` package. Programs are written in C++ and scored on 50 public test cases during evolution. A separate private evaluator runs the full hidden test set for final ranking.
## Problems
| Problem | Description |
|---------|-------------|
| `ahc008` | Pet partitioning: place walls to create pet-free areas on a 30×30 grid over 300 turns |
| `ahc011` | AtCoder Heuristic Contest 11 |
| `ahc015` | AtCoder Heuristic Contest 15 |
| `ahc016` | AtCoder Heuristic Contest 16 |
| `ahc024` | AtCoder Heuristic Contest 24 |
| `ahc025` | Balance weighing: use a balance scale to divide N items into D equal-weight sets using Q queries |
| `ahc026` | AtCoder Heuristic Contest 26 |
| `ahc027` | AtCoder Heuristic Contest 27 |
| `ahc039` | AtCoder Heuristic Contest 39 |
| `ahc046` | AtCoder Heuristic Contest 46 |
## Quick Start
Run evolution on a single problem:
```bash
uv run skydiscover-run \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
-c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
--search evox \
-i 100
```
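To run the same command for every problem, the arguments can be generated per problem ID. The following is a minimal sketch, assuming each problem directory follows the layout shown under Directory Structure; `build_command` and `PROBLEM_IDS` are illustrative names, not part of the repository. It only prints the commands and does not execute them.

```python
import shlex
from pathlib import PurePosixPath

# Problem IDs copied from the Problems table.
PROBLEM_IDS = [
    "ahc008", "ahc011", "ahc015", "ahc016", "ahc024",
    "ahc025", "ahc026", "ahc027", "ahc039", "ahc046",
]

# Root of the per-problem directories, as used in the Quick Start command.
PROBLEMS_ROOT = PurePosixPath("benchmarks/ale_bench/ale-bench-lite-problems")


def build_command(problem_id, search="evox", iterations=100):
    """Return the Quick Start command for one problem as an argument list.

    Hypothetical helper: it mirrors the command shown above and assumes
    every problem directory contains the same three files.
    """
    problem_dir = PROBLEMS_ROOT / problem_id
    return [
        "uv", "run", "skydiscover-run",
        str(problem_dir / "initial_program.cpp"),
        str(problem_dir / "evaluator.py"),
        "-c", str(problem_dir / "config.yaml"),
        "--search", search,
        "-i", str(iterations),
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per problem.
    for problem_id in PROBLEM_IDS:
        print(shlex.join(build_command(problem_id)))
```

The printed lines can be pasted into a shell or passed to `subprocess.run` one at a time.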
## Scoring
During evolution, each iteration runs 50 public test cases:
```
combined_score = overall_absolute_score * optim_factor / num_public_cases
```
`optim_factor` is `+1` for maximize problems and `-1` for minimize problems (so `combined_score` is always higher-is-better).
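As a worked illustration of the formula, here is a minimal Python sketch. The function name, its parameters, and the input values are assumptions made for this example and are not taken from `evaluator.py`.

```python
def combined_score(overall_absolute_score, maximize, num_public_cases=50):
    """Compute combined_score as defined by the formula above.

    Illustrative helper, not the evaluator's actual code.
    `maximize` selects optim_factor: +1 for maximize problems,
    -1 for minimize problems.
    """
    if num_public_cases <= 0:
        raise ValueError("num_public_cases must be positive")
    optim_factor = 1 if maximize else -1
    return overall_absolute_score * optim_factor / num_public_cases


if __name__ == "__main__":
    # Made-up input: an absolute score of 5000 over the 50 public cases.
    print(combined_score(5000, maximize=True))   # 100.0
    print(combined_score(5000, maximize=False))  # -100.0
```

For a minimize problem the sign flip means a lower absolute score yields a higher (less negative) `combined_score`, so the search can always maximize.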
## Private Evaluation
After evolution, evaluate the best program on the full private test set:
```bash
python benchmarks/ale_bench/private_eval.py \
--program-path path/to/best_program.cpp \
--problem-id ahc025
```
This runs 3 independent evaluations and reports the average private rank, performance score, and per-case pass/fail counts.
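The averaging step can be sketched as follows. This is only an illustration: the `rank` and `performance` keys and the sample values are hypothetical and do not reflect the actual output format of `private_eval.py`.

```python
from statistics import mean


def summarize_runs(runs):
    """Average rank and performance over independent evaluation runs.

    `runs` is a list of dicts with hypothetical keys "rank" and
    "performance"; the real script may structure its results differently.
    """
    if not runs:
        raise ValueError("at least one run is required")
    return {
        "avg_rank": mean(run["rank"] for run in runs),
        "avg_performance": mean(run["performance"] for run in runs),
    }


if __name__ == "__main__":
    # Made-up values for three runs, for illustration only.
    example_runs = [
        {"rank": 120, "performance": 1900},
        {"rank": 130, "performance": 1850},
        {"rank": 110, "performance": 1950},
    ]
    print(summarize_runs(example_runs))
```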
## Directory Structure
```
ale_bench/
├── ale-bench-lite-problems/
│   └── ahcXXX/
│       ├── initial_program.cpp   # Starting C++ solution
│       ├── evaluator.py          # Runs 50 public cases via ale_bench
│       └── config.yaml           # Search config (cpp, diff-based, 100 iterations)
├── ale_agent_best/
│   └── ahcXXX.cpp                # Best known solutions (reference)
└── private_eval.py               # Full private set evaluation + ranking
```
## Requirements
Requires the `ale_bench` and `ale_bench_eval` packages. These are not installed by the default `uv sync`; install them separately per the ALE-Bench documentation.
## Config Defaults
All problems share the same base config:
```yaml
language: cpp
diff_based_evolution: true
max_iterations: 100
max_solution_length: 60000
evaluator:
timeout: 10000
```