JustinTX
/

sky2

Model card Files Files and versions

sky2 / benchmarks /ale_bench /README.md

JustinTX's picture

Add files using upload-large-folder tool

517cbd2 verified about 2 months ago

|

history blame contribute delete

2.79 kB

	# ALE-Bench: AtCoder Heuristic Contest Benchmark

	10 problems from AtCoder Heuristic Contests (AHC), evaluated via the `ale_bench` package. Programs are written in C++ and scored on 50 public test cases during evolution. A separate private evaluator runs the full hidden test set for final ranking.

	## Problems

	\| Problem \| Description \|
	\|---------\|-------------\|
	\| `ahc008` \| Pet partitioning — place walls to create pet-free areas on a 30×30 grid over 300 turns \|
	\| `ahc011` \| AtCoder Heuristic Contest 11 \|
	\| `ahc015` \| AtCoder Heuristic Contest 15 \|
	\| `ahc016` \| AtCoder Heuristic Contest 16 \|
	\| `ahc024` \| AtCoder Heuristic Contest 24 \|
	\| `ahc025` \| Balance weighing — use a balance scale to divide N items into D equal-weight sets using Q queries \|
	\| `ahc026` \| AtCoder Heuristic Contest 26 \|
	\| `ahc027` \| AtCoder Heuristic Contest 27 \|
	\| `ahc039` \| AtCoder Heuristic Contest 39 \|
	\| `ahc046` \| AtCoder Heuristic Contest 46 \|

	## Quick Start

	Run evolution on a single problem:

	```bash
	uv run skydiscover-run \
	benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
	benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
	-c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
	--search evox \
	-i 100
	```

	## Scoring

	During evolution, each iteration runs 50 public test cases:

	```
	combined_score = overall_absolute_score * optim_factor / num_public_cases
	```

	`optim_factor` is `+1` for maximize problems and `-1` for minimize problems (so `combined_score` is always higher-is-better).

	## Private Evaluation

	After evolution, evaluate the best program on the full private test set:

	```bash
	python benchmarks/ale_bench/private_eval.py \
	--program-path path/to/best_program.cpp \
	--problem-id ahc025
	```

	This runs 3 independent evaluations and reports the average private rank, performance score, and per-case pass/fail counts.

	## Directory Structure

	```
	ale_bench/
	├── ale-bench-lite-problems/
	│ └── ahcXXX/
	│ ├── initial_program.cpp # Starting C++ solution
	│ ├── evaluator.py # Runs 50 public cases via ale_bench
	│ └── config.yaml # Search config (cpp, diff-based, 100 iterations)
	├── ale_agent_best/
	│ └── ahcXXX.cpp # Best known solutions (reference)
	└── private_eval.py # Full private set evaluation + ranking
	```

	## Requirements

	Requires the `ale_bench` and `ale_bench_eval` packages. These are not in the default `uv sync` — install them separately per the ALE-Bench documentation.

	## Config Defaults

	All problems share the same base config:

	```yaml
	language: cpp
	diff_based_evolution: true
	max_iterations: 100
	max_solution_length: 60000
	evaluator:
	timeout: 10000
	```