ARC Benchmark
Evolves solutions to ARC-AGI visual reasoning tasks using SkyDiscover.
Setup
1. Download ARC data
Clone the ARC-AGI-2 repo and convert the data:
cd benchmarks/arc_benchmark
git clone https://github.com/arcprize/ARC-AGI-2.git /tmp/ARC-AGI-2
OUT_DIR=./data uv run python convert_arc_agi2_data.py /tmp/ARC-AGI-2
rm -rf /tmp/ARC-AGI-2
This creates 4 files in data/:
- arc-agi_training_challenges.json (1000 tasks)
- arc-agi_training_solutions.json
- arc-agi_evaluation_challenges.json (120 tasks)
- arc-agi_evaluation_solutions.json
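As a quick sanity check on the converted data, the sketch below counts the training pairs and test inputs per task. It assumes the common ARC layout, in which the challenges JSON maps each task ID to train and test lists of grids (lists of lists of integers). The converter's actual output may differ, and summarize_challenges and load_challenges are hypothetical helpers, not part of the benchmark.

```python
import json
from pathlib import Path


def summarize_challenges(challenges: dict) -> dict:
    """Return {task_id: (num_train_pairs, num_test_inputs)}.

    Assumes each task looks like {"train": [...], "test": [...]}; this layout
    is an assumption about the converted files, not confirmed by this README.
    """
    return {
        task_id: (len(task.get("train", [])), len(task.get("test", [])))
        for task_id, task in challenges.items()
    }


def load_challenges(path: str) -> dict:
    """Load a challenges JSON file from disk."""
    return json.loads(Path(path).read_text())


# Tiny in-memory stand-in for data/arc-agi_training_challenges.json.
example = {
    "task_a": {
        "train": [{"input": [[0, 1]], "output": [[1, 0]]}],
        "test": [{"input": [[1, 1]]}],
    }
}
summary = summarize_challenges(example)
print(summary)
```

To check the real files, pass the result of load_challenges("data/arc-agi_training_challenges.json") to summarize_challenges instead of the in-memory example.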
2. Set your API key
export OPENAI_API_KEY=...
Run a single task
ARC requires a per-task config (each task has unique training examples as the prompt). Use generate_config.py to create one, then run with any search backend:
cd benchmarks/arc_benchmark
# Generate task-specific config
TASK_NUM=0 ARC_TASK_FILE=training CONFIG_OUT=./config_task_0.yaml \
uv run python generate_config.py
# Run with any search backend (for example evox, openevolve, or gepa);
# replace [your_algorithm] with the backend name
uv run skydiscover-run initial_program.py evaluator.py \
-c config_task_0.yaml -s [your_algorithm] -i 30
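To illustrate why the config is per-task, here is a minimal sketch of how a task's training examples could be rendered into prompt text. It does not reproduce generate_config.py: the helper names, the "Example N input/output" wording, and the space-separated grid rendering are all assumptions made for illustration.

```python
def grid_to_text(grid: list[list[int]]) -> str:
    """Render a grid as rows of space-separated integers (assumed format)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_task_prompt(train_pairs: list[dict]) -> str:
    """Concatenate all training pairs of one task into a single prompt string."""
    parts = []
    for i, pair in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    return "\n\n".join(parts)


# Hypothetical task with a single training pair.
pairs = [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}]
prompt = build_task_prompt(pairs)
print(prompt)
```

Because the prompt text depends on the task's own examples, a config generated for one task cannot be reused for another.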
Run all evaluation tasks
cd benchmarks/arc_benchmark
export ARC_TASK_FILE=evaluation
NUM_TASKS=$(uv run python -c "import json; print(len(json.load(open('data/arc-agi_evaluation_challenges.json'))))")
for i in $(seq 0 $((NUM_TASKS - 1))); do
TASK_NUM=$i CONFIG_OUT=./config_task_${i}.yaml uv run python generate_config.py
TASK_NUM=$i uv run skydiscover-run initial_program.py evaluator.py \
-c config_task_${i}.yaml -s [your_algorithm] -i 30 \
-o outputs/eval_task_${i}
done
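The same loop can be driven from Python, which is convenient for adding logging or retries. The sketch below only uses the environment variables and flags shown in the shell loop above. The function names and the "my_algorithm" placeholder are hypothetical, and nothing is executed until run_task is called.

```python
import os
import subprocess


def commands_for_task(task_num: int, algorithm: str, iterations: int = 30):
    """Return (extra_env, argv) pairs mirroring the two shell steps per task."""
    config = f"./config_task_{task_num}.yaml"
    generate = (
        {"TASK_NUM": str(task_num), "CONFIG_OUT": config},
        ["uv", "run", "python", "generate_config.py"],
    )
    discover = (
        {"TASK_NUM": str(task_num)},
        ["uv", "run", "skydiscover-run", "initial_program.py", "evaluator.py",
         "-c", config, "-s", algorithm, "-i", str(iterations),
         "-o", f"outputs/eval_task_{task_num}"],
    )
    return [generate, discover]


def run_task(task_num: int, algorithm: str) -> None:
    """Run both steps for one task, stopping at the first failure."""
    base_env = {**os.environ, "ARC_TASK_FILE": "evaluation"}
    for extra_env, argv in commands_for_task(task_num, algorithm):
        subprocess.run(argv, env={**base_env, **extra_env}, check=True)


# Build (but do not run) the commands for task 3.
steps = commands_for_task(3, "my_algorithm")
print(steps[1][1])
```

Unlike the shell loop, run_task uses check=True, so a failed config generation raises an error instead of continuing to the discovery step. That is a design choice of this sketch, not documented behavior of the benchmark.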
Post-discovery test evaluation
After the discovery process, evaluate the best program on held-out test inputs:
TASK_NUM=0 ARC_TASK_FILE=evaluation \
OUTS_DIR=./outputs/eval_task_0/adaevolve \
uv run python post_discovery_eval.py
Config: GPT vs Gemini
Edit config.yaml: comment out the GPT block and uncomment the Gemini block, or override with --model:
uv run skydiscover-run ... -m gemini/gemini-3-pro-preview
Files
| File | Description |
|---|---|
| initial_program.py | Seed program with two transform functions to evolve |
| evaluator.py | Scores programs on pass@2 + cell accuracy |
| config.yaml | Base config template (prompt injected by generate_config.py) |
| generate_config.py | Injects task-specific training examples into config as system prompt |
| post_discovery_eval.py | Evaluates best program on held-out test inputs |
| convert_arc_agi2_data.py | Converts raw ARC-AGI-2 data to benchmark format |
| requirements.txt | Dependencies (numpy) |
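For readers unfamiliar with the two metrics named for evaluator.py, the following sketch shows one plausible reading of them: pass@2 succeeds if either of two attempts matches the target grid exactly, and cell accuracy is the fraction of matching cells. How evaluator.py actually combines them, and how it treats shape mismatches (scored as 0.0 here), are assumptions of this sketch.

```python
Grid = list[list[int]]


def cell_accuracy(pred: Grid, target: Grid) -> float:
    """Fraction of matching cells; 0.0 when the shapes differ (assumed rule)."""
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        return 0.0
    total = sum(len(row) for row in target)
    if total == 0:
        return 0.0
    matches = sum(
        p == t
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    )
    return matches / total


def pass_at_2(attempts: list[Grid], target: Grid) -> bool:
    """True if either of the first two attempts equals the target exactly."""
    return any(attempt == target for attempt in attempts[:2])


target = [[1, 0], [0, 1]]
attempts = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
print(cell_accuracy(attempts[0], target))  # 3 of 4 cells match
print(pass_at_2(attempts, target))         # the second attempt is exact
```

Cell accuracy gives partial credit to near-miss programs, which gives the search a smoother signal than the all-or-nothing pass@2 result alone.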
Environment variables
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | API key |
| ARC_TASK_FILE | training | training or evaluation |
| TASK_NUM | 0 | Task index within the dataset |
| BASE_CONFIG | ./config.yaml | Base config template path |
| CONFIG_OUT | ./config_task_{N}.yaml | Output path for generated config |
| DATA_ROOT | ./data | Path to ARC data directory |
| MAX_ITERATIONS | (from config) | Override max_iterations at runtime |
| ARC_EVAL_INCLUDE_TEST | 0 | Set to 1 to also run the held-out test inputs during evolution |
| ARC_EVAL_USE_TEST_FOR_SCORE | 0 | Set to 1 to average train and test scores into combined_score (only used when ARC_EVAL_INCLUDE_TEST=1) |
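The interaction between the last two variables can be sketched as follows. This is an illustration of the behavior described in the table, not the evaluator's code: read_flags and combined_score are hypothetical helpers, and returning the train score alone when the test score is not used is an assumption.

```python
import os
from typing import Mapping, Optional


def read_flags(env: Mapping[str, str] = os.environ) -> tuple[bool, bool]:
    """Return (include_test, use_test_for_score) from the environment.

    ARC_EVAL_USE_TEST_FOR_SCORE only takes effect when
    ARC_EVAL_INCLUDE_TEST=1, as stated in the table above.
    """
    include_test = env.get("ARC_EVAL_INCLUDE_TEST", "0") == "1"
    use_test = include_test and env.get("ARC_EVAL_USE_TEST_FOR_SCORE", "0") == "1"
    return include_test, use_test


def combined_score(
    train_score: float, test_score: Optional[float], use_test: bool
) -> float:
    """Average train and test scores when enabled; otherwise the train score
    alone (the fallback is an assumption of this sketch)."""
    if use_test and test_score is not None:
        return (train_score + test_score) / 2
    return train_score


flags = read_flags(
    {"ARC_EVAL_INCLUDE_TEST": "1", "ARC_EVAL_USE_TEST_FOR_SCORE": "1"}
)
print(flags, combined_score(1.0, 0.5, use_test=flags[1]))
```

Both variables default to 0, so held-out test outputs do not influence the score that guides evolution unless explicitly enabled.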