| # ARC Benchmark |
|
|
| Evolves ARC-AGI visual reasoning task solutions using SkyDiscover. |
|
|
| ## Setup |
|
|
| ### 1. Download ARC data |
|
|
| Clone the ARC-AGI-2 repo and convert the data: |
|
|
| ```bash |
| cd benchmarks/arc_benchmark |
| git clone https://github.com/arcprize/ARC-AGI-2.git /tmp/ARC-AGI-2 |
| OUT_DIR=./data uv run python convert_arc_agi2_data.py /tmp/ARC-AGI-2 |
| rm -rf /tmp/ARC-AGI-2 |
| ``` |
|
|
| This creates 4 files in `data/`: |
| - `arc-agi_training_challenges.json` (1000 tasks) |
| - `arc-agi_training_solutions.json` |
| - `arc-agi_evaluation_challenges.json` (120 tasks) |
| - `arc-agi_evaluation_solutions.json` |
|
|
| ### 2. Set your API key |
|
|
| ```bash |
| export OPENAI_API_KEY=... |
| ``` |
|
|
| ## Run a single task |
|
|
| ARC requires a per-task config (each task has unique training examples as the prompt). Use `generate_config.py` to create one, then run with any search backend: |
|
|
| ```bash |
| cd benchmarks/arc_benchmark |
| |
| # Generate task-specific config |
| TASK_NUM=0 ARC_TASK_FILE=training CONFIG_OUT=./config_task_0.yaml \ |
| uv run python generate_config.py |
| |
| # Run with any backend |
| uv run skydiscover-run initial_program.py evaluator.py \ |
| -c config_task_0.yaml -s [your_algorithm] -i 30 |
| |
| # Or with evox, openevolve, gepa: |
| uv run skydiscover-run initial_program.py evaluator.py \ |
| -c config_task_0.yaml -s [your_algorithm] -i 30 |
| ``` |
|
|
| ## Run all evaluation tasks |
|
|
| ```bash |
| cd benchmarks/arc_benchmark |
| export ARC_TASK_FILE=evaluation |
| |
| NUM_TASKS=$(uv run python -c "import json; print(len(json.load(open('data/arc-agi_evaluation_challenges.json'))))") |
| |
| for i in $(seq 0 $((NUM_TASKS - 1))); do |
| TASK_NUM=$i CONFIG_OUT=./config_task_${i}.yaml uv run python generate_config.py |
| TASK_NUM=$i uv run skydiscover-run initial_program.py evaluator.py \ |
| -c config_task_${i}.yaml -s [your_algorithm] -i 30 \ |
| -o outputs/eval_task_${i} |
| done |
| ``` |
|
|
| ## Post-discovery test evaluation |
|
|
| After the discovery process, evaluate the best program on held-out test inputs: |
|
|
| ```bash |
| TASK_NUM=0 ARC_TASK_FILE=evaluation \ |
| OUTS_DIR=./outputs/eval_task_0/adaevolve \ |
| uv run python post_discovery_eval.py |
| ``` |
|
|
| ## Config: GPT vs Gemini |
|
|
| Edit `config.yaml` — comment the GPT block and uncomment the Gemini block, or override with `--model`: |
|
|
| ```bash |
| uv run skydiscover-run ... -m gemini/gemini-3-pro-preview |
| ``` |
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `initial_program.py` | Seed program with two transform functions to evolve | |
| | `evaluator.py` | Scores programs on pass@2 + cell accuracy | |
| | `config.yaml` | Base config template (prompt injected by generate_config.py) | |
| | `generate_config.py` | Injects task-specific training examples into config as system prompt | |
| | `post_discovery_eval.py` | Evaluates best program on held-out test inputs | |
| | `convert_arc_agi2_data.py` | Converts raw ARC-AGI-2 data to benchmark format | |
| | `requirements.txt` | Dependencies (numpy) | |
|
|
| ## Environment variables |
|
|
| | Variable | Default | Description | |
| |----------|---------|-------------| |
| | `OPENAI_API_KEY` | (required) | API key | |
| | `ARC_TASK_FILE` | `training` | `training` or `evaluation` | |
| | `TASK_NUM` | `0` | Task index within the dataset | |
| | `BASE_CONFIG` | `./config.yaml` | Base config template path | |
| | `CONFIG_OUT` | `./config_task_{N}.yaml` | Output path for generated config | |
| | `DATA_ROOT` | `./data` | Path to ARC data directory | |
| | `MAX_ITERATIONS` | (from config) | Override `max_iterations` at runtime | |
| | `ARC_EVAL_INCLUDE_TEST` | `0` | Set to `1` to also run the held-out test inputs during evolution | |
| | `ARC_EVAL_USE_TEST_FOR_SCORE` | `0` | Set to `1` to average train and test scores into `combined_score` (only used when `ARC_EVAL_INCLUDE_TEST=1`) | |
|
|