
ARC Benchmark

Evolves solutions to ARC-AGI visual reasoning tasks using SkyDiscover.

Setup

1. Download ARC data

Clone the ARC-AGI-2 repo and convert the data:

cd benchmarks/arc_benchmark
git clone https://github.com/arcprize/ARC-AGI-2.git /tmp/ARC-AGI-2
OUT_DIR=./data uv run python convert_arc_agi2_data.py /tmp/ARC-AGI-2
rm -rf /tmp/ARC-AGI-2

This creates 4 files in data/:

  • arc-agi_training_challenges.json (1000 tasks)
  • arc-agi_training_solutions.json
  • arc-agi_evaluation_challenges.json (120 tasks)
  • arc-agi_evaluation_solutions.json
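To sanity-check the converted files, a minimal sketch like the one below can summarize a single task. It assumes the challenges file maps each task ID to a dict with a `train` list of input/output grid pairs and a `test` list of inputs, which is a common ARC JSON layout; the actual output of convert_arc_agi2_data.py may differ. `load_challenges` and `summarize_task` are hypothetical helpers, not part of the benchmark.

```python
import json
from pathlib import Path


def load_challenges(path):
    """Load a challenges file (assumed layout: {task_id: {"train": [...], "test": [...]}})."""
    return json.loads(Path(path).read_text())


def summarize_task(task):
    """Count train pairs and test inputs, and record each train input's (rows, cols)."""
    shapes = [(len(pair["input"]), len(pair["input"][0])) for pair in task["train"]]
    return {
        "num_train": len(task["train"]),
        "num_test": len(task["test"]),
        "train_input_shapes": shapes,
    }


if __name__ == "__main__":
    path = Path("data/arc-agi_training_challenges.json")
    if path.exists():
        challenges = load_challenges(path)
        first_id = next(iter(challenges))  # first task in file order
        print(len(challenges), "tasks;", first_id, summarize_task(challenges[first_id]))
```

If the layout assumption holds, the task count printed here should match the counts listed above.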

2. Set your API key

export OPENAI_API_KEY=...

Run a single task

ARC requires a per-task config, because each task's unique training examples make up the prompt. Use generate_config.py to create one, then run with any search backend:

cd benchmarks/arc_benchmark

# Generate task-specific config
TASK_NUM=0 ARC_TASK_FILE=training CONFIG_OUT=./config_task_0.yaml \
  uv run python generate_config.py

# Run with any backend: replace [your_algorithm] with, e.g., evox, openevolve, or gepa
uv run skydiscover-run initial_program.py evaluator.py \
  -c config_task_0.yaml -s [your_algorithm] -i 30
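To illustrate why the config is per-task, here is a rough sketch of how a task's training pairs could be rendered into prompt text. This is not the logic of generate_config.py; the text layout and the helper names `grid_to_text` and `build_task_prompt` are assumptions made for illustration only.

```python
def grid_to_text(grid):
    """Render a grid (list of rows of ints) as one line of space-separated cells per row."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_task_prompt(train_pairs):
    """Concatenate every training pair into a single prompt string (hypothetical format)."""
    parts = []
    for i, pair in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pairs = [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}]
    print(build_task_prompt(pairs))
```

Because every task has different training pairs, the resulting prompt differs per task, which is why a single shared config is not sufficient.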

Run all evaluation tasks

cd benchmarks/arc_benchmark
export ARC_TASK_FILE=evaluation

NUM_TASKS=$(uv run python -c "import json; print(len(json.load(open('data/arc-agi_evaluation_challenges.json'))))")

for i in $(seq 0 $((NUM_TASKS - 1))); do
  TASK_NUM=$i CONFIG_OUT=./config_task_${i}.yaml uv run python generate_config.py
  TASK_NUM=$i uv run skydiscover-run initial_program.py evaluator.py \
    -c config_task_${i}.yaml -s [your_algorithm] -i 30 \
    -o outputs/eval_task_${i}
done

Post-discovery test evaluation

After the discovery process, evaluate the best program on held-out test inputs:

TASK_NUM=0 ARC_TASK_FILE=evaluation \
  OUTS_DIR=./outputs/eval_task_0/adaevolve \
  uv run python post_discovery_eval.py

Config: GPT vs Gemini

Edit config.yaml: comment out the GPT block and uncomment the Gemini block, or override the model on the command line with --model (-m):

uv run skydiscover-run ... -m gemini/gemini-3-pro-preview

Files

| File | Description |
|------|-------------|
| initial_program.py | Seed program with two transform functions to evolve |
| evaluator.py | Scores programs on pass@2 + cell accuracy |
| config.yaml | Base config template (prompt injected by generate_config.py) |
| generate_config.py | Injects task-specific training examples into config as system prompt |
| post_discovery_eval.py | Evaluates best program on held-out test inputs |
| convert_arc_agi2_data.py | Converts raw ARC-AGI-2 data to benchmark format |
| requirements.txt | Dependencies (numpy) |
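The two metrics named for evaluator.py can be sketched as follows. This is an illustrative reading of "pass@2" and "cell accuracy", not the evaluator's actual code: it assumes cell accuracy is the fraction of matching cells (0.0 when shapes differ) and that pass@2 means either of two attempts matches the expected grid exactly. How evaluator.py combines the two into a score is not shown here.

```python
def cell_accuracy(predicted, expected):
    """Fraction of cells that match; 0.0 when the grid shapes differ (assumed convention)."""
    if len(predicted) != len(expected):
        return 0.0
    if any(len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)):
        return 0.0
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    matches = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return matches / total


def pass_at_2(attempts, expected):
    """True when either of the first two attempts equals the expected grid exactly."""
    return any(attempt == expected for attempt in attempts[:2])
```

For example, with an expected grid `[[1, 0], [0, 1]]`, a prediction `[[1, 0], [0, 0]]` gets a cell accuracy of 0.75 but does not count toward pass@2 on its own.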

Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | API key |
| ARC_TASK_FILE | training | training or evaluation |
| TASK_NUM | 0 | Task index within the dataset |
| BASE_CONFIG | ./config.yaml | Base config template path |
| CONFIG_OUT | ./config_task_{N}.yaml | Output path for generated config |
| DATA_ROOT | ./data | Path to ARC data directory |
| MAX_ITERATIONS | (from config) | Override max_iterations at runtime |
| ARC_EVAL_INCLUDE_TEST | 0 | Set to 1 to also run the held-out test inputs during evolution |
| ARC_EVAL_USE_TEST_FOR_SCORE | 0 | Set to 1 to average train and test scores into combined_score (only used when ARC_EVAL_INCLUDE_TEST=1) |
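As a rough illustration of how these variables and their defaults interact, the sketch below reads them from the environment and applies the two test flags to a score. The function names `read_settings` and `combined_score` are hypothetical, and the benchmark scripts may parse these variables differently; only the variable names and defaults come from the table above.

```python
import os


def read_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    task_num = int(env.get("TASK_NUM", "0"))
    return {
        "task_file": env.get("ARC_TASK_FILE", "training"),
        "task_num": task_num,
        "base_config": env.get("BASE_CONFIG", "./config.yaml"),
        "config_out": env.get("CONFIG_OUT", f"./config_task_{task_num}.yaml"),
        "data_root": env.get("DATA_ROOT", "./data"),
        "include_test": env.get("ARC_EVAL_INCLUDE_TEST", "0") == "1",
        "use_test_for_score": env.get("ARC_EVAL_USE_TEST_FOR_SCORE", "0") == "1",
    }


def combined_score(train_score, test_score, settings):
    """Average train and test scores only when both test flags are set to 1."""
    if settings["include_test"] and settings["use_test_for_score"]:
        return (train_score + test_score) / 2
    return train_score
```

Note that in this sketch ARC_EVAL_USE_TEST_FOR_SCORE has no effect on its own, mirroring the table's remark that it is only used when ARC_EVAL_INCLUDE_TEST=1.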