	# ARC Benchmark

	Evolves solutions to ARC-AGI visual reasoning tasks using SkyDiscover.

	## Setup

	### 1. Download ARC data

	Clone the ARC-AGI-2 repo and convert the data:

	```bash
	cd benchmarks/arc_benchmark
	git clone https://github.com/arcprize/ARC-AGI-2.git /tmp/ARC-AGI-2
	OUT_DIR=./data uv run python convert_arc_agi2_data.py /tmp/ARC-AGI-2
	rm -rf /tmp/ARC-AGI-2
	```

	This creates 4 files in `data/`:
	- `arc-agi_training_challenges.json` (1000 tasks)
	- `arc-agi_training_solutions.json`
	- `arc-agi_evaluation_challenges.json` (120 tasks)
	- `arc-agi_evaluation_solutions.json`
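
To sanity-check the converted data, a short script can load a challenges file and count the examples per task. The sketch below assumes the files use the common ARC Prize JSON layout (a dict keyed by task ID, each entry holding `train` and `test` lists of grids). That layout is an assumption rather than something this README states, so the example runs on a tiny inline sample instead of the real files; `summarize` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Tiny stand-in for arc-agi_training_challenges.json (assumed layout).
sample = {
    "task_a": {
        "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
        "test": [{"input": [[1, 1], [0, 0]]}],
    }
}


def summarize(challenges_path):
    """Return {task_id: (num_train_pairs, num_test_inputs)} for a challenges file."""
    with open(challenges_path) as f:
        challenges = json.load(f)
    return {
        task_id: (len(task["train"]), len(task["test"]))
        for task_id, task in challenges.items()
    }


# Write the sample to a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "arc-agi_training_challenges.json"
    path.write_text(json.dumps(sample))
    summary = summarize(path)

print(summary)
```

Pointing `summarize` at `data/arc-agi_training_challenges.json` should report one entry per task if the assumed layout holds.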

	### 2. Set your API key

	```bash
	export OPENAI_API_KEY=...
	```

	## Run a single task

	Each ARC task needs its own config, because the task's training examples are injected into the prompt. Use `generate_config.py` to create one, then run with any search backend:

	```bash
	cd benchmarks/arc_benchmark

	# Generate task-specific config
	TASK_NUM=0 ARC_TASK_FILE=training CONFIG_OUT=./config_task_0.yaml \
	  uv run python generate_config.py

	# Run with any search backend (for example evox, openevolve, or gepa)
	# by substituting its name for [your_algorithm]
	uv run skydiscover-run initial_program.py evaluator.py \
	  -c config_task_0.yaml -s [your_algorithm] -i 30
	```
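
To illustrate why the config is task-specific, the sketch below turns one task's training pairs into a block of prompt text. The prompt wording, the grid rendering, and the function names are all hypothetical; the actual format produced by `generate_config.py` is not described in this README.

```python
def grid_to_text(grid):
    """Render a grid as rows of space-separated integers."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def format_training_examples(task):
    """Build a prompt fragment from a task's training pairs (hypothetical format)."""
    parts = []
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    return "\n\n".join(parts)


# A one-pair task in the assumed {"train": [{"input": ..., "output": ...}]} layout.
task = {"train": [{"input": [[0, 1]], "output": [[1, 0]]}]}
prompt = format_training_examples(task)
print(prompt)
```

Because every task has different training pairs, the resulting text differs per task, which is why a single shared config is not enough.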

	## Run all evaluation tasks

	```bash
	cd benchmarks/arc_benchmark
	export ARC_TASK_FILE=evaluation

	NUM_TASKS=$(uv run python -c "import json; print(len(json.load(open('data/arc-agi_evaluation_challenges.json'))))")

	for i in $(seq 0 $((NUM_TASKS - 1))); do
	  TASK_NUM=$i CONFIG_OUT=./config_task_${i}.yaml uv run python generate_config.py
	  TASK_NUM=$i uv run skydiscover-run initial_program.py evaluator.py \
	    -c config_task_${i}.yaml -s [your_algorithm] -i 30 \
	    -o outputs/eval_task_${i}
	done
	```
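
The same loop can be driven from Python, which makes it easier to add logging or retries. The sketch below only builds the environment overrides and argument lists, mirroring the flags used in the shell loop above; it does not execute anything. `build_commands` and the `"<your_algorithm>"` placeholder are illustrative, not part of the benchmark.

```python
def build_commands(num_tasks, algorithm, iterations=30):
    """Return one (env, generate_argv, run_argv) triple per task index."""
    commands = []
    for i in range(num_tasks):
        config = f"config_task_{i}.yaml"
        # Environment overrides matching the variables set in the shell loop.
        env = {
            "ARC_TASK_FILE": "evaluation",
            "TASK_NUM": str(i),
            "CONFIG_OUT": f"./{config}",
        }
        generate = ["uv", "run", "python", "generate_config.py"]
        run = [
            "uv", "run", "skydiscover-run", "initial_program.py", "evaluator.py",
            "-c", config, "-s", algorithm, "-i", str(iterations),
            "-o", f"outputs/eval_task_{i}",
        ]
        commands.append((env, generate, run))
    return commands


commands = build_commands(2, "<your_algorithm>")
for env, generate, run in commands:
    print(env["TASK_NUM"], " ".join(run))
```

Each argument list could then be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)` from inside `benchmarks/arc_benchmark`.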

	## Post-discovery test evaluation

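	After the discovery process finishes, evaluate the best program on the held-out test inputs. Point `OUTS_DIR` at the run's output directory (the example below uses an `adaevolve` subdirectory):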

	```bash
	TASK_NUM=0 ARC_TASK_FILE=evaluation \
	OUTS_DIR=./outputs/eval_task_0/adaevolve \
	uv run python post_discovery_eval.py
	```

	## Config: GPT vs Gemini

	Edit `config.yaml`: comment out the GPT block and uncomment the Gemini block. Alternatively, override the model on the command line with `--model` (`-m`):

	```bash
	uv run skydiscover-run ... -m gemini/gemini-3-pro-preview
	```

	## Files

	| File | Description |
	|------|-------------|
	| `initial_program.py` | Seed program with two transform functions to evolve |
	| `evaluator.py` | Scores programs on pass@2 + cell accuracy |
	| `config.yaml` | Base config template (prompt injected by `generate_config.py`) |
	| `generate_config.py` | Injects task-specific training examples into config as system prompt |
	| `post_discovery_eval.py` | Evaluates best program on held-out test inputs |
	| `convert_arc_agi2_data.py` | Converts raw ARC-AGI-2 data to benchmark format |
	| `requirements.txt` | Dependencies (numpy) |
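
The two metrics named for `evaluator.py` can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluator's actual code: it assumes pass@2 means that either of two attempted output grids matches the expected grid exactly, and that cell accuracy is the fraction of matching cells (zero when the shapes differ). The function names are hypothetical.

```python
def cell_accuracy(predicted, expected):
    """Fraction of cells that match; 0.0 when the grid shapes differ."""
    if len(predicted) != len(expected):
        return 0.0
    if any(len(p_row) != len(e_row) for p_row, e_row in zip(predicted, expected)):
        return 0.0
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    matches = sum(
        p == e
        for p_row, e_row in zip(predicted, expected)
        for p, e in zip(p_row, e_row)
    )
    return matches / total


def pass_at_2(attempts, expected):
    """1.0 if either of the first two attempts equals the expected grid exactly."""
    return 1.0 if any(attempt == expected for attempt in attempts[:2]) else 0.0


expected = [[1, 0], [0, 1]]
attempt_a = [[1, 0], [0, 0]]  # three of four cells correct
attempt_b = [[1, 0], [0, 1]]  # exact match

print(cell_accuracy(attempt_a, expected))
print(pass_at_2([attempt_a, attempt_b], expected))
```

Cell accuracy gives partial credit that a search process can climb, while pass@2 only rewards exact solutions; how the evaluator weights the two is not specified here.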

	## Environment variables

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `OPENAI_API_KEY` | (required) | API key |
	| `ARC_TASK_FILE` | `training` | `training` or `evaluation` |
	| `TASK_NUM` | `0` | Task index within the dataset |
	| `BASE_CONFIG` | `./config.yaml` | Base config template path |
	| `CONFIG_OUT` | `./config_task_{N}.yaml` | Output path for generated config |
	| `DATA_ROOT` | `./data` | Path to ARC data directory |
	| `MAX_ITERATIONS` | (from config) | Override `max_iterations` at runtime |
	| `ARC_EVAL_INCLUDE_TEST` | `0` | Set to `1` to also run the held-out test inputs during evolution |
	| `ARC_EVAL_USE_TEST_FOR_SCORE` | `0` | Set to `1` to average train and test scores into `combined_score` (only used when `ARC_EVAL_INCLUDE_TEST=1`) |
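
The defaults in the table, and the interaction between the two `ARC_EVAL_*` flags, can be sketched like this. `read_settings` and `combined_score` are hypothetical helpers written from the table alone; the benchmark scripts may parse these variables differently.

```python
def read_settings(env):
    """Resolve the documented variables from a mapping such as os.environ."""
    task_num = int(env.get("TASK_NUM", "0"))
    return {
        "task_file": env.get("ARC_TASK_FILE", "training"),
        "task_num": task_num,
        "base_config": env.get("BASE_CONFIG", "./config.yaml"),
        "config_out": env.get("CONFIG_OUT", f"./config_task_{task_num}.yaml"),
        "data_root": env.get("DATA_ROOT", "./data"),
        "include_test": env.get("ARC_EVAL_INCLUDE_TEST", "0") == "1",
        "use_test_for_score": env.get("ARC_EVAL_USE_TEST_FOR_SCORE", "0") == "1",
    }


def combined_score(train_score, test_score, settings):
    """Average train and test scores only when both flags are enabled."""
    if settings["include_test"] and settings["use_test_for_score"]:
        return (train_score + test_score) / 2
    return train_score


defaults = read_settings({})
both_flags = read_settings(
    {"ARC_EVAL_INCLUDE_TEST": "1", "ARC_EVAL_USE_TEST_FOR_SCORE": "1"}
)
score_only = read_settings({"ARC_EVAL_USE_TEST_FOR_SCORE": "1"})

print(defaults["config_out"])
print(combined_score(1.0, 0.5, both_flags))
print(combined_score(1.0, 0.5, score_only))
```

In real use the mapping would be `os.environ`; passing a plain dict keeps the example deterministic.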