<h1 align="center">
  <img src="assets/logo_vector.png" height="80" alt="SkyDiscover logo" style="vertical-align: middle;">
  <b>SkyDiscover</b>
</h1>
<p align="center"> A Flexible Framework for AI-Driven Scientific and Algorithmic Discovery</p>
<p align="center">
  <a href="https://skydiscover-ai.github.io/blog.html"><img src="https://img.shields.io/badge/blog-SkyDiscover-orange?style=flat-square" alt="Blog" /></a>
<a href="https://arxiv.org/abs/2602.20133"><img src="https://img.shields.io/badge/paper-AdaEvolve-red?style=flat-square" alt="AdaEvolve Paper" /></a>
<a href="https://arxiv.org/abs/2602.23413"><img src="https://img.shields.io/badge/paper-EvoX-lightblue?style=flat-square" alt="EvoX Paper" /></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-green?style=flat-square" /></a>
</p>
<p align="center">
<img src="assets/architecture.png" width="720" alt="SkyDiscover architecture"><br>
</p>
**SkyDiscover** is a modular framework for AI-driven scientific and algorithmic discovery, providing a unified interface for implementing, running, and fairly comparing discovery algorithms across 200+ optimization tasks.
SkyDiscover introduces two new adaptive optimization algorithms:
- **[AdaEvolve](https://arxiv.org/abs/2602.20133)**, which dynamically adjusts its optimization behavior based on observed progress.
- **[EvoX](https://arxiv.org/abs/2602.23413)**, which uses LLMs to evolve the optimization (evolution) strategy itself on the fly.
SkyDiscover can also drive OpenEvolve, ShinkaEvolve, and GEPA directly from their own source code, so you can quickly benchmark against them. It additionally hosts native reimplementations of OpenEvolve and GEPA (the `openevolve_native` and `gepa_native` algorithms) built on the modular interface.
SkyDiscover natively supports [Harbor](https://harborframework.com/)-format benchmarks, so you can run external benchmark suites out of the box, including [AlgoTune](https://github.com/oripress/AlgoTune), [EvoEval](https://github.com/evo-eval/evoeval), [HumanEvalFix](https://github.com/bigcode-project/octopack), [BigCodeBench](https://github.com/bigcode-project/bigcodebench), [LiveCodeBench](https://livecodebench.github.io/), [USACO](https://usaco.org/), [CRUSTBench](https://github.com/AInfinity/CRUSTBench), and [CodePDE](https://github.com/).
> 🚧 This project is under active development.
---
## 📊 Benchmark Performance
Across ~200 optimization benchmarks, AdaEvolve and EvoX achieve the strongest open-source results: matching or exceeding AlphaEvolve and human SOTA, and outperforming OpenEvolve, GEPA, and ShinkaEvolve under identical generation budgets.
- **Frontier-CS (172 problems)**: ~34% median score improvement over OpenEvolve, GEPA, and ShinkaEvolve
- **Math + Systems Optimization (14 tasks evaluated)**: Matches or exceeds AlphaEvolve and human-designed SOTA on 6/6 systems and 6/8 math tasks
- **Real-world systems impact**: 41% lower cross-cloud transfer cost, 14% better GPU load balance for MoE serving, and 29% lower KV-cache pressure via GPU model placement
<p align="center">
<img src="assets/benchmarks.png" width="900" alt="SkyDiscover benchmarks">
</p>
<details>
<summary><b>📋 Complete results of AdaEvolve and EvoX (100 iterations)</b></summary>
> AdaEvolve and EvoX are **complementary**: AdaEvolve adapts search *parameters* for fast early gains; EvoX evolves the search *strategy itself* for stronger long-horizon gains. Both are built on SkyDiscover.
<p align="center">
<img src="assets/comparison.png" width="900" alt="Main results for systems and math problems">
</p>
</details>
<details>
<summary><b>📈 Scaling behavior of AdaEvolve and EvoX</b></summary>
The scaling behavior of AdaEvolve and EvoX shows a **complementary crossover**. AdaEvolve's per-iteration parameter adaptation yields fast early gains in low-budget runs (T ≤ 50), while EvoX's demand-driven strategy evolution unlocks step-change improvements in longer runs (T ≥ 50).
<p align="center">
<img src="assets/scaling_comparison.png" width="900" alt="Scaling behavior of AdaEvolve vs EvoX across 500 iterations">
<br><em>Best-so-far score vs. iteration for Signal Processing, Heilbronn Convex, Prism, and Cloudcast (500 iterations, GPT-5).</em>
</p>
</details>
<details>
<summary><b>🔁 Evolving AdaEvolve's policy with EvoX (coming soon)</b></summary>
The two methods are **composable**: EvoX can start from AdaEvolve as its initial strategy and evolve it further, achieving the best results on 3 out of 4 benchmarks (100 iterations, GPT-5). This combined mode will be available in SkyDiscover soon.
| Benchmark | AdaEvolve | EvoX (Random Init) | EvoX (AdaEvolve Init) |
|:--|--:|--:|--:|
| Signal Proc. (↑) | 0.718 | 0.721 | **0.760** |
| Heilbronn Cvx. (↑) | 0.0290 | 0.0270 | **0.0291** |
| Cloudcast (↓) | 640.5 | 637.1 | **623.4** |
| Prism (↑) | 26.37 | **30.52** | 26.27 |
</details>
<details>
<summary><b>Task breakdown across math, systems, and programming challenges</b></summary>
| | Benchmark | Domain | Tasks | Description |
|-|-----------|--------|------:|-------------|
| 🔢 | [math/](benchmarks/math/) | Math | 14 | Circle packing, Erdős problems, geometric optimization |
| 🖥️ | [ADRS/](benchmarks/ADRS/) | Systems | 5 | Cloud scheduling, load balancing, MoE expert placement |
| ⚡ | [gpu_mode/](benchmarks/gpu_mode/) | Systems | 4 | GPU kernel optimization |
| 🔧 | [kernelbench/](benchmarks/kernelbench/) | Systems | 250+ | [KernelBench](https://github.com/ScalingIntelligence/KernelBench) GPU kernel speedup optimization |
| 🧩 | [frontier-cs-eval/](benchmarks/frontier-cs-eval/) | Algorithms | 172 | [Frontier-CS](https://frontier-cs.org/) competitive programming |
| 🧠 | [arc_benchmark/](benchmarks/arc_benchmark/) | Reasoning | – | ARC-AGI visual reasoning |
| 💻 | [ale_bench/](benchmarks/ale_bench/) | Algorithms | 10 | Algorithmic programming contests |
| 🎨 | [image_gen/](benchmarks/image_gen/) | Creative | 1 | AI image generation evolution |
| 💬 | [prompt_optimization/](benchmarks/prompt_optimization/) | NLP | 1 | HotPotQA prompt evolution |
See [Dependency extras](#dependency-extras) for install commands per benchmark.
</details>
## 🚀 Quick Start
**Prerequisites:** Python >= 3.10, [uv](https://docs.astral.sh/uv/)
```bash
# Install
uv sync
export OPENAI_API_KEY="<your-key>"
# Try the circle packing benchmark
uv sync --extra math
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
benchmarks/math/circle_packing/evaluator.py \
--config benchmarks/math/circle_packing/config.yaml \
--search evox \
--iterations 100
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
benchmarks/math/circle_packing/evaluator.py \
--config benchmarks/math/circle_packing/config.yaml \
--search adaevolve \
--iterations 100
# Or run on your own problem
# algo can be "evox", "adaevolve", "openevolve", "gepa", "shinkaevolve"
uv run skydiscover-run initial_program.py evaluator.py \
--search <algo> \
--model gpt-5 \
--iterations 100
# initial_program is optional: omit it to let the LLM start from scratch
uv run skydiscover-run evaluator.py \
--search <algo> \
--model gpt-5 \
--iterations 100
# Run a Harbor benchmark (e.g. AlgoTune); no seed program needed
pip install harbor
harbor datasets download algotune@1.0 -o /tmp/algotune
uv run skydiscover-run /tmp/algotune/<id>/algotune-set-cover \
--model anthropic/claude-sonnet-4-6 \
--search best_of_n -i 10
```
Or use the Python API:
```python
from skydiscover import run_discovery
result = run_discovery(
    initial_program="initial_program.py",
    evaluator="evaluator.py",
    search="adaevolve",  # or "evox", "openevolve", "gepa", "shinkaevolve"
    model="gpt-5",
    iterations=100,
)
print(result.best_score, result.best_solution)
```
## ✍️ What You Write
### Scoring Function (required)
SkyDiscover supports three evaluator formats; pick whichever fits your use case:
| Format | When to use | What you point `evaluation_file` at |
|:---|:---|:---|
| **Python function** | Simple tasks, no system deps | `evaluator.py` |
| **Containerized** | Custom deps, data files, isolation | `evaluator/` directory (must contain `Dockerfile` + `evaluate.sh`) |
| **Harbor task** | External benchmark suites (AlgoTune, EvoEval, HumanEvalFix, BigCodeBench, LiveCodeBench, USACO, CRUSTBench, CodePDE, and more) | Task directory (must contain `instruction.md` + `tests/` + `environment/Dockerfile`) |
SkyDiscover auto-detects the format. See [`benchmarks/README.md`](benchmarks/README.md#adding-a-benchmark) for full setup instructions.
**Python evaluator**: a file with an `evaluate(program_path)` function:
```python
def evaluate(program_path):
    score = run_and_grade(program_path)
    return {
        "combined_score": score,  # primary optimization target (maximized)
        "artifacts": {  # optional: stored with the solution for future context
            "feedback": "Off by one in the loop boundary",
        },
    }
```
**Containerized evaluator**: a directory with a `Dockerfile` and an `evaluate.sh` that writes JSON to stdout. It runs in Docker, so it can have arbitrary dependencies.
**Harbor task**: a directory following the [Harbor](https://harborframework.com/) format (`instruction.md`, `environment/Dockerfile`, `tests/test.sh`). Works out of the box with 8+ tested benchmark suites (see [benchmarks/README.md](benchmarks/README.md#tested-harbor-datasets) for the full list).
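For orientation, a minimal Harbor task layout might look like the sketch below (the three files are those named above; the task name is illustrative):

```
my-task/
├── instruction.md          # problem statement given to the model
├── environment/
│   └── Dockerfile          # execution environment for the task
└── tests/
    └── test.sh             # scoring script run inside the container
```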
- **combined_score** drives evolution. If omitted, SkyDiscover averages all numeric values in the dict.
- **artifacts** is optional: entries are injected into the next LLM prompt as context.
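As a rough sketch of the fallback rule above (the actual implementation may differ), omitting `combined_score` is equivalent to scoring with something like:

```python
def fallback_score(metrics: dict) -> float:
    """Sketch: use 'combined_score' if present, else average all numeric values."""
    if "combined_score" in metrics:
        return metrics["combined_score"]
    # Non-numeric entries such as "artifacts" are ignored by the average.
    nums = [v for v in metrics.values()
            if isinstance(v, (int, float)) and not isinstance(v, bool)]
    return sum(nums) / len(nums)
```

For example, `{"accuracy": 0.9, "speed": 0.5, "artifacts": {"feedback": "ok"}}` would score as the mean of 0.9 and 0.5.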
For `search.type: adaevolve`, you can also enable explicit Pareto optimization by configuring `search.database.pareto_objectives` and returning those objective metrics directly from the evaluator. In that mode, `combined_score` becomes optional and is only used as a scalar fallback/proxy when configured.
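A hedged sketch of what that configuration might look like (the objective names here are illustrative; see [configs/](configs/) for the authoritative schema):

```yaml
search:
  type: "adaevolve"
  database:
    pareto_objectives: ["latency", "accuracy"]  # illustrative objective names
```

The evaluator would then return those metrics directly, e.g. `{"latency": 12.3, "accuracy": 0.94}`.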
### Starting Solution (optional)
The initial program is **optional**. When omitted, the LLM generates a solution from scratch. If you provide one, mark the region to mutate with EVOLVE-BLOCK markers; everything outside them is left untouched.
```python
# EVOLVE-BLOCK-START
def solve(input_data):
    return input_data  # baseline: SkyDiscover will improve this
# EVOLVE-BLOCK-END
```
If no markers are present, the entire file is treated as mutable.
## 🧬 Pick an Algorithm
See [Benchmark Performance](#-benchmark-performance) for a detailed comparison of AdaEvolve and EvoX against other algorithms.
| Algorithm | Flag | Description |
|:---|:---|:---|
| ⭐ **AdaEvolve** | `--search adaevolve` | Multi-island adaptive search with UCB, migration, and paradigm breakthroughs |
| 🧠 **EvoX** | `--search evox` | Self-evolving paradigm that co-adapts solution generation and experience management |
| 🔝 **Top-K** | `--search topk` | Selects top-K solutions to refine |
| 🔍 **Beam Search** | `--search beam_search` | Breadth-first expansion of a beam of top solutions |
| 🎲 **Best-of-N** | `--search best_of_n` | Generates N variants per iteration, keeps the best |
| 🧪 **GEPA Native** | `--search gepa_native` | Pareto-efficient search with reflective prompting and LLM-mediated merge |
| 🗺️ **OpenEvolve Native** | `--search openevolve_native` | MAP-Elites + island-based evolutionary search |
### External backends
Install with `uv sync --extra external`, then use the corresponding flag:
| Backend | Flag | Source |
|:---|:---|:---|
| **OpenEvolve** | `--search openevolve` | [codelion/openevolve](https://github.com/codelion/openevolve) |
| **GEPA** | `--search gepa` | [gepa-ai/gepa](https://github.com/gepa-ai/gepa) |
| **ShinkaEvolve** | `--search shinkaevolve` | [SakanaAI/ShinkaEvolve](https://github.com/SakanaAI/ShinkaEvolve) (manual install) |
<details>
<summary>ShinkaEvolve manual install</summary>
```bash
git clone --depth 1 https://github.com/SakanaAI/ShinkaEvolve.git external_repos/ShinkaEvolve
uv pip install -e external_repos/ShinkaEvolve
```
</details>
## ⚙️ Configuration
Pass a YAML config with `-c`. See [configs/](configs/) for full annotated templates.
```yaml
max_iterations: 100
llm:
models: [{ name: "gemini/gemini-3-pro-preview", weight: 1.0 }]
search:
type: "adaevolve" # or "evox", "topk", "beam_search", "best_of_n"
prompt:
system_message: |
You are an expert at optimizing algorithms.
```
API keys (OPENAI_API_KEY, GEMINI_API_KEY, etc.) are resolved from environment variables automatically.
### 📊 Live Monitor & Human Feedback
Add `monitor: { enabled: true }` to your config. The dashboard URL is printed at run start; the dashboard shows a scatter plot of all programs, code diffs, metrics, and AI summaries. A **Human Feedback** panel lets you steer evolution in real time.
Replay a completed run:
```bash
uv run skydiscover-viewer /path/to/checkpoints/checkpoint_100
```
## 📖 Reference
<details>
<summary><b>CLI flags</b></summary>
```
uv run skydiscover-run [INITIAL_PROGRAM] EVALUATOR [options]
```
| Flag | Description |
|:---|:---|
| `-c, --config FILE` | Config YAML |
| `-i, --iterations N` | Number of iterations |
| `-m, --model MODEL` | LLM model (overrides config) |
| `-s, --search TYPE` | Search algorithm |
| `-o, --output DIR` | Output directory |
| `--api-base URL` | Override LLM API endpoint |
| `--checkpoint DIR` | Resume from checkpoint |
| `--agentic` | Enable agentic mode (LLM can read your files) |
| `-l, --log-level LEVEL` | DEBUG, INFO, WARNING, or ERROR |
</details>
<details>
<summary><b>Python API β discover_solution() (convenience wrapper)</b></summary>
`discover_solution()` is a convenience wrapper around `run_discovery()` (shown in [Quick Start](#-quick-start)) for inline string solutions and callable evaluators:
```python
from skydiscover import discover_solution
result = discover_solution(
    initial_solution="def solve(x): return x",  # optional: omit to start from scratch
    evaluator=lambda path: {"combined_score": run_tests(path)},
    iterations=50,
    search="evox",
)
```
</details>
<details>
<summary><b>Model providers</b></summary>
Any [LiteLLM](https://docs.litellm.ai/)-compatible model works using `provider/model` format:
```bash
--model gpt-5 # OpenAI (default)
--model gemini/gemini-3-pro-preview # Gemini
--model anthropic/claude-sonnet-4-20250514 # Anthropic
--model ollama/llama3 --api-base http://localhost:11434/v1 # Local (Ollama, vLLM, etc.)
```
Multi-model pools with weighted sampling are supported in config:
```yaml
llm:
  models:
    - name: "gpt-5-mini"
      weight: 0.7
    - name: "gemini/gemini-2.0-flash"
      weight: 0.3
```
</details>
<details id="dependency-extras">
<summary><b>Benchmark dependency extras</b></summary>
```bash
uv sync # Base install
uv sync --extra math # Math benchmarks (SciPy, JAX, PyWavelets, β¦)
uv sync --extra adrs # ADRS systems benchmarks
uv sync --extra frontier-cs # Frontier-CS benchmark tooling
uv sync --extra external # OpenEvolve / GEPA / ShinkaEvolve backends
uv sync --extra prompt-optimization # HotPotQA prompt optimization
```
Combine extras as needed: `uv sync --extra external --extra math`
If a benchmark ships its own `requirements.txt`, also run: `uv pip install -r path/to/requirements.txt`
</details>
---
## 🛠️ Extending SkyDiscover
- **New benchmark** → [`benchmarks/README.md`](benchmarks/README.md#adding-a-benchmark)
- **New search algorithm** → [`skydiscover/search/README.md`](skydiscover/search/README.md)
- **New context builder** → [`skydiscover/context_builder/README.md`](skydiscover/context_builder/README.md)
---
## 📚 Related Work
SkyDiscover is inspired by [AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/) and incorporates useful code components from open-source efforts such as [OpenEvolve](https://github.com/codelion/openevolve). Its interface is compatible with the [optimize_anything](https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/) API.
## ✏️ Citation
```bibtex
@misc{skydiscover2026,
  title  = {SkyDiscover: A Flexible Framework for AI-Driven Scientific and Algorithmic Discovery},
  author = {Liu, Shu and Cemri, Mert and Agarwal, Shubham and Krentsel, Alexander and Naren, Ashwin and Mang, Qiuyang and Li, Zhifei and Gupta, Akshat and Maheswaran, Monishwaran and Cheng, Audrey and Pan, Melissa and Boneh, Ethan and Ramchandran, Kannan and Sen, Koushik and Dimakis, Alexandros G. and Zaharia, Matei and Stoica, Ion},
  year   = {2026},
  url    = {https://skydiscover-ai.github.io/blog.html}
}
```
## 📬 Contact Us
For questions or feedback, reach out to us:
[lshu@berkeley.edu](mailto:lshu@berkeley.edu) · [mert_cemri@berkeley.edu](mailto:mert_cemri@berkeley.edu) · [shubham3@berkeley.edu](mailto:shubham3@berkeley.edu)