<p align="">
<a href="https://frontier-cs.org">
<img src="assets/logo.png" alt="Frontier-CS Logo" width="2000"/>
</a>
</p>
<h2 align="center">
Evolving Challenges for Evolving Intelligence
</h2>
<p align="center">
<a href="https://frontier-cs.org"><img src="https://img.shields.io/badge/Website-frontier--cs.org-orange?logo=googlechrome" alt="Website"></a>
<a href="https://discord.gg/k4hd2nU4UE"><img src="https://img.shields.io/badge/Discord-Join_Community-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
<a href="https://deepwiki.com/FrontierCS/Frontier-CS"><img src="https://img.shields.io/badge/DeepWiki-Documentation-blue?logo=bookstack&logoColor=white" alt="DeepWiki"></a>
<br>
<a href="https://arxiv.org/abs/2512.15699"><img src="https://img.shields.io/badge/arXiv-2512.15699-b31b1b?logo=arxiv&logoColor=white" alt="arXiv"></a>
<a href="https://huggingface.co/datasets/FrontierCS/Frontier-CS" target="_blank">
<img src="https://img.shields.io/badge/Hugging_Face-🤗%20Datasets-orange" alt="Hugging Face">
</a>
<img src="https://img.shields.io/badge/Research_Problems-68-blue" alt="Research Problems">
<img src="https://img.shields.io/badge/Algorithmic_Problems-172-green" alt="Algorithmic Problems">
</p>
## What is Frontier-CS?
**Frontier-CS** is an _unsolved_, _open-ended_, _verifiable_, and _diverse_ benchmark for evaluating AI on challenging computer science problems.
Think of it as an "exam" for AI, but instead of easy textbook questions, we give problems that are genuinely difficult: ones that researchers struggle with, that have no known optimal solutions, or that require deep expertise to even attempt.
## Why Frontier-CS?
Current benchmarks are becoming too easy. Models score 90%+ on many existing coding benchmarks, but that doesn't mean they can actually do useful research or solve real-world engineering challenges.
**Frontier-CS is different:**
| | Traditional Benchmarks | Frontier-CS |
| ---------- | ------------------------------------------ | ------------------------------------------------------- |
| Difficulty | Often saturated as models improve          | _Unsolved_: no solution has achieved a perfect score    |
| Problems | Textbook-style, known solutions | _Open-ended_ research & optimization challenges |
| Evaluation | Binary pass-or-fail | _Verifiable_ continuous scoring, always room to improve |
| Scope | Usually one domain | _Diverse_: systems, ML, algorithms, security, and more |
## 🏆 Leaderboard Snapshot (01/29/2026)
Score@k is the best score over k runs; Avg@k is the average score over k runs; Elo is fit with a Bradley–Terry model on single-attempt, difficulty-normalized performance.
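
As a rough illustration of the two aggregate metrics, the sketch below computes Score@k and Avg@k from per-problem run scores. The `runs` dictionary, the function names, and the choice to average per-problem values across problems are assumptions for illustration, not the leaderboard's actual pipeline.

```python
from statistics import mean


def score_at_k(runs: dict[str, list[float]], k: int) -> float:
    """Best-of-k: take the best of the first k runs per problem, then average."""
    return mean(max(scores[:k]) for scores in runs.values())


def avg_at_k(runs: dict[str, list[float]], k: int) -> float:
    """Average-of-k: average the first k runs per problem, then average."""
    return mean(mean(scores[:k]) for scores in runs.values())


# Hypothetical scores for two problems, three runs each.
runs = {"p1": [10.0, 40.0, 25.0], "p2": [60.0, 50.0, 70.0]}
print(score_at_k(runs, 3))  # best runs are 40 and 70
print(avg_at_k(runs, 3))    # per-problem means are 25 and 60
```

Under these assumptions Score@k can never fall below Avg@k for the same runs, which matches the pattern in the tables below.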
<a id="algorithmic-track"></a>
### Algorithmic Track (172 problems)
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|:---:|---|---:|---:|---:|---:|
| 🥇 | Gemini 3.0 Pro | **33.12** | **34.58** | **56.09** | **1265** |
| 🥈 | GPT 5.2 Thinking | 32.40 | 33.11 | 47.19 | 1242 |
| 🥉 | GPT 5 Thinking | 23.10 | 22.58 | 39.73 | 1196 |
| 4 | DeepSeek 3.2 | 24.83 | 23.89 | 41.44 | 1193 |
| 5 | Grok 4 | 24.04 | 22.98 | 36.81 | 1174 |
| 6 | Gemini 2.5 Pro | 20.34 | 19.32 | 36.65 | 1167 |
| 7 | GPT 5.1 Thinking | 20.64 | 21.49 | 34.76 | 1164 |
**Human reference: <b>86.99</b> (Score@1).**
<a id="research-track"></a>
### Research Track (68 problems)
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|:---:|---|---:|---:|---:|---:|
| 🥇 | Gemini 3.0 Pro | **46.55** | **43.14** | **59.22** | **1283** |
| 🥈 | GPT 5 Thinking | 30.91 | 34.94 | 55.25 | 1218 |
| 🥉 | GPT 5.1 Thinking | 32.12 | 33.70 | 56.79 | 1214 |
| 4 | GPT 5.2 Thinking | 30.29 | 34.09 | 58.90 | 1210 |
| 5 | Gemini 2.5 Pro | 21.66 | 25.74 | 51.57 | 1180 |
| 6 | Grok 4 | 26.75 | 24.01 | 48.15 | 1149 |
| 7 | DeepSeek 3.2 | 21.51 | 21.76 | 44.41 | 1146 |
## Getting Started
### Installation
**Requirements:** Python 3.11+, Docker 24+ (for local evaluation)
```bash
git clone https://github.com/FrontierCS/Frontier-CS.git
cd Frontier-CS
# Install dependencies (using uv, recommended)
uv sync
# Or with pip:
pip install -e .
```
### Try it yourself
Here's [Algorithmic Problem 0](algorithmic/problems/0/statement.txt). Try to beat GPT-5!
```bash
# Run the example solution (Human Expert Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/reference.cpp
# Run the example solution (GPT-5 Thinking Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp
# Try your own solution!
frontier eval algorithmic 0 <your_solution.cpp>
```
<p align="center">
<img src="assets/teaser.png" alt="Example polyomino packing solution visualized with scripts/viz.py" width="800"/>
</p>
### Research Problems
```bash
# List all problems
frontier list research
# Evaluate (uses SkyPilot by default, requires `sky check`)
frontier eval research flash_attn <your_solution.py>
# Use Docker instead (no cloud setup needed)
frontier eval research flash_attn <your_solution.py> --backend docker
```
See [research/README.md](research/README.md) for full documentation.
### Algorithmic Problems
```bash
# Evaluate (uses Docker by default)
frontier eval algorithmic 1 <your_solution.cpp>
# Use SkyPilot instead
frontier eval algorithmic 1 <your_solution.cpp> --backend skypilot
```
See [algorithmic/README.md](algorithmic/README.md) for full documentation.
### Raw Score
Frontier-CS supports unbounded scoring, enabling open-ended evaluation compatible with algorithm evolution frameworks such as OpenEvolve.
```bash
# Get unbounded score (without clipping to 100)
frontier eval research flash_attn <your_solution.py> --unbounded
frontier eval algorithmic 1 <your_solution.cpp> --unbounded
```
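
The difference between the two modes can be pictured as a cap applied to the raw score. The sketch below assumes the bounded score is simply the raw score capped at 100; the function name is hypothetical and the real evaluator may differ in its details.

```python
def clip_score(raw_score: float, cap: float = 100.0) -> float:
    """Return the bounded score: the raw score capped at `cap` (assumed rule)."""
    return min(raw_score, cap)


# Raw scores above the cap are only visible in unbounded mode.
for raw in (42.5, 100.0, 137.2):
    print(f"unbounded={raw:.1f} bounded={clip_score(raw):.1f}")
```

An evolutionary search loop would typically optimize the unbounded value, since the capped one stops distinguishing candidates once they reach 100.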
### Python API
```python
from pathlib import Path

from frontier_cs import SingleEvaluator

# Load solution sources as text (placeholder paths; `code` is assumed to take source text)
my_code = Path("my_solution.py").read_text()
cpp_code = Path("my_solution.cpp").read_text()

evaluator = SingleEvaluator()

# Evaluate a research problem
result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
print(f"Score: {result.score}")

# Evaluate an algorithmic problem
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code)
print(f"Score: {result.score}")

# Get unbounded score for algorithmic problems
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code, unbounded=True)
print(f"Score (bounded): {result.score}")
print(f"Score (unbounded): {result.score_unbounded}")
```
See `ARCHITECTURE.md` for an overview of the evaluation stack
and runner mapping.
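
For workflows that generate several candidate solutions, a small wrapper can keep the best one. The sketch below assumes only what the Python API snippet shows, namely that evaluation returns an object with a `score` attribute; `pick_best`, `FakeResult`, and the stand-in scorer are hypothetical names so the example runs without the evaluation backend.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


class HasScore(Protocol):
    score: float


@dataclass
class FakeResult:
    """Stand-in for an evaluation result; only `score` is assumed."""
    score: float


def pick_best(candidates: Iterable[str],
              evaluate: Callable[[str], HasScore]) -> tuple[str, float]:
    """Evaluate every candidate and return (best_code, best_score)."""
    scored = [(code, evaluate(code).score) for code in candidates]
    if not scored:
        raise ValueError("no candidates to evaluate")
    return max(scored, key=lambda pair: pair[1])


# Stand-in scorer: longer source gets a higher score (illustration only).
fake_evaluate = lambda code: FakeResult(score=float(len(code)))
best_code, best_score = pick_best(["a", "abc", "ab"], fake_evaluate)
print(best_code, best_score)
```

With the real evaluator, `evaluate` could be something like `lambda code: evaluator.evaluate("algorithmic", problem_id=1, code=code)`.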
### Batch Evaluation
Use batch evaluation to test your solutions at scale against the public test cases.
**Solution directory structure:**
```
{track}/solutions/
  {problem}/
    {model}.py      # variant 0
    {model}_1.py    # variant 1
    {model}_2.py    # variant 2
```
Example for research track:
```
research/solutions/
  flash_attn/
    gpt5.py
    claude4.5sonnet.py
  cross_entropy/
    gpt5.py
```
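
To make the naming convention concrete, the sketch below splits a solution path into its problem, model, and variant number. The function name and regular expression are illustrative assumptions, not part of the `frontier` tooling; a model name that itself ends in `_<digits>` would be misread as a variant.

```python
import re
from pathlib import Path

# Trailing "_<digits>" marks a variant; no suffix means variant 0.
_VARIANT_RE = re.compile(r"^(?P<model>.+)_(?P<variant>\d+)$")


def parse_solution_path(path: str) -> tuple[str, str, int]:
    """Split '<problem>/<model>[_<n>].<ext>' into (problem, model, variant)."""
    p = Path(path)
    problem = p.parent.name
    match = _VARIANT_RE.match(p.stem)
    if match:
        return problem, match["model"], int(match["variant"])
    return problem, p.stem, 0


print(parse_solution_path("research/solutions/flash_attn/gpt5.py"))
print(parse_solution_path("research/solutions/flash_attn/gpt5_2.py"))
print(parse_solution_path("research/solutions/flash_attn/claude4.5sonnet.py"))
```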
**Basic usage:**
```bash
# Evaluate all research solutions (uses SkyPilot by default)
frontier batch research
# Evaluate all algorithmic solutions (uses Docker by default)
frontier batch algorithmic
# Filter by model or problem
frontier batch research --model gpt5.1
frontier batch research --problem flash_attn
# Override default backend
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot
```
**Custom solutions directory:** You can test solutions from a custom directory with the same structure:
```bash
frontier batch research --solutions-dir ./my_solutions
```
Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
- Resume interrupted evaluations automatically
- Run multiple times with different `--solutions-dir` values; results accumulate across runs
See `--help` for all options.
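
The resume behavior can be pictured as a persisted set of completed pairs. The sketch below is one way such bookkeeping might work; the JSON layout, the file name, and the function names are assumptions for illustration, not the tool's actual state format.

```python
import json
import tempfile
from pathlib import Path


def load_done(state_file: Path) -> set[tuple[str, ...]]:
    """Read completed (solution, problem) pairs; empty if no state yet."""
    if not state_file.exists():
        return set()
    return {tuple(pair) for pair in json.loads(state_file.read_text())}


def mark_done(state_file: Path, solution: str, problem: str) -> None:
    """Record one finished pair, keeping everything recorded earlier."""
    done = load_done(state_file)
    done.add((solution, problem))
    state_file.write_text(json.dumps(sorted(done)))


def pending(state_file: Path, pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the pairs that still need evaluation."""
    done = load_done(state_file)
    return [pair for pair in pairs if pair not in done]


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"  # hypothetical file name
    todo = [("gpt5.py", "flash_attn"), ("gpt5.py", "cross_entropy")]
    mark_done(state, *todo[0])
    print(pending(state, todo))  # only the cross_entropy pair remains
```

Because finished pairs are only ever added, rerunning with a different solutions directory extends the same record instead of starting over.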
> **Note:** For maintainers, `./scripts/run_eval.sh` is used for full evaluation with private test cases.
## Evaluating and Submitting Results
Reference solutions and full test cases are withheld; we release partial test cases so you can develop and debug locally. For the complete evaluation workflow (preparing solutions, running batch evaluation, viewing results, and submitting to the leaderboard), see [SUBMIT.md](SUBMIT.md). To submit, send your solutions to qmang@berkeley.edu, wenhao.chai@princeton.edu, huanzhimao@berkeley.edu, or zhifei.li@berkeley.edu.
Questions? Join our [Discord](https://discord.gg/k4hd2nU4UE).
## Acknowledgments
Some problems are adapted from [ALE-bench](https://github.com/SakanaAI/ALE-Bench) and [AI-Driven Research for Systems (ADRS)](https://ucbskyadrs.github.io/).
## Citing Us
If you use Frontier-CS in your research, please cite:
```bibtex
@misc{mang2025frontiercsevolvingchallengesevolving,
  title={FrontierCS: Evolving Challenges for Evolving Intelligence},
  author={Qiuyang Mang and Wenhao Chai and Zhifei Li and Huanzhi Mao and
          Shang Zhou and Alexander Du and Hanchen Li and Shu Liu and
          Edwin Chen and Yichuan Wang and Xieting Chu and Zerui Cheng and
          Yuan Xu and Tian Xia and Zirui Wang and Tianneng Shi and
          Jianzhu Yao and Yilong Zhao and Qizheng Zhang and Charlie Ruan and
          Zeyu Shen and Kaiyuan Liu and Runyuan He and Dong Xing and
          Zerui Li and Zirong Zeng and Yige Jiang and Lufeng Cheng and
          Ziyi Zhao and Youran Sun and Wesley Zheng and Meiyuwang Zhang and
          Ruyi Ji and Xuechang Tu and Zihan Zheng and Zexing Chen and
          Kangyang Zhou and Zhaozi Wang and Jingbang Chen and
          Aleksandra Korolova and Peter Henderson and Pramod Viswanath and
          Vijay Ganesh and Saining Xie and Zhuang Liu and Dawn Song and
          Sewon Min and Ion Stoica and Joseph E. Gonzalez and
          Jingbo Shang and Alvin Cheung},
  year={2025},
  eprint={2512.15699},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2512.15699},
}
```