|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen3-32B |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<h1> <a href="http://blog.goedel-prover.com"> <strong>Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date</strong></a></h1> |
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[](http://blog.goedel-prover.com) |
|
|
[](https://github.com/Goedel-LM/Goedel-Prover-V2) |
|
|
[](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) |
|
|
[](https://arxiv.org/abs/2508.03613) |
|
|
[](https://opensource.org/licenses/Apache-2.0) |
|
|
|
|
|
</div> |
|
|
|
|
|
## 1. Introduction |
|
|
|
|
|
We introduce Goedel-Prover-V2, an open-source language model series that sets a new state-of-the-art in automated formal proof generation. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) <strong>Scaffolded data synthesis</strong>: We generate synthetic proof tasks of increasing difficulty to progressively train the model, enabling it to master increasingly complex theorems; (2) <strong>Verifier-guided self-correction</strong>: The model learns to iteratively revise its own proofs by leveraging feedback from Lean’s compiler, closely mimicking how humans refine their work; (3) <strong>Model averaging</strong>: We combine multiple model checkpoints to improve robustness and overall performance. |
|
|
|
|
|
Our small model, Goedel-Prover-V2-8B, reaches 84.6% on MiniF2F test set at Pass@32, matching the performance of prior state-of-the-art DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 on standard mode and 90.4% on self-correction mode, outperforming prior SOTA DeepSeek-Prover-V2-671B and concurrent work Kimina-Prover-72B by a large margin. Additionaly, our flagship model with self-correction solves 64 problems on PutnamBench at Pass@64, securing the 1st on the leaderboard surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by Pass@1024. |
|
|
|
|
|
## 2. Benchmark Performance |
|
|
|
|
|
<strong>Self-correction mode</strong>: Our model improves proof quality by first generating an initial candidate and then using Lean compiler feedback to iteratively revise it. We perform two rounds of self-correction, which remain computationally efficient—the total output length (including the initial proof and two revisions) increases only modestly from the standard 32K to 40K tokens. |
|
|
|
|
|
|
|
|
<style> |
|
|
.fig-row { |
|
|
display: flex; |
|
|
justify-content: space-between; /* spread them out */ |
|
|
align-items: flex-start; /* align tops */ |
|
|
gap: 1rem; /* space between images */ |
|
|
} |
|
|
.fig-row img { |
|
|
display: block; |
|
|
width: 100%; |
|
|
height: auto; |
|
|
} |
|
|
.fig-row .panel { |
|
|
/* override per‐panel width as needed */ |
|
|
/* e.g. .panel-1 { width:25%; } .panel-2 { width:40%; } etc. */ |
|
|
} |
|
|
figure { |
|
|
margin: 0; |
|
|
} |
|
|
figure figcaption { |
|
|
text-align: center; |
|
|
font-size: 0.9em; |
|
|
margin-top: 0.75rem; |
|
|
color: #555; |
|
|
} |
|
|
figure figcaption strong { |
|
|
font-weight: bold; |
|
|
} |
|
|
/* Italicize the rest of the caption */ |
|
|
figure figcaption em { |
|
|
font-style: italic; |
|
|
} |
|
|
</style> |
|
|
|
|
|
<figure> |
|
|
<div class="fig-row"> |
|
|
<div class="panel panel-1" style="width:100%;"> |
|
|
<img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/combined_performance_plots_varied_width.png?raw=true" alt="…"> |
|
|
</div> |
|
|
</div> |
|
|
<figcaption> |
|
|
<strong>Figure 1</strong>: <em>Pass@32 performance on MiniF2F, PutnamBench, and our new MathOlympiadBench containing 360 IMO-level problems.</em> |
|
|
</figcaption> |
|
|
</figure> |
|
|
|
|
|
The charts above demonstrate the state-of-the-art performance of Goedel-Prover-V2. We report all numbers at Pass@32: (1) Across all three datasets, our flagship 32B model, in both standard and self-correction mode, significantly outperforms prior state-of-the-art DeepSeek-Prover-V2-671B and Kimina-Prover-72B; (2) on miniF2F, our 8B model matches the performance of DeepSeek-Prover-V2-671B while being 100 times smaller in model size. |
|
|
|
|
|
|
|
|
<div align="center"> |
|
|
<table style="margin: 0 auto;"> |
|
|
<thead> |
|
|
<tr> |
|
|
<th>#</th> |
|
|
<th>Model</th> |
|
|
<th>num‑solved</th> |
|
|
<th>compute</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>86</strong></td><td><strong>Pass@192</strong></td></tr> |
|
|
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B (self-correction mode)</strong></td><td><strong>57</strong></td><td><strong>Pass@32</strong></td></tr> |
|
|
<tr><td>1</td><td><strong>Goedel-Prover-V2-32B</strong></td><td><strong>43</strong></td><td><strong>Pass@32</strong></td></tr> |
|
|
<tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>47</td><td>Pass@1024</td></tr> |
|
|
<tr><td>2</td><td>DeepSeek‑Prover‑V2-671B</td><td>22</td><td>Pass@32</td></tr> |
|
|
<tr><td>3</td><td>DSP+</td><td>23</td><td>Pass@128</td></tr> |
|
|
<tr><td>4</td><td>Kimina‑Prover‑7B‑Distill</td><td>10</td><td>Pass@192</td></tr> |
|
|
<tr><td>5</td><td>Self-play Theorem Prover</td><td>8</td><td>Pass@3200</td></tr> |
|
|
<tr><td>6</td><td>Goedel-Prover-V1</td><td>7</td><td>Pass@512</td></tr> |
|
|
</tbody> |
|
|
</table> |
|
|
<!-- table caption --> |
|
|
<caption align="bottom"><strong>Table 1</strong>: <em>PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute (pass number) than the previous state-of-the-art.</em> |
|
|
</div> |
|
|
|
|
|
## 3. Compelling Scaling Performance |
|
|
|
|
|
<style> |
|
|
.fig-row { |
|
|
display: flex; |
|
|
justify-content: space-between; /* spread them out */ |
|
|
align-items: flex-start; /* align tops */ |
|
|
gap: 1rem; /* space between images */ |
|
|
} |
|
|
.fig-row img { |
|
|
display: block; |
|
|
width: 100%; |
|
|
height: auto; |
|
|
} |
|
|
.fig-row .panel { |
|
|
/* override per‐panel width as needed */ |
|
|
/* e.g. .panel-1 { width:25%; } .panel-2 { width:40%; } etc. */ |
|
|
} |
|
|
figure { |
|
|
margin: 0; |
|
|
} |
|
|
figure figcaption { |
|
|
text-align: center; |
|
|
font-size: 0.9em; |
|
|
margin-top: 0.75rem; |
|
|
color: #555; |
|
|
} |
|
|
figure figcaption strong { |
|
|
font-weight: bold; |
|
|
} |
|
|
/* Italicize the rest of the caption */ |
|
|
figure figcaption em { |
|
|
font-style: italic; |
|
|
} |
|
|
</style> |
|
|
|
|
|
<figure> |
|
|
<div class="fig-row"> |
|
|
<div class="panel panel-1" style="width:80%;"> |
|
|
<img src="https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/assets/inference_scale_performance.png?raw=true" alt="…"> |
|
|
</div> |
|
|
</div> |
|
|
<figcaption> |
|
|
<strong>Figure 2</strong>: <em>Performance on MiniF2F test set under different sample budgets.</em> |
|
|
</figcaption> |
|
|
</figure> |
|
|
|
|
|
The scaling curves above show that our 32B model consistently outperforms all prior state-of-the-art models across the entire range of inference-time compute budgets. |
|
|
|
|
|
## 4. Model & Dataset Downloads |
|
|
|
|
|
We release our Goedel-Prover-V2 models and the new MathOlympiadBench benchmark to foster future research. |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
| Model | Download | |
|
|
| -------- | -------- | |
|
|
| Goedel-Prover-V2-32B | [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) | |
|
|
| Goedel-Prover-V2-8B | [🤗Download](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-8B) | |
|
|
|
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
| Dataset | Download | |
|
|
| -------- | -------- | |
|
|
| MathOlympiadBench | [🤗Download](https://huggingface.co/datasets/Goedel-LM/MathOlympiadBench) | |
|
|
|
|
|
</div> |
|
|
|
|
|
<strong>MathOlympiadBench</strong> (Math Olympiad Bench) comprises human-verified formalizations of Olympiad-level mathematical competition problems, sourced from [Compfiles](https://dwrensha.github.io/compfiles/imo.html) and [IMOSLLean4 repository](https://github.com/mortarsanjaya/IMOSLLean4). MathOlympiadBench contains 360 problems, including 158 IMO problems from 1959 to 2024, 131 IMO shortlist problems covering 2006 to 2023, 68 regional mathematical Olympiad problems, and 3 additional mathematical puzzles. |
|
|
|
|
|
This model is being released to aid other open-source projects, including those geared towards the upcoming IMO competition. A full paper with all details will be released in the coming weeks. |
|
|
|
|
|
## 5. Quick Start |
|
|
You can directly use [Huggingface's Transformers](https://github.com/huggingface/transformers) for model inference. |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
torch.manual_seed(30) |
|
|
|
|
|
model_id = "Goedel-LM/Goedel-Prover-V2-32B" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True) |
|
|
|
|
|
|
|
|
formal_statement = """ |
|
|
import Mathlib |
|
|
import Aesop |
|
|
|
|
|
set_option maxHeartbeats 0 |
|
|
|
|
|
open BigOperators Real Nat Topology Rat |
|
|
|
|
|
|
|
|
theorem square_equation_solution {x y : ℝ} (h : x^2 + y^2 = 2*x - 4*y - 5) : x + y = -1 := by |
|
|
sorry |
|
|
""".strip() |
|
|
|
|
|
prompt = """ |
|
|
Complete the following Lean 4 code: |
|
|
|
|
|
```lean4 |
|
|
{}``` |
|
|
|
|
|
Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies. |
|
|
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof. |
|
|
""".strip() |
|
|
|
|
|
chat = [ |
|
|
{"role": "user", "content": prompt.format(formal_statement)}, |
|
|
] |
|
|
|
|
|
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device) |
|
|
|
|
|
import time |
|
|
start = time.time() |
|
|
outputs = model.generate(inputs, max_new_tokens=32768) |
|
|
print(tokenizer.batch_decode(outputs)) |
|
|
print(time.time() - start) |
|
|
``` |
|
|
|
|
|
### Cite |
|
|
```bibtex |
|
|
@article{lin2025goedelproverv2, |
|
|
title={Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction}, |
|
|
author={Lin, Yong and Tang, Shange and Lyu, Bohan and Yang, Ziran and Chung, Jui-Hui and Zhao, Haoyu and Jiang, Lai and Geng, Yihan and Ge, Jiawei and Sun, Jingruo and others}, |
|
|
journal={arXiv preprint arXiv:2508.03613}, |
|
|
year={2025} |
|
|
} |
|
|
``` |