Pythagoras-Prover logo

Pythagoras-Prover

Website Paper GitHub License

1. Introduction

We introduce Pythagoras-Prover, a compute-efficient family of open-source large language models for formal theorem proving in Lean 4. The family comprises two autoregressive provers at 4B and 32B parameters, together with Pythagoras-Prover-Diffusion, the first diffusion-based theorem prover, which iteratively refines Lean proofs at inference time. All three models are artefacts of a single methodological approach: a scalable, Lean-verified synthetic data pipeline. At its centre is Augmented Lean Formalisation (ALF), a structured mutation scheme that expands a verified seed corpus into formal variants without per-instance Lean compilation, then re-uses them as a self-distillation signal during training. This design lets careful data construction stand in for raw scale, closing much of the gap between small open provers and their largest counterparts — without relying on inference-time self-correction.

Pythagoras-Prover benchmark overview

2. Model Summary


A Lean-Verified Synthetic Data Pipeline

  • Natural-language problems from general-math and competition sources are autoformalised into Lean and gated on the type-checker using predominantly sub-30B open models, with an auto-informalisation and alignment step discarding faithful-but-wrong formalisations to yield a verified seed corpus partitioned into easy, medium, and hard tiers.
  • A rubric-guided distillation stage re-prompts on each rejected instance to target the specific Lean type-checker error responsible for its failure, lifting autoformalisation success and roughly doubling the verified training set.

Model Training

  • LoRA-only supervised fine-tuning of Qwen3-4B and Qwen3-32B under an 8K context, paired with a dynamic proof-reasoning filter and a difficulty-ordered easy→medium→hard curriculum, followed by reinforcement learning with a Lean-compilation reward and a final continued-SFT stage on the ALF corpus.

Augmented Lean Formalisation

  • ALF emits one structured variant per category — simplification, generalisation, lemma proposal, proof-step decomposition, and reformulation — for every seed statement, replacing per-instance Lean verification with a cheap statement-alignment check and expanding the seed corpus into roughly 2M formal variants.
  • The post-RL prover proves the mutations, and these self-distilled proofs form a corpus that trains both the autoregressive and diffusion provers from a single recipe.

The Smallest Efficient Open-Source Lean Theorem Prover

  • We train Pythagoras-Prover-4B, one of the smallest and most compute-efficient open-source Lean theorem provers to date, reaching 86.07% on MiniF2F-Test at Pass@32 and surpassing the prior state of the art at a fraction of its parameter count.
  • Pythagoras-Prover-Diffusion adapts a block-diffusion formulation with a tactic-based masking objective aligned to the discrete reasoning steps of Lean — to our knowledge the first demonstration that a diffusion language model can verifiably solve Lean theorems at non-trivial rates.

The resulting models set a new bar for compute-efficient formal proving. Pythagoras-Prover-32B achieves state-of-the-art performance among open-source provers, reaching 93.03% on MiniF2F-Test and solving 93 of 672 problems on PutnamBench, while Pythagoras-Prover-4B outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test despite being roughly 167× smaller — with no self-correction or test-time reinforcement learning. We additionally release MiniF2F-ALF, an ALF-mutated companion benchmark on which every evaluated prover degrades.

3. Benchmark Performance

We evaluate Pythagoras-Prover on three Lean 4 benchmarks — MiniF2F-Test, PutnamBench, and the MiniF2F-ALF benchmark we introduce — under a single unified protocol (Lean 4.9.0-rc1, a 30,000-token generation limit, and a verbatim-statement pass criterion). Across all three, Pythagoras-Prover matches or exceeds open-source provers an order of magnitude larger, and does so without inference-time self-correction or test-time reinforcement learning.

Method #Params Pass@32 Pass@1024 Best (N)
Goedel-Prover-SFT7B57.662.7 (3200)
STP7B67.6 (25600)
Kimina-Prover-Preview-72B72B68.8580.74 (8192)
DeepSeek-Prover-V2-7B7B75.682.0 (8192)
DeepSeek-Prover-V2-671B671B82.488.9 (8192)
Kimina-Prover-8B-Distill8B77.86
Kimina-Prover-70B70B84.087.792.2 (TTRL)
Goedel-Prover-V2-8B8B84.687.990.2 (8192)
  + Self-Correction8B86.789.3
Goedel-Prover-V2-32B32B88.191.892.2 (8192)
  + Self-Correction32B90.492.6
Pythagoras-Prover-4B4B86.0788.1189.75 (2048)
Pythagoras-Prover-32B32B89.7592.6293.03 (2048)
Table 1: Pythagoras-Prover-4B exceeds DeepSeek-Prover-V2-671B's pass@8192 result (88.9%) at pass@2048 — a quarter of the budget and ~167× fewer parameters. Pythagoras-Prover-32B sets the strongest reported MiniF2F-Test pass rate without self-correction or test-time RL.

# Model num-solved compute
1Pythagoras-Prover-32B93Pass@2048
1Pythagoras-Prover-32B59Pass@64
1Pythagoras-Prover-32B48Pass@32
2Goedel-Prover-V2-32B (self-correction mode)86Pass@184
2Goedel-Prover-V2-32B (self-correction mode)57Pass@32
2Goedel-Prover-V2-32B43Pass@32
3DeepSeek-Prover-V2-671B47Pass@1024
3DeepSeek-Prover-V2-671B22Pass@32
4DSP+23Pass@128
5Bourbaki14Pass@512
6Kimina-Prover-7B-Distill10Pass@192
7Self-play Theorem Prover8Pass@3200
8Goedel-Prover-SFT7Pass@512
9ABEL (closed-source)7Pass@596
Table 2: PutnamBench leaderboard (problems solved out of 657). Pythagoras-Prover-32B takes the top rank, solving 93 problems at Pass@2048 — 7 more than the previous best (Goedel-Prover-V2-32B, 86 at Pass@184 in self-correction mode) and nearly double DeepSeek-Prover-V2-671B's 47 at Pass@1024, despite being roughly 20× smaller. Seed-Prover (331 solved) is omitted from the ranked rows as it is closed-source with undisclosed test-time compute.

Model Pass@32
DeepSeek-Prover-V2-671B79.71
Goedel-Prover-V2-8B82.58
Goedel-Prover-V2-32B83.61
Pythagoras-Prover-4B83.19
Pythagoras-Prover-32B85.04
Table 3: Performance of current state-of-the-art provers on MiniF2F-ALF (Pass@32, %). As MiniF2F-ALF is introduced in this work, all results are evaluated by us under a unified setup.

4. Model & Dataset Downloads

We release our Pythagoras-Prover models, the training dataset and the new MiniF2F-ALF benchmark for future future research.

Model Download
Pythagoras-Prover-32B Coming Soon
Pythagoras-Prover-4B 🤗Download
Pythagoras-Prover-Diffusion-4B Coming Soon
Dataset Download
Pythagoras-Prover-SFT 🤗Download
Pythagoras-Prover-Distill-4B Coming Soon
Pythagoras-Prover-Distill-32B Coming Soon

5. Quick Start

For model inference., Huggingface's Transformers can directly be used

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Pythagoras-LM/Pythagoras-Prover-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- The volume of a cone is given by the formula $V = \frac{1}{3}Bh$, where $B$ is the area of the base and $h$ is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume? Show that it is 65.-/
theorem mathd_algebra_478 (b h v : ℝ) (h₀ : 0 < b ∧ 0 < h ∧ 0 < v) (h₁ : v = 1 / 3 * (b * h))
(h₂ : b = 30) (h₃ : h = 13 / 2) : v = 65 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:

```lean4
{}```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
  {"role": "user", "content": prompt.format(formal_statement)},
]

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

import time
start = time.time()
outputs = model.generate(inputs, max_new_tokens=8192)
print(tokenizer.batch_decode(outputs))
print(time.time() - start)

Cite

@article{leang2026pythagoras,
  title={Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation},
  author={Leang, Joshua Ong Jun and Zhao, Zheng and Stoian, Mihaela Catalina and Xu, Qiyuan and Li, Haonan and Li, Wenda and Cohen, Shay B. and Giunchiglia, Eleonora},
  journal={arXiv preprint arXiv:2606.12594},
  year={2026}
}
Downloads last month
49
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Pythagoras-LM/Pythagoras-Prover-4B

Finetuned
Qwen/Qwen3-4B
Finetuned
(716)
this model

Dataset used to train Pythagoras-LM/Pythagoras-Prover-4B

Paper for Pythagoras-LM/Pythagoras-Prover-4B