Instructions to use Pythagoras-LM/Pythagoras-Prover-4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Pythagoras-LM/Pythagoras-Prover-4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Pythagoras-LM/Pythagoras-Prover-4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Pythagoras-LM/Pythagoras-Prover-4B")
model = AutoModelForCausalLM.from_pretrained("Pythagoras-LM/Pythagoras-Prover-4B", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use Pythagoras-LM/Pythagoras-Prover-4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Pythagoras-LM/Pythagoras-Prover-4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pythagoras-LM/Pythagoras-Prover-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Pythagoras-LM/Pythagoras-Prover-4B

SGLang

How to use Pythagoras-LM/Pythagoras-Prover-4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Pythagoras-LM/Pythagoras-Prover-4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pythagoras-LM/Pythagoras-Prover-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Pythagoras-LM/Pythagoras-Prover-4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Pythagoras-LM/Pythagoras-Prover-4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Pythagoras-LM/Pythagoras-Prover-4B with Docker Model Runner:
```
docker model run hf.co/Pythagoras-LM/Pythagoras-Prover-4B
```

Pythagoras-Prover

1. Introduction

We introduce Pythagoras-Prover, a compute-efficient family of open-source large language models for formal theorem proving in Lean 4. The family comprises two autoregressive provers at 4B and 32B parameters, together with Pythagoras-Prover-Diffusion, the first diffusion-based theorem prover, which iteratively refines Lean proofs at inference time. All three models are artefacts of a single methodological approach: a scalable, Lean-verified synthetic data pipeline. At its centre is Augmented Lean Formalisation (ALF), a structured mutation scheme that expands a verified seed corpus into formal variants without per-instance Lean compilation, then re-uses them as a self-distillation signal during training. This design lets careful data construction stand in for raw scale, closing much of the gap between small open provers and their largest counterparts — without relying on inference-time self-correction.

2. Model Summary

A Lean-Verified Synthetic Data Pipeline

Natural-language problems from general-math and competition sources are autoformalised into Lean and gated on the type-checker using predominantly sub-30B open models, with an auto-informalisation and alignment step discarding faithful-but-wrong formalisations to yield a verified seed corpus partitioned into easy, medium, and hard tiers.
A rubric-guided distillation stage re-prompts on each rejected instance to target the specific Lean type-checker error responsible for its failure, lifting autoformalisation success and roughly doubling the verified training set.

Model Training

LoRA-only supervised fine-tuning of Qwen3-4B and Qwen3-32B under an 8K context, paired with a dynamic proof-reasoning filter and a difficulty-ordered easy→medium→hard curriculum, followed by reinforcement learning with a Lean-compilation reward and a final continued-SFT stage on the ALF corpus.

Augmented Lean Formalisation

ALF emits one structured variant per category — simplification, generalisation, lemma proposal, proof-step decomposition, and reformulation — for every seed statement, replacing per-instance Lean verification with a cheap statement-alignment check and expanding the seed corpus into roughly 2M formal variants.
The post-RL prover proves the mutations, and these self-distilled proofs form a corpus that trains both the autoregressive and diffusion provers from a single recipe.

The Smallest Efficient Open-Source Lean Theorem Prover

We train Pythagoras-Prover-4B, one of the smallest and most compute-efficient open-source Lean theorem provers to date, reaching 86.07% on MiniF2F-Test at Pass@32 and surpassing the prior state of the art at a fraction of its parameter count.
Pythagoras-Prover-Diffusion adapts a block-diffusion formulation with a tactic-based masking objective aligned to the discrete reasoning steps of Lean — to our knowledge the first demonstration that a diffusion language model can verifiably solve Lean theorems at non-trivial rates.

The resulting models set a new bar for compute-efficient formal proving. Pythagoras-Prover-32B achieves state-of-the-art performance among open-source provers, reaching 93.03% on MiniF2F-Test and solving 93 of 672 problems on PutnamBench, while Pythagoras-Prover-4B outperforms DeepSeek-Prover-V2-671B on MiniF2F-Test despite being roughly 167× smaller — with no self-correction or test-time reinforcement learning. We additionally release MiniF2F-ALF, an ALF-mutated companion benchmark on which every evaluated prover degrades.

3. Benchmark Performance

We evaluate Pythagoras-Prover on three Lean 4 benchmarks — MiniF2F-Test, PutnamBench, and the MiniF2F-ALF benchmark we introduce — under a single unified protocol (Lean 4.9.0-rc1, a 30,000-token generation limit, and a verbatim-statement pass criterion). Across all three, Pythagoras-Prover matches or exceeds open-source provers an order of magnitude larger, and does so without inference-time self-correction or test-time reinforcement learning.

Method	#Params	Pass@32	Pass@1024	Best (N)
Goedel-Prover-SFT	7B	57.6	–	62.7 (3200)
STP	7B	–	–	67.6 (25600)
Kimina-Prover-Preview-72B	72B	68.85	–	80.74 (8192)
DeepSeek-Prover-V2-7B	7B	75.6	–	82.0 (8192)
DeepSeek-Prover-V2-671B	671B	82.4	–	88.9 (8192)
Kimina-Prover-8B-Distill	8B	77.86	–	–
Kimina-Prover-70B	70B	84.0	87.7	92.2 (TTRL)
Goedel-Prover-V2-8B	8B	84.6	87.9	90.2 (8192)
+ Self-Correction	8B	86.7	89.3	–
Goedel-Prover-V2-32B	32B	88.1	91.8	92.2 (8192)
+ Self-Correction	32B	90.4	92.6	–
Pythagoras-Prover-4B	4B	86.07	88.11	89.75 (2048)
Pythagoras-Prover-32B	32B	89.75	92.62	93.03 (2048)

Table 1: Pythagoras-Prover-4B exceeds DeepSeek-Prover-V2-671B's pass@8192 result (88.9%) at pass@2048 — a quarter of the budget and ~167× fewer parameters. Pythagoras-Prover-32B sets the strongest reported MiniF2F-Test pass rate without self-correction or test-time RL.

#	Model	num-solved	compute
1	Pythagoras-Prover-32B	93	Pass@2048
1	Pythagoras-Prover-32B	59	Pass@64
1	Pythagoras-Prover-32B	48	Pass@32
2	Goedel-Prover-V2-32B (self-correction mode)	86	Pass@184
2	Goedel-Prover-V2-32B (self-correction mode)	57	Pass@32
2	Goedel-Prover-V2-32B	43	Pass@32
3	DeepSeek-Prover-V2-671B	47	Pass@1024
3	DeepSeek-Prover-V2-671B	22	Pass@32
4	DSP+	23	Pass@128
5	Bourbaki	14	Pass@512
6	Kimina-Prover-7B-Distill	10	Pass@192
7	Self-play Theorem Prover	8	Pass@3200
8	Goedel-Prover-SFT	7	Pass@512
9	ABEL (closed-source)	7	Pass@596

Table 2: PutnamBench leaderboard (problems solved out of 657). Pythagoras-Prover-32B takes the top rank, solving 93 problems at Pass@2048 — 7 more than the previous best (Goedel-Prover-V2-32B, 86 at Pass@184 in self-correction mode) and nearly double DeepSeek-Prover-V2-671B's 47 at Pass@1024, despite being roughly 20× smaller. Seed-Prover (331 solved) is omitted from the ranked rows as it is closed-source with undisclosed test-time compute.

Model	Pass@32
DeepSeek-Prover-V2-671B	79.71
Goedel-Prover-V2-8B	82.58
Goedel-Prover-V2-32B	83.61
Pythagoras-Prover-4B	83.19
Pythagoras-Prover-32B	85.04

Table 3: Performance of current state-of-the-art provers on MiniF2F-ALF (Pass@32, %). As MiniF2F-ALF is introduced in this work, all results are evaluated by us under a unified setup.

4. Model & Dataset Downloads

We release our Pythagoras-Prover models, the training dataset and the new MiniF2F-ALF benchmark for future future research.

Model	Download
Pythagoras-Prover-32B	🤗Download
Pythagoras-Prover-4B	🤗Download
Pythagoras-Prover-Diffusion-4B	🤗Download

Dataset	Download
Pythagoras-Prover-SFT	🤗Download
Pythagoras-Prover-Distill-4B	🤗Download
Pythagoras-Prover-Distill-32B	🤗Download

5. Quick Start

For model inference., Huggingface's Transformers can directly be used

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Pythagoras-LM/Pythagoras-Prover-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- The volume of a cone is given by the formula $V = \frac{1}{3}Bh$, where $B$ is the area of the base and $h$ is the height. The area of the base of a cone is 30 square units, and its height is 6.5 units. What is the number of cubic units in its volume? Show that it is 65.-/
theorem mathd_algebra_478 (b h v : ℝ) (h₀ : 0 < b ∧ 0 < h ∧ 0 < v) (h₁ : v = 1 / 3 * (b * h))
(h₂ : b = 30) (h₃ : h = 13 / 2) : v = 65 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:

```lean4
{}```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
  {"role": "user", "content": prompt.format(formal_statement)},
]

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

import time
start = time.time()
outputs = model.generate(inputs, max_new_tokens=8192)
print(tokenizer.batch_decode(outputs))
print(time.time() - start)

Cite

@article{leang2026pythagoras,
  title={Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation},
  author={Leang, Joshua Ong Jun and Zhao, Zheng and Stoian, Mihaela Catalina and Xu, Qiyuan and Li, Haonan and Li, Wenda and Cohen, Shay B. and Giunchiglia, Eleonora},
  journal={arXiv preprint arXiv:2606.12594},
  year={2026}
}