File size: 6,397 Bytes
f89fc92
2e034db
f89fc92
 
 
 
 
 
6b3e5d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f89fc92
 
6b3e5d2
 
 
 
 
f30fae6
6b3e5d2
f30fae6
6b3e5d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32cd499
6b3e5d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32cd499
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6b3e5d2
 
 
 
 
 
 
 
 
 
 
 
 
32cd499
6b3e5d2
 
32cd499
 
 
f69e30e
32cd499
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
---
license: apache-2.0
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
---

<div align="center">
  <img src="https://github.com/eth-sri/matharena/blob/main/images/matharena_icon.png?raw=true" width="20%" alt="MathArena" />
</div>
<hr>
<div align="center" style="line-height: 1;">
  <a href="https://proofcorpus.ai/" target="_blank" style="margin: 2px;">
    <img alt="Homepage" src="https://img.shields.io/badge/Homepage-OPC-ffc107?color=00c107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/insait-institute/open-proof-corpus/" target="_blank" style="margin: 2px;">
    <img alt="GitHub" src="https://img.shields.io/badge/GitHub-OPC-ffc107?logo=github&color=c10000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/collections/INSAIT-Institute/open-proof-corpus-6856d1279a7b6f12b83dc886" target="_blank" style="margin: 2px;">
    <img alt="OPC HuggingFace" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OPC-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/insait-institute/open-proof-corpus/blob/master/LICENSE.md" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/license-Apache%20License%202.0-blue" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>


# OPC-R1-8B

We introduce the Open Proof Corpus (OPC)—the world’s first large-scale, open-source dataset of human-verified solutions to advanced mathematics problems. With over 5,000 solutions spanning 1,000+ challenging problems, the OPC is specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. 

Leverage OPC to tackle pressing research questions in automated proof generation:
(1) How do natural language and formal proof generation compare? 
(2) How often do models that produce correct final answers truly reason their way to valid proofs? 
(3) By what margin do best-of-n selection methods improve proof quality?

Building on this breakthrough resource, we present **OPC-R1-8B** - an open-source model for proof correctness judging that matches state-of-the-art performance. OPC-R1-8B outperforms the majority of leading closed-source models, reaching an impressive 88.1% accuracy on verifying LLM-generated proofs.

## Introduction

**OPC-R1-8B** is a specialized, open-source model fine-tuned for proof evaluation. We finetuned *DeepSeek R1 Distill Qwen3 8B (05/28)* using GRPO reward signals from judgment accuracy, using the "LLM-as-judge" prompt detailed in our paper.

## Usage

You can run our model following the example below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "INSAIT-Institute/OPC-R1-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input

problem = "Compute the number of real solutions of the equation $x^2 + 2x + 2 = 0$."
solution = "The equation can be factored in as $x^2 + 2x + 2 = (x+1)^2 + 1$. Because $(x+1)^2 \\geq 0$, then $x^2 + 2x + 2 \\geq 1 > 0$. Therefore, there are \\boxed{0} real solutions."

prompt = "<substitute the evaluation prompt template from the paper>"
messages = [
    {"role": "user", "content": prompt.format(problem=problem, solution=solution)}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print("content:", content)
```

You can also run the model using the inference pipeline in our <a href="https://github.com/insait-institute/open-proof-corpus" style="margin: 2px;">GitHub repository</a>.

## Evaluation results

To benchmark LLMs as judges for proofs, we took 293 solutions from the OPC containing human-labeled ground truths. **OPC-R1-8B** achieves a 88.1% maj@5 accuracy, matching the top-performing **Gemini 2.5 Pro**, and outperforming all other closed- and open-source alternatives.


| **Judge**          | **pass@1** | **maj@5** | **Cost (USD)** |
|--------------------|:----------:|:---------:|:--------------:|
| Human              | 90.4       | -         | N/A            |
| Gemini 2.5 Pro     | 85.4       | 88.1      | 135.47         |
| **OPC-R1-8B**      | 83.8       | 88.1      | N/A            |
| o4-mini            | 83.8       | 85.3      | 29.57          |
| o3                 | 83.1       | 84.3      | 93.3           |
| Qwen3 235B-A22B    | 81.8       | 84.6      | 3.79           |
| DeepSeek R1 (05/28)| 80.9       | 82.6      | 27.35          |
| DeepSeek R1 Distill 8B| 70.7       | 71.3      | N/A            |
| Qwen3-8B           | 64.4       | 63.6      | N/A            |

## Dataset

We have open-sourced the OPC dataset, where we answer many open questions regarding the LLMs’ proof capabilities.

<div align="center">

|            **Dataset**            |                          **Download**                         |
| :-----------------------------: | :----------------------------------------------------------: |
|  OPC  | [🤗 HuggingFace](https://huggingface.co/datasets/INSAIT-Institute/OPC) |
</div>

## License

**OPC-R1-8B**  is released under the permissive Apache 2.0 license.

## Citation
```
@article{openproofcorpus2025,
  title={The Open Proof Corpus: A Large-Scale Human Study of LLM Proofs},
  author={Jasper Dekoninck and Ivo Petrov and  Kristian Minchev and Mislav Balunovic and Martin Vechev and Miroslav Marinov, Maria Drencheva and Lyuba Konova and Milen Milenov Shumanov and Kaloyan Tsvetkov and Nikolay Drenchev and Lazar D. Todorov and Kalina Nikolova and Nikolay Georgiev and Vanesa Kalinkova and Margulan Ismoldayev},
  journal={arXiv},
  year={2025},
}
```