Update README.md
**OPC-R1-8B** is a specialized, open-source model fine-tuned for proof evaluation. We fine-tuned *DeepSeek R1 Distill Qwen3 8B (05/28)* with GRPO, using reward signals derived from judgment accuracy and the "LLM-as-judge" prompt detailed in our paper.
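The GRPO reward here reduces to a binary correctness signal on the judge's verdict. A minimal sketch of such a reward function, where the verdict-parsing convention is an illustrative assumption rather than the paper's exact output format:

```python
def judgment_reward(model_output: str, ground_truth: bool) -> float:
    """Binary GRPO-style reward: 1.0 iff the judge's verdict matches the label.

    Assumes (for illustration) that the judge states "correct" or "incorrect"
    after an optional </think> reasoning block.
    """
    verdict_text = model_output.split("</think>")[-1].lower()
    if "incorrect" in verdict_text:  # test "incorrect" first: it contains "correct"
        verdict = False
    elif "correct" in verdict_text:
        verdict = True
    else:
        return 0.0  # unparseable judgments earn no reward
    return 1.0 if verdict == ground_truth else 0.0
```

During GRPO training, each sampled judgment in a group is scored this way and advantages are computed relative to the group mean.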
## Usage
You can run our model following the example below:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
```

You can also run the model using the inference pipeline in our <a href="https://github.com/insait-institute/open-proof-corpus" style="margin: 2px;">GitHub repository</a>.
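With `enable_thinking=True`, the decoded output carries the model's reasoning before its final judgment; DeepSeek-R1-style models delimit it with a `</think>` tag. A small helper to separate the two, assuming that delimiter convention from the base model family:

```python
def split_thinking(generated: str) -> tuple[str, str]:
    """Split decoded output into (thinking, content) around the </think> tag."""
    tag = "</think>"
    if tag in generated:
        thinking, content = generated.split(tag, 1)
        return thinking.replace("<think>", "").strip(), content.strip()
    return "", generated.strip()  # no reasoning block emitted

thinking, content = split_thinking("<think>Check step 3.</think>The proof is correct.")
print("content:", content)
```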

## Evaluation results

To benchmark LLMs as judges of proofs, we took 293 solutions from the OPC with human-labeled ground-truth verdicts. **OPC-R1-8B** achieves an 88.1% maj@5 accuracy, matching the top-performing **Gemini 2.5 Pro** and outperforming all other closed- and open-source alternatives.

| **Judge**              | **pass@1** | **maj@5** | **Cost (USD)** |
|------------------------|:----------:|:---------:|:--------------:|
| Human                  | 90.4       | -         | N/A            |
| Gemini 2.5 Pro         | 85.4       | 88.1      | 135.47         |
| **OPC-R1-8B**          | 83.8       | 88.1      | N/A            |
| o4-mini                | 83.8       | 85.3      | 29.57          |
| o3                     | 83.1       | 84.3      | 93.30          |
| Qwen3 235B-A22B        | 81.8       | 84.6      | 3.79           |
| DeepSeek R1 (05/28)    | 80.9       | 82.6      | 27.35          |
| DeepSeek R1 Distill 8B | 70.7       | 71.3      | N/A            |
| Qwen3-8B               | 64.4       | 63.6      | N/A            |

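For reference, maj@5 aggregates five independent judgments per proof by majority vote, which is why it can exceed pass@1. A sketch of the metric under that reading (function names are illustrative):

```python
from collections import Counter

def majority_vote(verdicts):
    """Most common verdict among the k samples for one proof (maj@k)."""
    return Counter(verdicts).most_common(1)[0][0]

def maj_at_k_accuracy(sampled_verdicts, labels):
    """sampled_verdicts: one list of k verdicts per proof; labels: ground truth."""
    hits = sum(majority_vote(v) == y for v, y in zip(sampled_verdicts, labels))
    return hits / len(labels)
```

With boolean verdicts and odd k (here k = 5), the vote can never tie.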
## Dataset
We have open-sourced the OPC dataset, which we use to answer many open questions about the proof capabilities of LLMs.
## License
**OPC-R1-8B** is released under the permissive Apache 2.0 license.
## Citation

```bibtex
@article{openproofcorpus2025,
  title={The Open Proof Corpus: A Large-Scale Human Study of LLM Proofs},
  author={Jasper Dekoninck and Ivo Petrov and Kristian Minchev and Miroslav Marinov and Maria Drencheva and Lyuba Konova and Milen Milenov Shumanov and Kaloyan Tsvetkov and Nikolay Drenchev and Lazar D. Todorov and Kalina Nikolova and Nikolay Georgiev and Vanesa Kalinkova and Margulan Ismoldayev and Mislav Balunovic and Martin Vechev},
  journal={arXiv},
  year={2025}
}
```