---
license: apache-2.0
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
---
# OPC-R1-8B
We introduce the Open Proof Corpus (OPC), the world's first large-scale, open-source dataset of human-verified solutions to advanced mathematics problems. With over 5,000 solutions spanning 1,000+ challenging problems, the OPC is designed for broad applicability and downstream usage in proof generation research, and it is the first dataset to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO.
Leverage the OPC to tackle pressing research questions in automated proof generation: (1) How do natural language and formal proof generation compare? (2) How often do models that produce correct final answers truly reason their way to valid proofs? (3) By what margin do best-of-n selection methods improve proof quality?
Building on this breakthrough resource, we present OPC-R1-8B, an open-source model for judging proof correctness that matches state-of-the-art performance. OPC-R1-8B outperforms the majority of leading closed-source models, reaching an impressive 88.1% accuracy on verifying LLM-generated proofs.
## Introduction
OPC-R1-8B is a specialized, open-source model fine-tuned for proof evaluation. We fine-tuned DeepSeek-R1-0528-Qwen3-8B with GRPO, deriving the reward signal from judgment accuracy and using the "LLM-as-judge" prompt detailed in our paper.
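For intuition, here is a minimal sketch of the kind of binary reward such a setup can use. The function name and the verdict format (a final `\boxed{correct}` or `\boxed{incorrect}`) are illustrative assumptions, not the exact reward implementation from the paper.

```python
import re

def judgment_reward(model_output: str, human_label: bool) -> float:
    """Illustrative GRPO-style reward: 1.0 if the judge's verdict matches
    the human-verified correctness label, 0.0 otherwise.

    Assumes (hypothetically) that the prompt asks the judge to end with
    \\boxed{correct} or \\boxed{incorrect}.
    """
    match = re.search(r"\\boxed\{(correct|incorrect)\}", model_output)
    if match is None:
        return 0.0  # unparseable verdicts earn no reward
    predicted_correct = match.group(1) == "correct"
    return 1.0 if predicted_correct == human_label else 0.0
```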
## Evaluation results
To benchmark LLMs as judges for proofs, we evaluated them on 293 solutions from the OPC that carry human-labeled ground-truth correctness judgments. OPC-R1-8B achieves a notable 88.1% maj@5 accuracy (see the metric sketch after the table), matching the top-performing Gemini 2.5 Pro and outperforming all other closed- and open-source alternatives by a clear margin. Notably, OPC-R1-8B outpaces its base model by a full 17% in maj@5 accuracy, highlighting the downstream impact of the high-quality data in the OPC.
| Judge | pass@1 (%) | maj@5 (%) | Cost (USD) |
|---|---|---|---|
| Human | 90.4 | - | N/A |
| Gemini 2.5 Pro | 85.4 | 88.1 | 135.47 |
| OPC-R1-8B | 83.8 | 88.1 | N/A |
| o4-mini | 83.8 | 85.3 | 29.57 |
| o3 | 83.1 | 84.3 | 93.30 |
| Gemini 2.5 Flash | 82.7 | 86.0 | 86.95 |
| Qwen3 235B-A22B | 81.8 | 84.6 | 3.79 |
| DeepSeek R1 (05/28) | 80.9 | 82.6 | 27.35 |
| Qwen3 30B-A3B | 74.0 | 75.4 | N/A |
| DeepSeek R1 Distill 8B | 70.7 | 71.3 | N/A |
| Claude 4.0 Sonnet | 70.6 | 75.0 | 28.21 |
| Qwen3-8B | 64.4 | 63.6 | N/A |
| GPT-4.1 | 61.4 | 60.8 | 20.33 |
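For reference, pass@1 scores a single judgment per solution, while maj@5 takes the majority vote over five independent judgments of the same solution. Below is a minimal sketch of how maj@5 accuracy can be computed; the function and variable names are ours, not from the paper's evaluation code.

```python
from collections import Counter

def maj_at_5(verdicts: list[bool]) -> bool:
    """Majority vote over five independent correct/incorrect judgments."""
    assert len(verdicts) == 5
    return Counter(verdicts).most_common(1)[0][0]

def maj_at_5_accuracy(all_verdicts: list[list[bool]], labels: list[bool]) -> float:
    """Fraction of solutions whose majority verdict matches the human label."""
    hits = sum(maj_at_5(v) == y for v, y in zip(all_verdicts, labels))
    return hits / len(labels)
```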
## Usage
You can run our model following the example below:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "INSAIT-Institute/OPC-R1-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
problem = "Compute the number of real solutions of the equation $x^2 + 2x + 2 = 0$."
solution = "The equation can be rewritten as $x^2 + 2x + 2 = (x+1)^2 + 1$. Because $(x+1)^2 \\geq 0$, then $x^2 + 2x + 2 \\geq 1 > 0$. Therefore, there are \\boxed{0} real solutions."
prompt = "<substitute the evaluation prompt template from the paper>"
messages = [
    {"role": "user", "content": prompt.format(problem=problem, solution=solution)}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes (default: True)
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("content:", content)
```
You can also run the model using the inference pipeline in our GitHub repository.
## Dataset
We have also open-sourced the OPC dataset, which we use to answer many open questions about LLMs' proof capabilities.
| Dataset | Download |
|---|---|
| OPC | 🤗 HuggingFace |
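To load the OPC with the 🤗 `datasets` library, something like the snippet below should work. The repository ID `INSAIT-Institute/OPC` and the `train` split are assumptions based on the model's namespace; check the dataset card for the exact name and layout.

```python
from datasets import load_dataset

# the repository ID and split below are assumptions; see the dataset card
# linked above for the exact name and available splits
opc = load_dataset("INSAIT-Institute/OPC", split="train")
print(opc[0])  # e.g. problem statement, candidate proof, human correctness label
```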
## License
OPC-R1-8B is released under the permissive Apache 2.0 license. Please review and respect the license when using the model.
## Citation
TODO