---
library_name: transformers
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Math-7B
---

# Qwen2.5-Math-7B-Oat-Zero

## Links

- 📜 [Paper](https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf)
- 💻 [GitHub](https://github.com/sail-sg/understand-r1-zero)
- 🤗 [Oat-Zero Collection](https://huggingface.co/collections/sail/oat-zero-understanding-r1-zero-like-training-67dcdb07b9f3eb05f1501c4a)

## Introduction

This model is trained with the minimalist R1-Zero recipe introduced in our paper: 
- **Algorithm**: Dr. GRPO
- **Data**: level 3-5 questions from the MATH dataset
- **Base model**: [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B)
- **Template**: Qwen-Math
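
For intuition, the group-relative advantage used by Dr. GRPO reduces to a simple group-mean baseline: unlike vanilla GRPO, there is no division by the group's reward std (and, in the loss, no per-response length normalization). A minimal illustrative sketch (not the training code):

```python
def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantages for a group of responses sampled for one question.

    Dr. GRPO keeps GRPO's group-mean baseline but drops the std
    normalization (and the per-response length normalization in the loss).
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled responses, rewarded 1 if correct and 0 otherwise.
print(dr_grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```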

Evaluation results on widely used math benchmarks are shown below:

<img src="https://raw.githubusercontent.com/sail-sg/understand-r1-zero/refs/heads/main/assets/benchmark_table.png" width=100%/>

## Usage

```python
import vllm


def apply_qwen_math_template(question: str):
    return (
        "<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\n"
        + question
        + "<|im_end|>\n<|im_start|>assistant\n"
    )

def apply_r1_template(question: str):
    return (
        "A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. "
        "The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.\nUser: "
        + question
        + "\nAssistant: <think>"
    )

model_name = "sail/Qwen2.5-Math-7B-Oat-Zero"

# Greedy decoding: one sample per prompt, up to 3000 new tokens.
sampling_params = vllm.SamplingParams(
    n=1,
    temperature=0,
    top_p=1,
    max_tokens=3000,
)

model = vllm.LLM(
    model_name,
    max_model_len=4096,
    dtype="bfloat16",
    enable_prefix_caching=True,
)

# The Llama variant of Oat-Zero uses the R1 template; the Qwen models use
# the Qwen-Math template.
if "Llama-3.2-3B-Oat-Zero" in model_name:
    apply_template = apply_r1_template
else:
    apply_template = apply_qwen_math_template

prompts = [
    "How many positive whole-number divisors does 196 have?"
]
prompts = list(map(apply_template, prompts))
outputs = model.generate(prompts, sampling_params)

# Each RequestOutput holds the sampled completions; print the generated text.
for output in outputs:
    print(output.outputs[0].text)
```
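
Since the system prompt asks the model to put its final answer within `\boxed{}`, a small brace-aware helper (a hypothetical utility, not part of this repo) can extract the answer from the generated text:

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, handling
    nested braces (e.g. \\boxed{\\frac{1}{2}}); None if no box is found."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(c)
        i += 1
    return None  # unbalanced braces


print(extract_boxed_answer("196 = 2^2 * 7^2, so the answer is \\boxed{9}."))  # 9
```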

## Citation

```bibtex
@article{liu2025understanding,
  title={Understanding r1-zero-like training: A critical perspective},
  author={Liu, Zichen and Chen, Changyu and Li, Wenjun and Qi, Penghui and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal={arXiv preprint arXiv:2503.20783},
  year={2025}
}
```