---
language:
- en
library_name: transformers
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
license: apache-2.0
datasets:
- DeepMath-103K
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: pass@1
      value: 53.65
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2025
      type: text
    metrics:
    - type: pass@1
      value: 35.42
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AMC 2023
      type: text
    metrics:
    - type: pass@1
      value: 90.39
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: MATH500
      type: text
    metrics:
    - type: pass@1
      value: 92.53
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Minerva
      type: text
    metrics:
    - type: pass@1
      value: 40.00
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Olympiad
      type: text
    metrics:
    - type: pass@1
      value: 65.72
---
<div align="center">
<span style="font-family: default; font-size: 1.5em;">🚀 DeepSearch-1.5B</span>
</div>

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)**, enhanced by **Monte Carlo Tree Search (MCTS)**.
Unlike prior approaches that restrict structured search to inference, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **72× more compute-efficient** than extended RL training baselines.

![Illustration of the DeepSearch algorithm](./deepsearch.png)

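To give a flavor of the search component, here is a minimal UCT-style MCTS loop on a toy problem (choosing a bit string that maximizes the fraction of 1s). This is an illustrative sketch only, not the released training code; all names and constants here are hypothetical.

```python
import math
import random

class Node:
    """One node in the search tree: a partial solution (sequence of choices)."""
    def __init__(self, state, parent=None):
        self.state = state          # tuple of 0/1 choices made so far
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.value = 0.0            # accumulated rollout reward

    def uct_score(self, c=1.4):
        # Unvisited children are explored first
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(depth=4, iterations=200, seed=0):
    """Toy MCTS: pick the first bit of a length-`depth` string maximizing the fraction of 1s."""
    random.seed(seed)
    root = Node(())
    for _ in range(iterations):
        # 1) Selection: descend via UCT while the node is fully expanded
        node = root
        while len(node.children) == 2 and len(node.state) < depth:
            node = max(node.children.values(), key=Node.uct_score)
        # 2) Expansion: add one untried child
        if len(node.state) < depth:
            action = random.choice([a for a in (0, 1) if a not in node.children])
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3) Simulation: random rollout to a terminal state
        rollout = list(node.state)
        while len(rollout) < depth:
            rollout.append(random.choice((0, 1)))
        reward = sum(rollout) / depth
        # 4) Backpropagation: update statistics along the path
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Greedy action at the root after search
    return max(root.children, key=lambda a: root.children[a].visits)
```

In DeepSearch the analogues are far richer (partial reasoning traces as states, a verifier as the reward, frontier selection and entropy guidance instead of plain UCT), but the select/expand/simulate/backpropagate skeleton is the same.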
---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University
- **Paper**: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B
- **Framework**: veRL
- **License**: Apache-2.0

---

## Quickstart

### Environment
```
pip install vllm          # vllm>=v0.8.5.post1 should work
pip install transformers  # transformers>=4.52.4 should work
```

### Using vLLM to generate
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# NOTE: the model path is illustrative; point it at your local or hub copy of DeepSearch-1.5B
model_path = "DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LLM(model=model_path)

# Sampling settings are illustrative, not prescribed values
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)

messages = [{"role": "user", "content": "Find the sum of the first 100 positive integers."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```

## Performance

| Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B |
|-----------|--------------------------|-----------------|
| AIME 2024 | 51.77 | **53.65** |
| AIME 2025 | 32.92 | **35.42** |
| AMC 2023 | 88.83 | **90.39** |
| MATH500 | 92.24 | **92.53** |
| Minerva | 39.75 | **40.00** |
| Olympiad | 64.69 | **65.72** |
| **Average** | 61.70 | **62.95** |

DeepSearch improves average accuracy by **+1.25 points** over the best prior 1.5B model while using **5.7× fewer GPU hours**.

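As a quick sanity check, the averages in the table are the plain means of the six benchmark scores:

```python
deepsearch = [53.65, 35.42, 90.39, 92.53, 40.00, 65.72]
baseline = [51.77, 32.92, 88.83, 92.24, 39.75, 64.69]

avg_ds = round(sum(deepsearch) / len(deepsearch), 2)
avg_base = round(sum(baseline) / len(baseline), 2)
print(avg_ds, avg_base, round(avg_ds - avg_base, 2))  # 62.95 61.7 1.25
```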
## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)
- **Training steps**: 100
- **Search strategy**:
  - Global Frontier Selection
  - Entropy-based guidance
  - Replay buffer with solution caching
- **Hardware**: 16× NVIDIA H100 (96GB)
- **Compute**: ~330 GPU hours

---

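The replay-buffer-with-solution-caching idea can be sketched as follows. This is a hypothetical simplification (class and parameter names are invented, not from the released veRL code): verified solutions found during search are cached per problem, so later steps can replay them instead of re-searching already-solved problems.

```python
import random

class SolutionReplayBuffer:
    """Hypothetical sketch: cache verified solutions per problem and replay them."""
    def __init__(self, capacity_per_problem=4):
        self.capacity = capacity_per_problem
        self.cache = {}  # problem_id -> list of (solution_text, reward)

    def add(self, problem_id, solution, reward):
        # Only cache solutions that passed the verifier (reward == 1.0 under RLVR)
        if reward < 1.0:
            return
        entries = self.cache.setdefault(problem_id, [])
        if len(entries) < self.capacity:
            entries.append((solution, reward))

    def has_solution(self, problem_id):
        return bool(self.cache.get(problem_id))

    def sample(self, problem_id):
        # Replay a cached verified solution for this problem
        return random.choice(self.cache[problem_id])

buf = SolutionReplayBuffer()
buf.add("p1", "x = 2", 1.0)   # verified -> cached
buf.add("p2", "x = 5", 0.0)   # failed verification -> dropped
print(buf.has_solution("p1"), buf.has_solution("p2"))  # True False
```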
## Ethical Considerations

- **Positive**: Reduces training costs and carbon footprint.
- **Risks**: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).
- **Transparency**: Full implementation and training details are released for reproducibility.

---

## Citation

```bibtex
@inproceedings{wu2026deepsearch,
  title={DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author={Fang Wu and Weihao Xuan and Heli Qi and Ximing Lu and Aaron Tu and Li Erran Li and Yejin Choi},
  booktitle={arXiv},
  year={2026}
}
```