---
datasets:
- DeepMath-103K
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: avg@32
      value: 53.65
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AIME 2025
      type: text
    metrics:
    - type: avg@32
      value: 35.42
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AMC 2023
      type: text
    metrics:
    - type: avg@32
      value: 90.39
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: MATH500
      type: text
    metrics:
    - type: avg@32
      value: 92.53
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: Minerva
      type: text
    metrics:
    - type: avg@32
      value: 40.0
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: Olympiad
      type: text
    metrics:
    - type: avg@32
      value: 65.72
---

<div align="center">
<span style="font-family: default; font-size: 1.5em;">🚀 DeepSearch-1.5B</span>
</div>

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)**, enhanced by **Monte Carlo Tree Search (MCTS)**.
Unlike prior approaches that restrict structured search to inference, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **5.7× more compute-efficient** than extended RL training baselines.



---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University
- **Paper**: [DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search](https://huggingface.co/papers/2509.25454)
- **Code**: [GitHub](https://github.com/smiles724/DeepSearch)
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B
- **Framework**: veRL
- **License**: Apache-2.0

---

## Quickstart

### Environment

```
pip install vllm  # vllm>=v0.8.5.post1 should work
pip install transformers  # transformers>=4.52.4 should work
```

### Using vLLM to generate

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def convert_question_to_messages(question: str):
    # Wrap the question in a single user turn and append the
    # step-by-step / boxed-answer instruction.
    messages = [
        {
            "role": "user",
            "content": question
            + " Let's think step by step and output the final answer within \\boxed{}.",
        }
    ]
    return messages


model_id = "fangwu97/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

model = LLM(
    model=model_id,
    tensor_parallel_size=1,
)

# Render the chat messages into a single prompt string for vLLM.
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages(
        "Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."
    ),
    add_generation_prompt=True,
    tokenize=False,
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```
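
Because the prompt asks for the final answer inside `\boxed{}`, it is usually convenient to pull that answer out of the generated text. The helper below is a minimal sketch and not part of the released code: `extract_last_boxed` is a hypothetical name, and it assumes the last `\boxed{...}` in the response holds the final answer. The sample string is hand-written for illustration, not real model output.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: return what was collected.
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: no complete boxed answer


# Hand-written sample response (an assumption, not model output).
sample = "The valid bases are 21 and 49, so the answer is \\boxed{70}."
print(extract_last_boxed(sample))
```

Tracking brace depth rather than using a simple regular expression keeps nested LaTeX such as `\boxed{\frac{1}{2}}` intact.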

## Performance

| Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B |
|-----------|--------------------------|-----------------|
| AIME 2024 | 51.77 | **53.65** |
| AIME 2025 | 32.92 | **35.42** |
| AMC 2023 | 88.83 | **90.39** |
| MATH500 | 92.24 | **92.53** |
| Minerva | 39.75 | **40.00** |
| Olympiad | 64.69 | **65.72** |
| **Average** | 61.70 | **62.95** |

DeepSearch improves average accuracy by **+1.25 points** over the best prior 1.5B model, while using **5.7× fewer GPU hours** than extended RL training.
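
The averages in the table can be re-derived from the per-benchmark rows. The sketch below does that, and also shows one common reading of the avg@32 metric named in the card's metadata. That reading (mean accuracy over the sampled responses for each problem, then averaged over problems) is an assumption, and `avg_at_k` is a hypothetical helper, not the authors' evaluation code.

```python
from statistics import mean

# Scores copied from the table above: (Nemotron-RR-Qwen-1.5B v2, DeepSearch-1.5B).
scores = {
    "AIME 2024": (51.77, 53.65),
    "AIME 2025": (32.92, 35.42),
    "AMC 2023": (88.83, 90.39),
    "MATH500": (92.24, 92.53),
    "Minerva": (39.75, 40.00),
    "Olympiad": (64.69, 65.72),
}

baseline_avg = mean(b for b, _ in scores.values())
deepsearch_avg = mean(d for _, d in scores.values())
delta = deepsearch_avg - baseline_avg
print(f"baseline={baseline_avg:.2f} deepsearch={deepsearch_avg:.2f} delta={delta:+.2f}")
# prints: baseline=61.70 deepsearch=62.95 delta=+1.25


def avg_at_k(correct):
    """Assumed avg@k: per-problem mean correctness over k samples, averaged over problems.

    `correct` is a list with one entry per problem; each entry is a list of
    k booleans saying whether each sampled response was correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return sum(per_problem) / len(per_problem)
```

For avg@32, each inner list passed to `avg_at_k` would hold 32 booleans, one per sampled response.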

## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)
- **Training steps**: 100
- **Search strategy**:
  - Global Frontier Selection
  - Entropy-based guidance
  - Replay buffer with solution caching
- **Hardware**: 16× NVIDIA H100 (96GB)
- **Compute**: ~330 GPU hours
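
As a rough sanity check on the compute figures, the arithmetic below converts the GPU-hour budget into wall-clock time and into the baseline budget implied by the 5.7× efficiency claim. This is a back-of-the-envelope sketch, not a number reported by the authors: it assumes all 16 GPUs were busy for the whole run and that the 5.7× ratio is measured in GPU hours.

```python
gpu_hours = 330          # "~330 GPU hours" from the list above
num_gpus = 16            # "16× NVIDIA H100" from the list above
efficiency_ratio = 5.7   # compute-efficiency factor quoted in this card

# Assumption: every GPU is in use for the full duration of training.
wall_clock_hours = gpu_hours / num_gpus

# Assumption: "5.7× more compute-efficient" compares total GPU hours.
implied_baseline_gpu_hours = gpu_hours * efficiency_ratio

print(f"wall clock: ~{wall_clock_hours:.1f} h")
print(f"implied extended-RL baseline: ~{implied_baseline_gpu_hours:.0f} GPU hours")
```

Under these assumptions the run takes roughly 20.6 hours of wall-clock time, and the extended RL baseline would need on the order of 1,900 GPU hours.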

---

## Ethical Considerations

- Positive: Reduces training costs and carbon footprint.
- Risks: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).
- Transparency: Full implementation and training details are released for reproducibility.

---

## Citation

```bibtex
@misc{wu2025deepsearch,
  title         = {DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author        = {Wu, Fang and Xuan, Weihao and Qi, Heli and Lu, Ximing and Tu, Aaron and Li, Li Erran and Choi, Yejin},
  year          = {2025},
  eprint        = {2509.25454},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  doi           = {10.48550/arXiv.2509.25454},
}
```