---
language:
- en
library_name: transformers
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
license: apache-2.0
datasets:
- DeepMath-103K
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: pass@1
      value: 53.65
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2025
      type: text
    metrics:
    - type: pass@1
      value: 35.42
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AMC 2023
      type: text
    metrics:
    - type: pass@1
      value: 90.39
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: MATH500
      type: text
    metrics:
    - type: pass@1
      value: 92.53
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Minerva
      type: text
    metrics:
    - type: pass@1
      value: 40.00
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Olympiad
      type: text
    metrics:
    - type: pass@1
      value: 65.72
---
<div align="center">
<span style="font-family: default; font-size: 1.5em;">🚀 DeepSearch-1.5B</span>
</div>

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)**, enhanced by **Monte Carlo Tree Search (MCTS)**.
Unlike prior approaches that restrict structured search to inference, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **72× more compute-efficient** than extended RL training baselines.

![Illustration of the DeepSearch algorithm](./deepsearch.png)

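To give a flavor of the search component, here is a minimal UCT-style MCTS loop on a toy problem (choosing a bit string that maximizes the fraction of 1s). This is an illustrative sketch only, not the released training code; all names and constants here are hypothetical.

```python
import math
import random

class Node:
    """One node in the search tree: a partial solution (sequence of choices)."""
    def __init__(self, state, parent=None):
        self.state = state          # tuple of 0/1 choices made so far
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.value = 0.0            # accumulated rollout reward

    def uct_score(self, c=1.4):
        # Unvisited children are explored first
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(depth=4, iterations=200, seed=0):
    """Toy MCTS: pick the first bit of a length-`depth` string maximizing the fraction of 1s."""
    random.seed(seed)
    root = Node(())
    for _ in range(iterations):
        # 1) Selection: descend via UCT while the node is fully expanded
        node = root
        while len(node.children) == 2 and len(node.state) < depth:
            node = max(node.children.values(), key=Node.uct_score)
        # 2) Expansion: add one untried child
        if len(node.state) < depth:
            action = random.choice([a for a in (0, 1) if a not in node.children])
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3) Simulation: random rollout to a terminal state
        rollout = list(node.state)
        while len(rollout) < depth:
            rollout.append(random.choice((0, 1)))
        reward = sum(rollout) / depth
        # 4) Backpropagation: update statistics along the path
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Greedy action at the root after search
    return max(root.children, key=lambda a: root.children[a].visits)
```

In DeepSearch the analogues are far richer (partial reasoning traces as states, a verifier as the reward, frontier selection and entropy guidance instead of plain UCT), but the select/expand/simulate/backpropagate skeleton is the same.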
---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University
- **Paper**: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B
- **Framework**: veRL
- **License**: Apache-2.0

---

## Quickstart

### Environment
```
pip install vllm          # vllm>=v0.8.5.post1 should work
pip install transformers  # transformers>=4.52.4 should work
```

### Using vLLM to generate
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# NOTE: the model path is illustrative; point it at your local or hub copy of DeepSearch-1.5B
model_path = "DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LLM(model=model_path)

# Sampling settings are illustrative, not prescribed values
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)

messages = [{"role": "user", "content": "Find the sum of the first 100 positive integers."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```

## Performance

| Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B |
|-----------|--------------------------|-----------------|
| AIME 2024 | 51.77 | **53.65** |
| AIME 2025 | 32.92 | **35.42** |
| AMC 2023 | 88.83 | **90.39** |
| MATH500 | 92.24 | **92.53** |
| Minerva | 39.75 | **40.00** |
| Olympiad | 64.69 | **65.72** |
| **Average** | 61.70 | **62.95** |

DeepSearch improves average accuracy by **+1.25 points** over the best prior 1.5B model while using **5.7× fewer GPU hours**.

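As a quick sanity check, the averages in the table are the plain means of the six benchmark scores:

```python
deepsearch = [53.65, 35.42, 90.39, 92.53, 40.00, 65.72]
baseline = [51.77, 32.92, 88.83, 92.24, 39.75, 64.69]

avg_ds = round(sum(deepsearch) / len(deepsearch), 2)
avg_base = round(sum(baseline) / len(baseline), 2)
print(avg_ds, avg_base, round(avg_ds - avg_base, 2))  # 62.95 61.7 1.25
```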
## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)
- **Training steps**: 100
- **Search strategy**:
  - Global Frontier Selection
  - Entropy-based guidance
  - Replay buffer with solution caching
- **Hardware**: 16× NVIDIA H100 (96GB)
- **Compute**: ~330 GPU hours

---

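The replay-buffer-with-solution-caching idea can be sketched as follows. This is a hypothetical simplification (class and parameter names are invented, not from the released veRL code): verified solutions found during search are cached per problem, so later steps can replay them instead of re-searching already-solved problems.

```python
import random

class SolutionReplayBuffer:
    """Hypothetical sketch: cache verified solutions per problem and replay them."""
    def __init__(self, capacity_per_problem=4):
        self.capacity = capacity_per_problem
        self.cache = {}  # problem_id -> list of (solution_text, reward)

    def add(self, problem_id, solution, reward):
        # Only cache solutions that passed the verifier (reward == 1.0 under RLVR)
        if reward < 1.0:
            return
        entries = self.cache.setdefault(problem_id, [])
        if len(entries) < self.capacity:
            entries.append((solution, reward))

    def has_solution(self, problem_id):
        return bool(self.cache.get(problem_id))

    def sample(self, problem_id):
        # Replay a cached verified solution for this problem
        return random.choice(self.cache[problem_id])

buf = SolutionReplayBuffer()
buf.add("p1", "x = 2", 1.0)   # verified -> cached
buf.add("p2", "x = 5", 0.0)   # failed verification -> dropped
print(buf.has_solution("p1"), buf.has_solution("p2"))  # True False
```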
## Ethical Considerations

- **Positive**: Reduces training costs and carbon footprint.
- **Risks**: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).
- **Transparency**: Full implementation and training details are released for reproducibility.

---

## Citation

```bibtex
@inproceedings{wu2026deepsearch,
  title={DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author={Fang Wu and Weihao Xuan and Heli Qi and Ximing Lu and Aaron Tu and Li Erran Li and Yejin Choi},
  booktitle={arXiv},
  year={2026}
}
```