|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- array/SAT |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- accuracy |
|
|
base_model: |
|
|
- Qwen/Qwen2-VL-2B |
|
|
tags: |
|
|
- r1 |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
# VisualThinker-R1-Zero |
|
|
<!-- markdownlint-disable first-line-h1 --> |
|
|
<!-- markdownlint-disable html --> |
|
|
<!-- markdownlint-disable no-duplicate-header --> |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://multimodal-r1.s3.us-west-1.amazonaws.com/TurningPoint.png" width="20%" alt="TurningPoint" /> |
|
|
</div> |
|
|
<hr> |
|
|
<div align="center" style="line-height: 1;"> |
|
|
<a href="https://www.turningpoint-ai.com/" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Homepage" src="https://img.shields.io/badge/🐳Homepage-TurningPointAI-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> |
|
|
<!-- <a href="https://chat.deepseek.com/" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20R1-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> --> |
|
|
<a href="https://huggingface.co/turningpoint-ai" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-TurningPoint%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> |
|
|
<!-- <a href="https://discord.gg/Tc7c45Zzu5" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> --> |
|
|
<!-- <a href="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/qr.jpeg?raw=true" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> --> |
|
|
<a href="https://twitter.com/deepseek_ai" target="_blank" style="margin: 2px;"> |
|
|
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-TurningPoint_AI-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
<!-- <div align="center" style="line-height: 1;"> |
|
|
<a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE" style="margin: 2px;"> |
|
|
<img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/> |
|
|
</a> |
|
|
</div> --> |
|
|
|
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://arxiv.org/pdf/2503.05132"><b>Paper Link</b>👁️</a> |
|
|
</p> |
|
|
|
|
|
|
|
|
## 🚀 Introduction |
|
|
|
|
|
The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by roughly 30% and exceeding both SFT settings by roughly 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL to instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero
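
A rule-based reward of this kind is simple enough to express in a few lines. Below is a minimal sketch, assuming an R1-style template where the model wraps its reasoning in `<think>` tags and its final answer in `<answer>` tags; the exact reward shaping used in our runs lives in the project repository, so treat this as an illustration rather than the authoritative implementation.

```python
import re

def format_reward(response: str) -> float:
    # 1.0 if the response follows the <think>...</think><answer>...</answer>
    # template exactly, else 0.0.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # 1.0 if the extracted answer matches the ground-truth label, else 0.0.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def rule_based_reward(response: str, ground_truth: str) -> float:
    # The total reward is a plain sum of the format and accuracy terms.
    return format_reward(response) + accuracy_reward(response, ground_truth)
```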
|
|
|
|
|
|
|
|
|
|
## 🔮 Highlights |
|
|
1. We are the **first to successfully produce the emergent “aha moment” and increased response length** for multimodal reasoning on just a **non-SFT 2B model**. |
|
|
2. We showed that **vision-centric** tasks could also benefit from improved reasoning capabilities. |
|
|
|
|
|
Similar to DeepSeek-R1, self-reflection behavior is also observed during our RL training on vision-centric reasoning tasks. The model exhibits an emergent ability to rethink and correct its mistakes:
|
|
|
|
|
``` |
|
|
. . . |
|
|
Therefore, dark brown wooden bed with white blanket is not above the doorway. |
|
|
But wait! I can think of something else. |
|
|
Maybe it's just higher than above the doorway, but slightly lower than above the doorway. |
|
|
. . . |
|
|
``` |
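
A lightweight way to watch for these signals during training is to track response length alongside the frequency of self-reflection phrases. The heuristic below is an illustration, not the paper's measurement code; the phrase list is an assumption and can be extended.

```python
# Heuristic monitor for "aha moment" signals: mean response length and the
# fraction of responses containing a self-reflection phrase.
REFLECTION_PHRASES = ("but wait", "wait,", "let me rethink", "on second thought")

def reflection_stats(responses: list[str]) -> dict[str, float]:
    lengths = [len(r.split()) for r in responses]
    reflective = sum(
        any(phrase in r.lower() for phrase in REFLECTION_PHRASES)
        for r in responses
    )
    return {
        "mean_length_words": sum(lengths) / max(len(lengths), 1),
        "reflection_rate": reflective / max(len(responses), 1),
    }
```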
|
|
## ⚙️ Requirements and Installation |
|
|
* Python >= 3.10 |
|
|
* PyTorch == 2.0.1
|
|
* CUDA Version >= 11.7 |
|
|
* Install required packages: |
|
|
```bash |
|
|
# install transformers |
|
|
pip install git+https://github.com/huggingface/transformers |
|
|
# install qwen-vl utils |
|
|
pip install qwen-vl-utils |
|
|
``` |
|
|
|
|
|
## 💻 Model Downloads and Usage |
|
|
|
|
|
```python
# Load the model and processor directly from the Hugging Face Hub
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
```
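
Continuing from the loading snippet above, the sketch below prepares an image-and-text input and runs generation, following the standard Qwen2-VL / `qwen-vl-utils` pattern; the image path and question are placeholders, and the generation settings are illustrative rather than the ones used in our experiments.

```python
import torch
from qwen_vl_utils import process_vision_info

# Build a chat-style message with one image and one question (placeholders).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": "Is the bed above the doorway?"},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Generate, then decode only the newly produced tokens.
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```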
|
|
|
|
|
## 📰 Evaluation Results |
|
|
|
|
|
On CV-Bench, VisualThinker-R1-Zero achieves **59.47%** accuracy, outperforming the Qwen2-VL-2B base model by roughly 30% and exceeding the SFT setting by roughly 2%. Please refer to the [paper](https://arxiv.org/abs/2503.05132) for the full evaluation setup and additional results.
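
For reference, below is a minimal sketch of scoring multiple-choice responses (CV-Bench questions come with lettered options); it assumes answers can be recovered from an `<answer>` span or a bare option letter, and it is an illustration rather than the evaluation harness behind the reported numbers.

```python
import re

def extract_choice(response: str) -> str | None:
    # Prefer an <answer>...</answer> span if present, else fall back to the
    # whole response; then match an option letter like "(A)" or "A".
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    text = m.group(1) if m else response
    letter = re.search(r"\(?([A-D])\)?", text.strip())
    return letter.group(1) if letter else None

def accuracy(responses: list[str], labels: list[str]) -> float:
    # Fraction of responses whose extracted option matches the gold label.
    correct = sum(extract_choice(r) == label for r, label in zip(responses, labels))
    return correct / max(len(labels), 1)
```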
|
|
|
|
|
## 🙌 Stay Connected! |
|
|
|
|
|
We are always open to engaging discussions, collaborations, or even just sharing a virtual coffee. To get in touch or join our team, visit [TurningPoint AI](https://www.turningpoint-ai.com/)'s homepage for contact information. |
|
|
|
|
|
## 📖 Acknowledgements |
|
|
|
|
|
We sincerely thank [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), [QwenVL](https://github.com/QwenLM/Qwen2.5-VL), [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal), [R1-V](https://github.com/Deep-Agent/R1-V), [SAT](https://arxiv.org/abs/2412.07755), and [CV-Bench](https://cambrian-mllm.github.io/) for providing open source resources that laid the foundation of our project. |
|
|
|
|
|
## 🤝 Contributors |
|
|
|
|
|
Here are the key contributors from [TurningPoint AI](https://www.turningpoint-ai.com/) to this project: |
|
|
|
|
|
[Hengguang Zhou](https://hengguangzhou.github.io/)<sup>1*</sup>, [Xirui Li](https://xirui-li.github.io/)<sup>1*</sup>, [Ruochen Wang](https://ruocwang.github.io/)<sup>1†</sup>, [Minhao Cheng](https://cmhcbb.github.io/)<sup>2</sup>, [Tianyi Zhou](https://tianyizhou.github.io/)<sup>3</sup> and [Cho-Jui Hsieh](https://web.cs.ucla.edu/~chohsieh/)<sup>1,4</sup>
|
|
|
|
|
<sup>*</sup> Project Leads, <sup>†</sup> Main Advisor |
|
|
<sup>1</sup>University of California, Los Angeles, <sup>2</sup>Penn State University, <sup>3</sup>University of Maryland and <sup>4</sup>Google Research |
|
|
|
|
|
## ✏️ Citation |
|
|
``` |
|
|
@misc{zhou2025r1zerosahamomentvisual, |
|
|
title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model}, |
|
|
author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh}, |
|
|
year={2025}, |
|
|
eprint={2503.05132}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.AI}, |
|
|
url={https://arxiv.org/abs/2503.05132}, |
|
|
}
```