Improve model card: add pipeline tag, library name, and comprehensive details

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +75 -11
README.md CHANGED
@@ -1,17 +1,23 @@
 ---
- license: apache-2.0
- language:
- - en
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 ---

-
 ![title](./assets/title.png)

 <div align="center">
 <a href="https://github.com/linkangheng/Open-Vision-Reasoner/tree/main/paper/Open-Vision-Reasoner.pdf" target="blank" style="margin-right: 10px;">
- <img alt="arXiv" src="https://img.shields.io/badge/arXiv-OVR-red?logo=arxiv" height="20" />
 </a><a href="https://huggingface.co/collections/Kangheng/ovr-686646849f9b43daccbe2fe0" target="blank" style="margin-right: 10px;">
 <img alt="HF Model: OVR" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-OVR-fb8740?&logoColor=white" height="20" />
 </a><a href="" target="blank" style="margin-right: 10px;">
@@ -22,7 +28,7 @@ base_model:
 ### Overview
 ![preview](./assets/preview.png)

- The remarkable reasoning capbility of Large Language Models (LLMs) stems from cognitive behaviors that emerge when reinforcing against verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock **advanced visual reasoning**.

 We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive **text-only cold-start fine-tuning**, followed by **multimodal reinforcement learning** (RL) spanning nearly 1,000 steps—surpassing all prior open-source efforts in scale. This pioneering work reveals three fundamental insights:
 
@@ -32,16 +38,53 @@ We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive **text-only

 Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including **95.3%** on MATH500, **51.8%** on MathVision and **54.6%** on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

- ### Model Card

 | **Model** | **Description** | **Download** |
 |:---------:|:---------------:|:------------:|
 | OVR-7B-ColdStart | Intermediate model after massive language-only cold-start fine-tuning | [🤗 OVR-7B-ColdStart](https://huggingface.co/Kangheng/OVR-7B-ColdStart) |
 | OVR-7B-RL | Final model after large-scale multimodal RL training | [🤗 OVR-7B-RL](https://huggingface.co/Kangheng/OVR-7B-RL) |

- ### Performance
- ![performance-text](./assets/performance-text.png)
- ![performance-vision](./assets/performance-vision.png)

 ### Training Dynamics and Performance Evolution
 
@@ -61,7 +104,28 @@ Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art perfo
 <img width="100%" src="assets/performance.png">
 </p>

- ### Model Deployment
 ```bash
 vllm serve Kangheng/OVR-7B-ColdStart --port 8000 --host 0.0.0.0 --tensor-parallel-size 1 --gpu-memory-utilization 0.6
 ```
 
 ---
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
+ language:
+ - en
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
 ---

 ![title](./assets/title.png)

+ [**Paper: Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning**](https://huggingface.co/papers/2507.05255) | [**Project Page**](https://weiyana.github.io/Open-Vision-Reasoner/) | [**Codebase**](https://github.com/linkangheng/Open-Vision-Reasoner)
+
 <div align="center">
+ <a href="https://weiyana.github.io/Open-Vision-Reasoner/" target="_blank" style="margin-right: 10px;">
+ <img alt="Project Page" src="https://img.shields.io/badge/Project%20Page-Open--Vision--Reasoner-blue?logo=book" height="20" />
+ </a>
 <a href="https://github.com/linkangheng/Open-Vision-Reasoner/tree/main/paper/Open-Vision-Reasoner.pdf" target="_blank" style="margin-right: 10px;">
+ <img alt="arXiv" src="https://img.shields.io/badge/Paper-OVR-red?logo=arxiv" height="20" />
 </a><a href="https://huggingface.co/collections/Kangheng/ovr-686646849f9b43daccbe2fe0" target="_blank" style="margin-right: 10px;">
 <img alt="HF Model: OVR" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Model-OVR-fb8740?&logoColor=white" height="20" />
 </a><a href="" target="_blank" style="margin-right: 10px;">
 
 ### Overview
 ![preview](./assets/preview.png)

+ The remarkable reasoning capability of Large Language Models (LLMs) stems from cognitive behaviors that emerge when reinforcing against verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock **advanced visual reasoning**.

 We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive **text-only cold-start fine-tuning**, followed by **multimodal reinforcement learning** (RL) spanning nearly 1,000 steps—surpassing all prior open-source efforts in scale. This pioneering work reveals three fundamental insights:

 Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including **95.3%** on MATH500, **51.8%** on MathVision and **54.6%** on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

+ ### Model Release
+
+ > Models are available in the Hugging Face collection [Open-Vision-Reasoner](https://huggingface.co/collections/Kangheng/ovr-686646849f9b43daccbe2fe0). We release the cold-start model and the final RL-tuned OVR model to facilitate further research.

 | **Model** | **Description** | **Download** |
 |:---------:|:---------------:|:------------:|
 | OVR-7B-ColdStart | Intermediate model after massive language-only cold-start fine-tuning | [🤗 OVR-7B-ColdStart](https://huggingface.co/Kangheng/OVR-7B-ColdStart) |
 | OVR-7B-RL | Final model after large-scale multimodal RL training | [🤗 OVR-7B-RL](https://huggingface.co/Kangheng/OVR-7B-RL) |

+ ### Performance Results
+
+ ### **Language Reasoning**
+
+ Our model demonstrates exceptional language reasoning capabilities. On the challenging AIME 2024 and 2025 benchmarks, it surpasses other 7B open-source models by an average of over 10%, achieving performance comparable to leading 32B models. This superiority extends to general reasoning tasks, with significant gains of +4.6% on MMLU and +10.4% on MMLU-Pro over parameter-matched competitors. These results highlight the effectiveness of our curated, high-quality cold-start training data.
+
+ <p align="center">
+ <img width="95%" src="assets/language_benchmarks.png">
+ </p>
+
+ ### **Visual Reasoning**
+
+ OVR represents a significant breakthrough for 7B-scale models in visual reasoning.
+ It is the first post-trained Qwen2.5-VL-7B model to surpass the 50% threshold on MathVision, while also achieving state-of-the-art performance among 7B models on DynaMath and MathVerse.
+ This strong overall performance is further underscored by a substantial gain on MMMU-Pro (+7.2%) over previous methods.
+ These results demonstrate that reasoning capabilities acquired through language training can transfer effectively to multimodal tasks, yielding notable improvements in visual reasoning performance.
+
+ <p align="center">
+ <img width="95%" src="assets/visual_benchmarks.png">
+ </p>
+
+ ### **Cognitive Behavior Analysis**
+
+ Our analysis centers on four pivotal visual cognitive behaviors. We systematically investigate how these patterns transfer from their linguistic counterparts to the visual modality.
+
+ <p align="center">
+ <img width="45%" src="assets/transfer-a.png">
+ <img width="46%" src="assets/transfer-b.png">
+ </p>
+
+ ### Training Pipeline
+
+ To facilitate efficient cognitive development and cross-modal generalization, we employ the popular "RL with a cold start" paradigm with two sequential training stages:
+
+ **Stage 1: Linguistic Cold Start**
+ The LLM module is supervised fine-tuned on language-only reasoning datasets distilled from DeepSeek-R1, establishing core cognitive behaviors such as backtracking and subgoal decomposition within a purely linguistic setting.
+
+ **Stage 2: Multimodal RL**
+ We apply reinforcement learning with the Open-Reasoner-Zero setting on both text and multimodal tasks using verifiable match rewards. This promotes reasoning generalization and aligns previously learned cognitive patterns with visual contexts, enabling effective cross-modal transfer.
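The verifiable match rewards described for Stage 2 amount to a rule-based check of the model's final answer against a reference. A minimal sketch, assuming the rollout ends with a LaTeX `\boxed{...}` answer; the function names and the normalization rules are illustrative, not OVR's actual implementation:

```python
# Hypothetical sketch of a binary "verifiable match" reward for RL:
# extract the last \boxed{...} answer from a rollout and compare it,
# after light normalization, against the reference answer.
import re


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in the output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def normalize(ans: str) -> str:
    """Illustrative normalization: trim whitespace and a trailing period."""
    return ans.strip().replace(" ", "").rstrip(".")


def match_reward(rollout: str, reference: str) -> float:
    """1.0 for an exact normalized match, 0.0 otherwise (including no answer)."""
    pred = extract_boxed(rollout)
    if pred is None:
        return 0.0
    return 1.0 if normalize(pred) == normalize(reference) else 0.0
```

A binary reward like this is easy to verify at scale, which is what makes large multimodal RL runs with rule-based graders tractable.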
 
 ### Training Dynamics and Performance Evolution

 <img width="100%" src="assets/performance.png">
 </p>

+ ### Model Deployment
 ```bash
 vllm serve Kangheng/OVR-7B-ColdStart --port 8000 --host 0.0.0.0 --tensor-parallel-size 1 --gpu-memory-utilization 0.6
+ ```
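Once the server above is running, vLLM exposes an OpenAI-compatible chat endpoint, so the model can be queried with a plain HTTP request. A minimal sketch using only the standard library; the host/port match the command above, while the image path, question, and temperature are placeholders:

```python
# Sketch: query the vLLM OpenAI-compatible endpoint started above.
# Assumes the server is reachable at localhost:8000; image and prompt
# are placeholders.
import base64
import json
import urllib.request


def build_payload(model: str, question: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.6,
    }


def ask(payload: dict,
        url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_payload("Kangheng/OVR-7B-ColdStart",
                            "Solve the problem in the image step by step.",
                            open("example.png", "rb").read())
    print(ask(payload))
```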
+
+ ### Roadmap
+
+ - [x] `2025-07-04` 🎄: Initial release of OVR models and research paper.
+ - [ ] 📊: Release of OVR training data.
+ - [ ] 🚀: Continuously iterate on models and data to release more powerful versions of OVR. Stay tuned!
+
+ ### Citation
+
+ If you find our work useful for your research, please consider citing our paper:
+ ```bibtex
+ @misc{wei2025openvisionreasonertransferring,
+   title={Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning},
+   author={Yana Wei and Liang Zhao and Jianjian Sun and Kangheng Lin and Jisheng Yin and Jingcheng Hu and Yinmin Zhang and En Yu and Haoran Lv and Zejia Weng and Jia Wang and Chunrui Han and Yuang Peng and Qi Han and Zheng Ge and Xiangyu Zhang and Daxin Jiang and Vishal M. Patel},
+   year={2025},
+   eprint={2507.05255},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2507.05255},
+ }
 ```