shawn0wang committed on
Commit
c4211fa
·
verified ·
1 Parent(s): 119b0db

Delete README_ZH.md

Files changed (1)
  1. README_ZH.md +0 -293
README_ZH.md DELETED
@@ -1,293 +0,0 @@
1
- # Skywork-R1V
2
-
3
- <div align="center">
4
- <img src="logo.jpeg" alt="Introduction Image" width="400" height="400">
5
- </div>
6
-
7
-
8
- ## 1. Introduction
9
-
10
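- We introduce Skywork-R1V, a multimodal reasoning model that extends R1-series text models to the visual modality through a near-lossless transfer method. Skywork-R1V uses a lightweight visual projector to achieve seamless multimodal adaptation without retraining either the base language model or the vision encoder. To improve visual-text alignment, we developed a hybrid optimization strategy that combines iterative supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves cross-modal integration. In addition, we devised an adaptive-length Chain-of-Thought distillation method for generating reasoning data; it dynamically optimizes the length of reasoning chains to improve inference efficiency and avoid overthinking. The model achieves state-of-the-art results on major multimodal reasoning benchmarks, scoring 69.0 on MMMU and 67.5 on MathVista, comparable to leading closed-source models such as Gemini 2.0 and Kimi-k1.5. It also retains strong text reasoning ability, scoring 72.6 on AIME and 94.3 on MATH500.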
11
-
12
- ## 2. Model Overview
13
-
14
- **Architecture:**
15
-
16
- Skywork-R1V uses a modular architecture that combines vision and language capabilities:
17
- - **Vision encoder:** A Vision Transformer (ViT) serves as the visual backbone and processes image inputs.
18
- - **Visual projector:** A lightweight MLP adapter that bridges the vision and language components.
19
- - **Language model:** R1-distilled-Qwen-32B serves as the reasoning-capable language backbone.
20
-
21
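- The components are connected as vision encoder → MLP adapter → language model, where the MLP adapter aligns the output space of the vision encoder with the input space of the language model. This design transfers text reasoning ability to the multimodal domain efficiently, without large-scale retraining of the vision encoder or the language model.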
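The vision encoder → MLP adapter → language model connection can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the Skywork-R1V code: the class name `MLPProjector`, the two-layer shape, the GELU activation, the toy dimensions, and the choice to place visual tokens before text embeddings.

```python
import math
import random


def gelu(x):
    """Exact GELU activation (assumed; the README does not name the activation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


class MLPProjector:
    """Hypothetical two-layer MLP mapping vision features into the LM embedding space."""

    def __init__(self, vision_dim, hidden_dim, lm_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, vision_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(lm_dim, hidden_dim), [0.0] * lm_dim

    def __call__(self, patch_features):
        # Project every image-patch feature vector independently.
        projected = []
        for patch in patch_features:
            hidden = [gelu(h) for h in linear(patch, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


def build_lm_inputs(visual_tokens, text_embeddings):
    """Concatenate projected visual tokens with text embeddings (ordering is an assumption)."""
    return visual_tokens + text_embeddings


# Toy usage: 3 image patches of width 8, projected to an LM width of 4.
projector = MLPProjector(vision_dim=8, hidden_dim=16, lm_dim=4)
visual_tokens = projector([[0.1] * 8 for _ in range(3)])
sequence = build_lm_inputs(visual_tokens, [[0.0] * 4 for _ in range(5)])
```

In this sketch only the projector holds parameters of its own, which mirrors the stated goal of adapting to the visual modality without retraining the vision encoder or the language model.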
22
-
23
- **Key Designs**
24
- - **Advanced multimodal reasoning**
25
- Excels at complex reasoning across text and visual modalities.
26
- - **Iterative training strategy**
27
- Uses iterative supervised fine-tuning and GRPO to refine model alignment and performance.
28
- - **Adaptive-length Chain-of-Thought**
29
- Dynamically adjusts reasoning length to improve inference efficiency and accuracy.
30
- - **Scalable performance**
31
- Performs on par with proprietary models on mathematics, coding, and multimodal tasks.
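The GRPO component named in the training strategy relies on comparing several sampled responses to the same prompt. A minimal sketch of that group-relative advantage step follows. It assumes the commonly described form, where each reward is normalized by the mean and standard deviation of its group; the function name, the use of the population standard deviation, and the `eps` term are illustrative assumptions, and the README does not describe the reward function or any training hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses sampled for the same prompt.

    rewards: one scalar reward per sampled response in a group.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: four sampled answers, two judged correct (reward 1.0) and two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.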
32
-
33
- ## 3. Evaluation
34
-
35
- <div align="center">
36
- <img src="eval.jpeg" width="600" height="200" alt="skywork_r1v_eval" />
37
- </div>
38
-
39
- <div align="center">
40
- <b>Evaluation results of LLMs and VLMs</b>
41
- </div>
42
- <table>
43
- <thead>
44
- <tr>
45
- <th></th>
46
- <th align="center"><strong>Vision</strong></th>
47
- <th align="center" colspan="3"><strong>Reasoning</strong></th>
48
- <th align="center" colspan="2"><strong>Vision</strong></th>
49
- </tr>
50
- <tr>
51
- <th></th>
52
- <th></th>
53
- <th align="center"><strong>MATH-500</strong></th>
54
- <th align="center"><strong>AIME 2024</strong></th>
55
- <th align="center"><strong>GPQA</strong></th>
56
- <th align="center"><strong>MathVista(mini)</strong></th>
57
- <th align="center"><strong>MMMU(Val)</strong></th>
58
- </tr>
59
- <tr>
60
- <th></th>
61
- <th></th>
62
- <th align="center">pass@1</th>
63
- <th align="center">pass@1</th>
64
- <th align="center">pass@1</th>
65
- <th align="center">pass@1</th>
66
- <th align="center">pass@1</th>
67
- </tr>
68
- </thead>
69
- <tbody>
70
- <tr>
71
- <td>Qwen2.5-72B-Instruct</td>
72
- <td align="center">❌</td>
73
- <td align="center">82.6</td>
74
- <td align="center">23.3</td>
75
- <td align="center">49.0</td>
76
- <td align="center">-</td>
77
- <td align="center">-</td>
78
- </tr>
79
- <tr>
80
- <td>Deepseek V3</td>
81
- <td align="center">❌</td>
82
- <td align="center">90.2</td>
83
- <td align="center">39.2</td>
84
- <td align="center">59.1</td>
85
- <td align="center">-</td>
86
- <td align="center">-</td>
87
- </tr>
88
- <tr>
89
- <td>Deepseek R1</td>
90
- <td align="center">❌</td>
91
- <td align="center">97.3</td>
92
- <td align="center">79.8</td>
93
- <td align="center">71.5</td>
94
- <td align="center">-</td>
95
- <td align="center">-</td>
96
- </tr>
97
- <tr>
98
- <td>Claude 3.5 Sonnet</td>
99
- <td align="center">✅</td>
100
- <td align="center">78.3</td>
101
- <td align="center">16.0</td>
102
- <td align="center">65.0</td>
103
- <td align="center">67.7</td>
104
- <td align="center">68.3</td>
105
- </tr>
106
- <tr>
107
- <td>GPT-4o</td>
108
- <td align="center">✅</td>
109
- <td align="center">76.6</td>
110
- <td align="center">9.3</td>
111
- <td align="center">53.6</td>
112
- <td align="center">63.8</td>
113
- <td align="center">69.1</td>
114
- </tr>
115
- <tr>
116
- <td>Kimi k1.5</td>
117
- <td align="center">✅</td>
118
- <td align="center">96.2</td>
119
- <td align="center">77.5</td>
120
- <td align="center">-</td>
121
- <td align="center">74.9</td>
122
- <td align="center">70.0</td>
123
- </tr>
124
- <tr>
125
- <td>Qwen2.5-VL-72B-Instruct</td>
126
- <td align="center">✅</td>
127
- <td align="center">-</td>
128
- <td align="center">-</td>
129
- <td align="center">-</td>
130
- <td align="center">74.8</td>
131
- <td align="center">70.2</td>
132
- </tr>
133
- <tr>
134
- <td>LLaVA-Onevision-72B</td>
135
- <td align="center">✅</td>
136
- <td align="center">-</td>
137
- <td align="center">-</td>
138
- <td align="center">-</td>
139
- <td align="center">67.5</td>
140
- <td align="center">56.8</td>
141
- </tr>
142
- <tr>
143
- <td>InternVL2-Llama3-76B</td>
144
- <td align="center">✅</td>
145
- <td align="center">-</td>
146
- <td align="center">-</td>
147
- <td align="center">-</td>
148
- <td align="center">65.5</td>
149
- <td align="center">58.3</td>
150
- </tr>
151
- <tr>
152
- <td>InternVL2.5-78B</td>
153
- <td align="center">✅</td>
154
- <td align="center">-</td>
155
- <td align="center">-</td>
156
- <td align="center">-</td>
157
- <td align="center">72.3</td>
158
- <td align="center">70.1</td>
159
- </tr>
160
- <tr>
161
- <td>Skywork-R1V-38B</td>
162
- <td align="center">✅</td>
163
- <td align="center">94.0</td>
164
- <td align="center">72.0</td>
165
- <td align="center">61.6</td>
166
- <td align="center">67.5</td>
167
- <td align="center">69.0</td>
168
- </tr>
169
- </tbody>
170
- </table>
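All scores in the table above are reported as pass@1. As a hedged illustration of what that metric measures, the sketch below averages per-problem correctness over sampled answers and reports a percentage. The function name and input layout are hypothetical, and the README does not state how many samples per problem or which decoding settings were used.

```python
def pass_at_1(samples_correct):
    """Compute pass@1 as a percentage.

    samples_correct: one inner list per problem, holding a bool for each
    sampled answer (True if that answer was judged correct).
    """
    if not samples_correct:
        raise ValueError("at least one problem is required")
    # Fraction of correct samples per problem, then the mean over problems.
    per_problem = [sum(samples) / len(samples) for samples in samples_correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy usage: four problems with a single sampled answer each, three correct.
score = pass_at_1([[True], [False], [True], [True]])
```

With a single sample per problem this reduces to plain accuracy; with several samples it estimates the expected accuracy of one draw.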
171
-
172
- <div align="center">
173
- <b>Comparison with Larger-Scale Open-Source and Closed-Source Models</b>
174
- </div>
175
-
176
- <table align="center">
177
- <thead>
178
- <tr>
179
- <th></th>
180
- <th align="center"><strong>Benchmark</strong></th>
181
- <th align="center"><strong>LLM</strong></th>
182
- <th align="center" colspan="4"><strong>VLM</strong></th>
183
- </tr>
184
- <tr>
185
- <th></th>
186
- <th></th>
187
- <th align="center"><strong>QwQ-32B-Preview</strong></th>
188
- <th align="center"><strong>InternVL-2.5-38B</strong></th>
189
- <th align="center"><strong>VILA 1.5-40B</strong></th>
190
- <th align="center"><strong>InternVL2-40B</strong></th>
191
- <th align="center"><strong>Skywork-R1V-38B</strong></th>
192
- </tr>
193
- </thead>
194
- <tbody>
195
- <tr>
196
- <td rowspan="3">Reasoning</td>
197
- <td>MATH-500</td>
198
- <td align="center">90.6</td>
199
- <td align="center">-</td>
200
- <td align="center">-</td>
201
- <td align="center">-</td>
202
- <td align="center"><strong>94.0</strong></td>
203
- </tr>
204
- <tr>
205
- <td>AIME 2024</td>
206
- <td align="center">50.0</td>
207
- <td align="center">-</td>
208
- <td align="center">-</td>
209
- <td align="center">-</td>
210
- <td align="center"><strong>72.0</strong></td>
211
- </tr>
212
- <tr>
213
- <td>GPQA</td>
214
- <td align="center">65.2</td>
215
- <td align="center">-</td>
216
- <td align="center">-</td>
217
- <td align="center">-</td>
218
- <td align="center">61.6</td>
219
- </tr>
220
- <tr>
221
- <td rowspan="3">Vision</td>
222
- <td>MathVista(mini)</td>
223
- <td align="center">-</td>
224
- <td align="center">71.9</td>
225
- <td align="center">49.5</td>
226
- <td align="center">63.7</td>
227
- <td align="center">67.5</td>
228
- </tr>
229
- <tr>
230
- <td>MMMU(Val)</td>
231
- <td align="center">-</td>
232
- <td align="center">63.9</td>
233
- <td align="center">55.1</td>
234
- <td align="center">55.2</td>
235
- <td align="center">69.0</td>
236
- </tr>
237
- <tr>
238
- <td>CSVQA</td>
239
- <td align="center">-</td>
240
- <td align="center"></td>
241
- <td align="center"></td>
242
- <td align="center"></td>
243
- <td align="center"></td>
244
- </tr>
245
- </tbody>
246
- </table>
247
-
248
- ## 4. Skywork-R1V Family
249
-
250
- | Model Name | Vision Encoder | Language Model | HF Link |
251
- | ---------------------- | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------ |
252
- | Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
253
- | Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |
254
-
255
- ---
256
-
257
- ## 5. Quick Start
258
-
259
- **Example steps:**
260
-
261
- 1. **Clone the GitHub repository**
262
- ```bash
263
- git clone https://github.com/your-repo
264
- ```
265
-
266
- 2. **Install dependencies**
267
- ```bash
268
- cd your-repo
269
- pip install -r requirements.txt
270
- ```
271
-
272
- 3. **Run the example code**
273
- ```bash
274
- python demo.py
275
- ```
276
-
277
- ---
278
-
279
- ## 6. Citation
280
- If you use Skywork-R1V in your research, please cite:
281
-
282
- ```
283
- @article{skywork2025r1v,
284
- title = {Skywork R1V: Bridging Vision and Language for Advanced Multimodal Reasoning},
285
- author = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
286
- year = {2025},
287
- journal = {arXiv preprint arXiv:XXXX.XXXXX},
288
- url = {https://github.com/skywork-ai/Skywork-R1V}
289
- }
290
- ```
291
-
292
- *This project is released under an open-source license.*
293
-