Delete README_ZH.md
README_ZH.md +0 -293
README_ZH.md
DELETED
|
@@ -1,293 +0,0 @@
|
|
| 1 |
-
# Skywork-R1V
|
| 2 |
-
|
| 3 |
-
<div align="center">
|
| 4 |
-
<img src="logo.jpeg" alt="Introduction Image" width="400" height="400">
|
| 5 |
-
</div>
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
## 1. Introduction
|
| 9 |
-
|
| 10 |
-
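We introduce Skywork-R1V, a multimodal reasoning model that extends R1-series text models to the visual modality through a near-lossless transfer method. Skywork-R1V uses a lightweight visual projector to achieve seamless multimodal adaptation without retraining either the base language model or the vision encoder. To improve visual-text alignment, we developed a hybrid optimization strategy that combines iterative supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves cross-modal integration. In addition, we devised an adaptive-length Chain-of-Thought distillation method for generating reasoning data; it dynamically optimizes the length of the reasoning chain to improve inference efficiency and avoid overthinking. The model reaches a state-of-the-art level on major multimodal reasoning benchmarks, scoring 69.0 on MMMU and 67.5 on MathVista, comparable to leading closed-source models such as Gemini 2.0 and Kimi-k1.5. It also retains strong textual reasoning ability, scoring 72.0 on AIME and 94.0 on MATH500.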
|
| 11 |
-
|
| 12 |
-
## 2. Model Overview
|
| 13 |
-
|
| 14 |
-
**Architecture:**
|
| 15 |
-
|
| 16 |
-
Skywork-R1V uses a modular architecture that combines vision and language capabilities:
|
| 17 |
-
- **Vision encoder:** A Vision Transformer (ViT) serves as the visual backbone and processes image inputs.
|
| 18 |
-
- **Visual projector:** A lightweight MLP adapter that bridges the vision and language components.
|
| 19 |
-
- **Language model:** R1-distilled-Qwen-32B serves as the reasoning-capable language model backbone.
|
| 20 |
-
|
| 21 |
-
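The components are connected as vision encoder → MLP adapter → language model, where the MLP adapter aligns the output space of the vision encoder with the input space of the language model. This design transfers text reasoning ability to the multimodal domain efficiently, without large-scale retraining of the vision encoder or the language model.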
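
The vision encoder → MLP adapter → language model data flow can be illustrated with a minimal, dependency-free sketch. It is not the Skywork-R1V implementation: the two-layer shape, the GELU activation, the toy dimensions, and the names `MLPProjector`, `init_linear`, and `linear` are all assumptions made for illustration. It only shows how per-patch vision features could be mapped into vectors of the width a language model expects.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute y = W x + b for one vector; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MLPProjector:
    """Hypothetical two-layer adapter: vision_dim -> hidden_dim -> llm_dim."""

    def __init__(self, vision_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, patch_features):
        """Map each patch feature vector to one pseudo-token for the language model."""
        tokens = []
        for feat in patch_features:
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            tokens.append(linear(hidden, self.w2, self.b2))
        return tokens


# Toy sizes only; real encoders and language models use far larger widths.
projector = MLPProjector(vision_dim=8, hidden_dim=16, llm_dim=12)
patches = [[0.1 * (i + j) for j in range(8)] for i in range(4)]  # 4 fake patch features
visual_tokens = projector(patches)
print(len(visual_tokens), len(visual_tokens[0]))  # prints "4 12"
```

In this sketch only the projector holds trainable parameters, which mirrors the point above that the vision encoder and the language model do not need large-scale retraining.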
|
| 22 |
-
|
| 23 |
-
**Key Design**
|
| 24 |
-
- **Advanced multimodal reasoning**
|
| 25 |
-
Handles complex reasoning across text and visual modalities.
|
| 26 |
-
- **Iterative training strategy**
|
| 27 |
-
Uses iterative supervised fine-tuning (SFT) and GRPO to refine model alignment and performance.
|
| 28 |
-
- **Adaptive-length Chain-of-Thought**
|
| 29 |
-
Dynamically adjusts reasoning length to improve inference efficiency and accuracy.
|
| 30 |
-
- **Scalable performance**
|
| 31 |
-
Performs on par with proprietary models on math, coding, and multimodal tasks.
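
The GRPO step in the training strategy scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. The sketch below shows only that group-relative normalization. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; how Skywork-R1V defines its rewards is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is a hypothetical guard against division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive values, incorrect ones negative
```

A group in which all responses earn the same reward yields zero advantages, so such a prompt contributes no policy gradient signal.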
|
| 32 |
-
|
| 33 |
-
## 3. Evaluation
|
| 34 |
-
|
| 35 |
-
<div align="center">
|
| 36 |
-
<img src="eval.jpeg" width="600" height="200" alt="skywork_r1v_eval" />
|
| 37 |
-
</div>
|
| 38 |
-
|
| 39 |
-
<div align="center">
|
| 40 |
-
<b>Evaluation results of LLMs and VLMs</b>
|
| 41 |
-
</div>
|
| 42 |
-
<table>
|
| 43 |
-
<thead>
|
| 44 |
-
<tr>
|
| 45 |
-
<th></th>
|
| 46 |
-
<th align="center"><strong>Vision</strong></th>
|
| 47 |
-
<th align="center" colspan="3"><strong>Reasoning</strong></th>
|
| 48 |
-
<th align="center" colspan="2"><strong>Vision</strong></th>
|
| 49 |
-
</tr>
|
| 50 |
-
<tr>
|
| 51 |
-
<th></th>
|
| 52 |
-
<th></th>
|
| 53 |
-
<th align="center"><strong>MATH-500</strong></th>
|
| 54 |
-
<th align="center"><strong>AIME 2024</strong></th>
|
| 55 |
-
<th align="center"><strong>GPQA</strong></th>
|
| 56 |
-
<th align="center"><strong>MathVista(mini)</strong></th>
|
| 57 |
-
<th align="center"><strong>MMMU(Val)</strong></th>
|
| 58 |
-
</tr>
|
| 59 |
-
<tr>
|
| 60 |
-
<th></th>
|
| 61 |
-
<th></th>
|
| 62 |
-
<th align="center">pass@1</th>
|
| 63 |
-
<th align="center">pass@1</th>
|
| 64 |
-
<th align="center">pass@1</th>
|
| 65 |
-
<th align="center">pass@1</th>
|
| 66 |
-
<th align="center">pass@1</th>
|
| 67 |
-
</tr>
|
| 68 |
-
</thead>
|
| 69 |
-
<tbody>
|
| 70 |
-
<tr>
|
| 71 |
-
<td>Qwen2.5-72B-Instruct</td>
|
| 72 |
-
<td align="center">❌</td>
|
| 73 |
-
<td align="center">82.6</td>
|
| 74 |
-
<td align="center">23.3</td>
|
| 75 |
-
<td align="center">49.0</td>
|
| 76 |
-
<td align="center">-</td>
|
| 77 |
-
<td align="center">-</td>
|
| 78 |
-
</tr>
|
| 79 |
-
<tr>
|
| 80 |
-
<td>Deepseek V3</td>
|
| 81 |
-
<td align="center">❌</td>
|
| 82 |
-
<td align="center">90.2</td>
|
| 83 |
-
<td align="center">39.2</td>
|
| 84 |
-
<td align="center">59.1</td>
|
| 85 |
-
<td align="center">-</td>
|
| 86 |
-
<td align="center">-</td>
|
| 87 |
-
</tr>
|
| 88 |
-
<tr>
|
| 89 |
-
<td>Deepseek R1</td>
|
| 90 |
-
<td align="center">❌</td>
|
| 91 |
-
<td align="center">97.3</td>
|
| 92 |
-
<td align="center">79.8</td>
|
| 93 |
-
<td align="center">71.5</td>
|
| 94 |
-
<td align="center">-</td>
|
| 95 |
-
<td align="center">-</td>
|
| 96 |
-
</tr>
|
| 97 |
-
<tr>
|
| 98 |
-
<td>Claude 3.5 Sonnet</td>
|
| 99 |
-
<td align="center">✅</td>
|
| 100 |
-
<td align="center">78.3</td>
|
| 101 |
-
<td align="center">16.0</td>
|
| 102 |
-
<td align="center">65.0</td>
|
| 103 |
-
<td align="center">67.7</td>
|
| 104 |
-
<td align="center">68.3</td>
|
| 105 |
-
</tr>
|
| 106 |
-
<tr>
|
| 107 |
-
<td>GPT-4o</td>
|
| 108 |
-
<td align="center">✅</td>
|
| 109 |
-
<td align="center">76.6</td>
|
| 110 |
-
<td align="center">9.3</td>
|
| 111 |
-
<td align="center">53.6</td>
|
| 112 |
-
<td align="center">63.8</td>
|
| 113 |
-
<td align="center">69.1</td>
|
| 114 |
-
</tr>
|
| 115 |
-
<tr>
|
| 116 |
-
<td>Kimi k1.5</td>
|
| 117 |
-
<td align="center">✅</td>
|
| 118 |
-
<td align="center">96.2</td>
|
| 119 |
-
<td align="center">77.5</td>
|
| 120 |
-
<td align="center">-</td>
|
| 121 |
-
<td align="center">74.9</td>
|
| 122 |
-
<td align="center">70.0</td>
|
| 123 |
-
</tr>
|
| 124 |
-
<tr>
|
| 125 |
-
<td>Qwen2.5-VL-72B-Instruct</td>
|
| 126 |
-
<td align="center">✅</td>
|
| 127 |
-
<td align="center">-</td>
|
| 128 |
-
<td align="center">-</td>
|
| 129 |
-
<td align="center">-</td>
|
| 130 |
-
<td align="center">74.8</td>
|
| 131 |
-
<td align="center">70.2</td>
|
| 132 |
-
</tr>
|
| 133 |
-
<tr>
|
| 134 |
-
<td>LLaVA-Onevision-72B</td>
|
| 135 |
-
<td align="center">✅</td>
|
| 136 |
-
<td align="center">-</td>
|
| 137 |
-
<td align="center">-</td>
|
| 138 |
-
<td align="center">-</td>
|
| 139 |
-
<td align="center">67.5</td>
|
| 140 |
-
<td align="center">56.8</td>
|
| 141 |
-
</tr>
|
| 142 |
-
<tr>
|
| 143 |
-
<td>InternVL2-Llama3-76B</td>
|
| 144 |
-
<td align="center">✅</td>
|
| 145 |
-
<td align="center">-</td>
|
| 146 |
-
<td align="center">-</td>
|
| 147 |
-
<td align="center">-</td>
|
| 148 |
-
<td align="center">65.5</td>
|
| 149 |
-
<td align="center">58.3</td>
|
| 150 |
-
</tr>
|
| 151 |
-
<tr>
|
| 152 |
-
<td>InternVL2.5-78B</td>
|
| 153 |
-
<td align="center">✅</td>
|
| 154 |
-
<td align="center">-</td>
|
| 155 |
-
<td align="center">-</td>
|
| 156 |
-
<td align="center">-</td>
|
| 157 |
-
<td align="center">72.3</td>
|
| 158 |
-
<td align="center">70.1</td>
|
| 159 |
-
</tr>
|
| 160 |
-
<tr>
|
| 161 |
-
<td>Skywork-R1V-38B</td>
|
| 162 |
-
<td align="center">✅</td>
|
| 163 |
-
<td align="center">94.0</td>
|
| 164 |
-
<td align="center">72.0</td>
|
| 165 |
-
<td align="center">61.6</td>
|
| 166 |
-
<td align="center">67.5</td>
|
| 167 |
-
<td align="center">69.0</td>
|
| 168 |
-
</tr>
|
| 169 |
-
</tbody>
|
| 170 |
-
</table>
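
Every benchmark column in the table above is reported as pass@1. As a rough illustration of that metric, the sketch below averages per-problem accuracy over one or more sampled answers. The function name `pass_at_1` and the sample data are hypothetical, and the number of samples per problem used for the reported scores is not stated in this document.

```python
def pass_at_1(per_problem_correct):
    """Return pass@1 as a percentage.

    `per_problem_correct` holds one list of booleans per problem, one boolean
    per sampled answer. For k = 1 the pass@k estimate of a problem reduces to
    the fraction of its samples that are correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in per_problem_correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Three problems with two sampled answers each (made-up data).
results = [[True, True], [True, False], [False, False]]
score = pass_at_1(results)
print(score)  # prints 50.0
```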
|
| 171 |
-
|
| 172 |
-
<div align="center">
|
| 173 |
-
<b>Comparison with Larger-Scale Open-Source and Closed-Source Models</b>
|
| 174 |
-
</div>
|
| 175 |
-
|
| 176 |
-
<table align="center">
|
| 177 |
-
<thead>
|
| 178 |
-
<tr>
|
| 179 |
-
<th></th>
|
| 180 |
-
<th align="center"><strong>Benchmark</strong></th>
|
| 181 |
-
<th align="center"><strong>LLM</strong></th>
|
| 182 |
-
<th align="center" colspan="4"><strong>VLM</strong></th>
|
| 183 |
-
</tr>
|
| 184 |
-
<tr>
|
| 185 |
-
<th></th>
|
| 186 |
-
<th></th>
|
| 187 |
-
<th align="center"><strong>QwQ-32B-Preview</strong></th>
|
| 188 |
-
<th align="center"><strong>InternVL-2.5-38B</strong></th>
|
| 189 |
-
<th align="center"><strong>VILA 1.5-40B</strong></th>
|
| 190 |
-
<th align="center"><strong>InternVL2-40B</strong></th>
|
| 191 |
-
<th align="center"><strong>Skywork-R1V-38B</strong></th>
|
| 192 |
-
</tr>
|
| 193 |
-
</thead>
|
| 194 |
-
<tbody>
|
| 195 |
-
<tr>
|
| 196 |
-
<td rowspan="3">Reasoning</td>
|
| 197 |
-
<td>MATH-500</td>
|
| 198 |
-
<td align="center">90.6</td>
|
| 199 |
-
<td align="center">-</td>
|
| 200 |
-
<td align="center">-</td>
|
| 201 |
-
<td align="center">-</td>
|
| 202 |
-
<td align="center"><strong>94.0</strong></td>
|
| 203 |
-
</tr>
|
| 204 |
-
<tr>
|
| 205 |
-
<td>AIME 2024</td>
|
| 206 |
-
<td align="center">50.0</td>
|
| 207 |
-
<td align="center">-</td>
|
| 208 |
-
<td align="center">-</td>
|
| 209 |
-
<td align="center">-</td>
|
| 210 |
-
<td align="center"><strong>72.0</strong></td>
|
| 211 |
-
</tr>
|
| 212 |
-
<tr>
|
| 213 |
-
<td>GPQA</td>
|
| 214 |
-
<td align="center">65.2</td>
|
| 215 |
-
<td align="center">-</td>
|
| 216 |
-
<td align="center">-</td>
|
| 217 |
-
<td align="center">-</td>
|
| 218 |
-
<td align="center">61.6</td>
|
| 219 |
-
</tr>
|
| 220 |
-
<tr>
|
| 221 |
-
<td rowspan="3">Vision</td>
|
| 222 |
-
<td>MathVista(mini)</td>
|
| 223 |
-
<td align="center">-</td>
|
| 224 |
-
<td align="center">71.9</td>
|
| 225 |
-
<td align="center">49.5</td>
|
| 226 |
-
<td align="center">63.7</td>
|
| 227 |
-
<td align="center">67.5</td>
|
| 228 |
-
</tr>
|
| 229 |
-
<tr>
|
| 230 |
-
<td>MMMU(Val)</td>
|
| 231 |
-
<td align="center">-</td>
|
| 232 |
-
<td align="center">63.9</td>
|
| 233 |
-
<td align="center">55.1</td>
|
| 234 |
-
<td align="center">55.2</td>
|
| 235 |
-
<td align="center">69.0</td>
|
| 236 |
-
</tr>
|
| 237 |
-
<tr>
|
| 238 |
-
<td>CSVQA</td>
|
| 239 |
-
<td align="center">-</td>
|
| 240 |
-
<td align="center"></td>
|
| 241 |
-
<td align="center"></td>
|
| 242 |
-
<td align="center"></td>
|
| 243 |
-
<td align="center"></td>
|
| 244 |
-
</tr>
|
| 245 |
-
</tbody>
|
| 246 |
-
</table>
|
| 247 |
-
|
| 248 |
-
## 4. Skywork-R1V Family
|
| 249 |
-
|
| 250 |
-
| Model Name | Vision Encoder | Language Model | HF Link |
|
| 251 |
-
| ---------------------- | -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | ------------ |
|
| 252 |
-
| Skywork-R1V-38B | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | [🤗 Link](#) |
|
| 253 |
-
| Skywork-R1V-38B-qwq | [InternViT-6B-448px-V2_5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) | [Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | - |
|
| 254 |
-
|
| 255 |
-
---
|
| 256 |
-
|
| 257 |
-
## 5. Quick Start
|
| 258 |
-
|
| 259 |
-
**Example steps:**
|
| 260 |
-
|
| 261 |
-
1. **Clone the GitHub repository**
|
| 262 |
-
```bash
|
| 263 |
-
git clone https://github.com/your-repo
|
| 264 |
-
```
|
| 265 |
-
|
| 266 |
-
2. **Install dependencies**
|
| 267 |
-
```bash
|
| 268 |
-
cd your-repo
|
| 269 |
-
pip install -r requirements.txt
|
| 270 |
-
```
|
| 271 |
-
|
| 272 |
-
3. **Run the example code**
|
| 273 |
-
```bash
|
| 274 |
-
python demo.py
|
| 275 |
-
```
|
| 276 |
-
|
| 277 |
-
---
|
| 278 |
-
|
| 279 |
-
## 6. Citation
|
| 280 |
-
If you use Skywork-R1V in your research, please cite:
|
| 281 |
-
|
| 282 |
-
```
|
| 283 |
-
@article{skywork2025r1v,
|
| 284 |
-
title = {Skywork R1V: Bridging Vision and Language for Advanced Multimodal Reasoning},
|
| 285 |
-
author = {Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
|
| 286 |
-
year = {2025},
|
| 287 |
-
journal = {arXiv preprint arXiv:XXXX.XXXXX},
|
| 288 |
-
url = {https://github.com/skywork-ai/Skywork-R1V}
|
| 289 |
-
}
|
| 290 |
-
```
|
| 291 |
-
|
| 292 |
-
*This project is released under an open-source license.*
|
| 293 |
-
|