File size: 8,789 Bytes
99aecb4
888e11d
 
8553666
 
 
 
 
888e11d
 
99aecb4
8553666
cc03c03
 
 
99aecb4
888e11d
 
14bc2f4
 
 
 
 
 
 
 
99aecb4
 
14bc2f4
99aecb4
8553666
40892ef
c510b41
99aecb4
888e11d
 
99aecb4
 
888e11d
 
 
 
 
 
 
c510b41
 
 
888e11d
 
99aecb4
 
9e2eb13
adaa7d6
 
 
 
 
 
 
 
 
 
8553666
adaa7d6
8ac9519
99aecb4
74a3d7c
 
 
 
 
 
99aecb4
318fc98
586e1f5
318fc98
 
 
 
f921e66
e0f5026
318fc98
e0f5026
318fc98
 
e0f5026
 
 
f921e66
e0f5026
 
 
 
 
 
318fc98
e0f5026
318fc98
e0f5026
318fc98
99aecb4
c66096e
99aecb4
888e11d
 
 
 
bf9410a
888e11d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8553666
 
888e11d
 
 
8553666
 
888e11d
 
 
99aecb4
 
 
6ff4a6e
99aecb4
888e11d
99aecb4
 
 
 
 
c66096e
99aecb4
 
 
 
9178b22
 
 
 
 
 
 
 
99aecb4
cc03c03
99aecb4
c66096e
9dc9bb2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
base_model:
- OpenGVLab/InternVL3-38B
language:
- en
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- Skywork R1V
---

<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="skywork-logo.png" alt="Skywork Logo" width="400">
   <b>
    <span>======================================</span>
    <br/>
     Skywork-R1V3
    <br/>
    <span>======================================</span>
    <br/>
  </b>
</div>


<p align="center">
    <a href="https://huggingface.co/papers/2507.06167"><strong>πŸ“– R1V3 Report</strong></a> |
    <a href="https://github.com/SkyworkAI/Skywork-R1V"><strong>πŸ’» GitHub</strong></a> 
  
</p>
<!-- # Skywork-R1V3 -->



<p align="center">
    <a href="https://github.com/SkyworkAI/Skywork-R1V/stargazers">
        <img src="https://img.shields.io/github/stars/SkyworkAI/Skywork-R1V?style=social" alt="GitHub Stars">
    </a>
    <a href="https://github.com/SkyworkAI/Skywork-R1V/fork">
        <img src="https://img.shields.io/github/forks/SkyworkAI/Skywork-R1V?style=social" alt="GitHub Forks">
    </a>
     <a href="https://github.com/SkyworkAI/Skywork-R1V/blob/main/LICENSE">
        <img src="https://img.shields.io/github/license/SkyworkAI/Skywork-R1V" alt="License">
    </a>
</p>

## 1. Model Introduction

Skywork-R1V3-38B is the latest and most powerful open-source multimodal reasoning model in the Skywork-R1V series. Built on InternVL-38B, it significantly pushes the boundaries of multimodal and cross-disciplinary intelligence. **Mainly through RL algorithm in post-training**, R1V3 boasts enhanced reasoning ability, achieving open-source state-of-the-art (SOTA) performance across numerous multimodal reasoning benchmarks.

## 2. Technical Highlights
Skywork-R1V3 is an advanced, open-source Vision-Language Model (VLM) built on several core innovations:

- **Refined Post-Training RL**: Instead of relying on reasoning pre-training, our fine-grained cold-start finetuning effectively primes the model for Reinforcement Learning (RL), which dramatically enhances its reasoning ability.

- **Essential Connector Module**: We've uncovered the critical role of the connector module in achieving robust cross-modal alignment for strong multimodal reasoning. What's more, Connector-only Finetuning can further boost the model's performance post-RL.

- **Entropy of Critical Reasoning Tokens**: This unique indicator effectively gauges reasoning capability, guiding checkpoint selection during RL training.

These innovations lead to Broad Reasoning Generalization, allowing our RL-powered approach to successfully extend mathematical reasoning to diverse subject areas. Additionally, our work delves into RL-specific explorations like curriculum learning and learning rate strategies, alongside a broader discussion on multimodal reasoning. For more details, refer to our [[πŸ“– R1V3 Report](https://huggingface.co/papers/2507.06167)]Β .
## 3. Evaluation

### 🌟 Key Results
- **MMMU:** 76.0
- **EMMA-Mini(CoT):** 40.3 
- **MMK12:** 78.5 
- **Physics Reasoning:** PhyX-MC-TM (52.8), SeePhys (31.5) 
- **Logic Reasoning:** MME-Reasoning (42.8)  VisuLogic (28.5)
- **Math Benchmarks:** MathVista (77.1), MathVerse (59.6), MathVision (52.6)

<!-- <div align="center">
  <img src="https://huggingface.co/Skywork/Skywork-R1V3-38B/resolve/main/eval.png" width="800">
</div> -->

# Visual-Language Models Benchmark Comparison

| Category       | Benchmark               | Metric  | Skywork-38B | QVQ-72B | InternVL-78B | QwenVL-72B | Claude 3.7 | GPT-4o |
|----------------|-------------------------|---------|------------:|--------:|-------------:|--------:|----------:|---------:|
| **General**    | MMMU (val)              | Acc.    | πŸ† **76.0**   | 70.3    | 72.2         | 70.3    | 75.0      | 70.7   |
|                | EMMA (mini-cot)         | Acc.    | 40.3       | 32.0    | 38.3         | 39.3    |  **56.5**   | 36.0   |
|                | MMMU-pro                | Acc.    | πŸ† **55.4**   | 46.9*   | 48.6         | 51.1    | 50.0      | 54.5   |
|                | MMK12                   | Acc.    | πŸ† **78.5**    | 62.7*   | 67.4*        | 70.5*   | 55.3      | 49.9   |
|                | MMstar                  | Acc.    | 70.6       | 60.8    |  **72.5**     | 70.8    | 68.8      | 65.1   |
|                | MMBench-en-1.1          | Acc.    | 85.7       | 72.6*   | 87.7         |  **88.0** | 82.0      | 84.3   |
|                | HallusionBench          | Acc.    | πŸ† **61.3**   | 55.3*   | 59.1         | 55.2    | 58.3      | 56.2   |
| **Mathematics**| MathVista (mini)        | Acc.    | πŸ† **77.1**       | 71.4    |  72.2      | 74.8    | 66.8      | 62.9   |
|                | MathVerse (vision-only) | Acc.    | πŸ† **59.6**   | 45.1    | 51.0         | 57.6    | 49.9*     | 49.9   |
|                | MathVision              | Acc.    | 52.6       | 35.9    | 43.1         | 38.1    |  58.6   | 31.2   |
|                | WeMath (strict)          | Acc.    |πŸ† **56.5**   | 37.7    | 46.1         | 50.6    | 48.9*     | 50.6   |
| **Logic**      | Visulogic               | Acc.    | πŸ† **28.5**   | 23.5*   | 27.7         | 26.2    | 25.9      | 26.3   |
|                | LogicVista              | Acc.    | 59.7       | 53.8    | 55.9         | 57.1    | 60.6*     |  **64.4** |
|                | MME-reasoning           | Acc.    | πŸ† **42.8**   | 35.2    | 32.1         | 34.1    | 34.1      | 30.2   |
| **Physics**    | PhyX (mc-text-minimal)  | Acc.    | πŸ† **52.8**    | 35.2*   | 40.5         | 44.8    | 41.6      | 43.8   |
|                | SeePhys                 | Acc.    | 31.5       | 22.5    | 19.0*        | 24.2    |  **34.6**   | 21.9   |

πŸ† **Top performer** of Skywork-R1V3 in each benchmark  
[*] indicates results from our evaluation framework.

## 4. Usage

If you need the detailed inference code and evaluation script, please refer to our [GitHub](https://github.com/SkyworkAI/Skywork-R1V).


###  Run the Inference Script
hf inference

```python
import torch
from transformers import AutoModel, AutoTokenizer
from utils import load_image, split_model
import argparse

def main():
    parser = argparse.ArgumentParser(description="Run inference with Skywork-R1V model.")
    parser.add_argument('--model_path', type=str, default='Skywork/Skywork-R1V3-38B', help="Path to the model.")
    parser.add_argument('--image_paths', type=str, nargs='+', required=True, help="Path(s) to the image(s).")
    parser.add_argument('--question', type=str, required=True, help="Question to ask the model.")
    args = parser.parse_args()

    device_map = split_model(args.model_path)
    model = AutoModel.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        load_in_8bit=False,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True,
        device_map=device_map
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True, use_fast=False)

    pixel_values = [load_image(img_path, max_num=12).to(torch.bfloat16).cuda() for img_path in args.image_paths]
    if len(pixel_values) > 1:
        num_patches_list = [img.size(0) for img in pixel_values]
        pixel_values = torch.cat(pixel_values, dim=0)
    else:
        pixel_values = pixel_values[0]
        num_patches_list = None
        
    prompt = "<image>
"*len(args.image_paths) + args.question
    generation_config = dict(max_new_tokens=64000, do_sample=True, temperature=0.6, top_p=0.95, repetition_penalty=1.05)
    response = model.chat(tokenizer, pixel_values, prompt, generation_config, num_patches_list=num_patches_list)

    print(f'User: {args.question}
Assistant: {response}')

if __name__ == '__main__':
    main()

```

vllm inference
```shell
python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH  --max_model_len 32768  --limit-mm-per-prompt "image=20" --tensor-parallel-size $N_GPU --dtype auto  --trust-remote-code

```

---

## 5. Citation
If you use Skywork-R1V in your research, please cite:


```
@misc{shen2025skyworkr1v3technicalreport,
      title={Skywork-R1V3 Technical Report}, 
      author={Wei Shen and Jiangbo Pei and Yi Peng and Xuchen Song and Yang Liu and Jian Peng and Haofeng Sun and Yunzhuo Hao and Peiyu Wang and Jianhao Zhang and Yahui Zhou},
      year={2025},
      eprint={2507.06167},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.06167}, 
}
```

## 6.License
This project is released under the MIT License. This project uses the [InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B) as the base model, which is licensed under the MIT License.