---
base_model:
- OpenGVLab/InternVL3-38B
language:
- en
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- Skywork R1V
---
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
<div align="center">
<img src="skywork-logo.png" alt="Skywork Logo" width="400">
<b>
<span>======================================</span>
<br/>
Skywork-R1V3
<br/>
<span>======================================</span>
<br/>
</b>
</div>
<p align="center">
<a href="https://huggingface.co/papers/2507.06167"><strong>📖 R1V3 Report</strong></a> |
<a href="https://github.com/SkyworkAI/Skywork-R1V"><strong>💻 GitHub</strong></a>
</p>
<!-- # Skywork-R1V3 -->
<p align="center">
<a href="https://github.com/SkyworkAI/Skywork-R1V/stargazers">
<img src="https://img.shields.io/github/stars/SkyworkAI/Skywork-R1V?style=social" alt="GitHub Stars">
</a>
<a href="https://github.com/SkyworkAI/Skywork-R1V/fork">
<img src="https://img.shields.io/github/forks/SkyworkAI/Skywork-R1V?style=social" alt="GitHub Forks">
</a>
<a href="https://github.com/SkyworkAI/Skywork-R1V/blob/main/LICENSE">
<img src="https://img.shields.io/github/license/SkyworkAI/Skywork-R1V" alt="License">
</a>
</p>
## 1. Model Introduction
Skywork-R1V3-38B is the latest and most powerful open-source multimodal reasoning model in the Skywork-R1V series. Built on InternVL3-38B, it significantly pushes the boundaries of multimodal and cross-disciplinary intelligence. **Mainly through an RL-based post-training algorithm**, R1V3 achieves enhanced reasoning ability and open-source state-of-the-art (SOTA) performance across numerous multimodal reasoning benchmarks.
## 2. Technical Highlights
Skywork-R1V3 is an advanced, open-source Vision-Language Model (VLM) built on several core innovations:
- **Refined Post-Training RL**: Instead of relying on reasoning pre-training, our fine-grained cold-start finetuning effectively primes the model for Reinforcement Learning (RL), which dramatically enhances its reasoning ability.
- **Essential Connector Module**: We've uncovered the critical role of the connector module in achieving robust cross-modal alignment for strong multimodal reasoning. What's more, Connector-only Finetuning can further boost the model's performance post-RL.
- **Entropy of Critical Reasoning Tokens**: This unique indicator effectively gauges reasoning capability, guiding checkpoint selection during RL training.
These innovations enable broad reasoning generalization, allowing our RL-powered approach to successfully extend mathematical reasoning to diverse subject areas. Additionally, our work delves into RL-specific explorations such as curriculum learning and learning-rate strategies, alongside a broader discussion of multimodal reasoning. For more details, refer to our [📖 R1V3 Report](https://huggingface.co/papers/2507.06167).
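The checkpoint-selection signal in the third highlight can be illustrated with a minimal sketch. Note that the helper names here are hypothetical and the rule for deciding which positions count as "critical" reasoning tokens follows the R1V3 report, not this sketch; it only shows the entropy computation itself, assuming per-position next-token probability distributions are available.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_critical_entropy(step_probs, critical_positions):
    """Average entropy over the flagged reasoning-token positions.

    A value like this could be tracked across RL checkpoints and used
    to rank them, as the report describes for checkpoint selection.
    """
    values = [token_entropy(step_probs[i]) for i in critical_positions]
    return sum(values) / len(values)
```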
## 3. Evaluation
### 🌟 Key Results
- **MMMU:** 76.0
- **EMMA-Mini(CoT):** 40.3
- **MMK12:** 78.5
- **Physics Reasoning:** PhyX-MC-TM (52.8), SeePhys (31.5)
- **Logic Reasoning:** MME-Reasoning (42.8), VisuLogic (28.5)
- **Math Benchmarks:** MathVista (77.1), MathVerse (59.6), MathVision (52.6)
<!-- <div align="center">
<img src="https://huggingface.co/Skywork/Skywork-R1V3-38B/resolve/main/eval.png" width="800">
</div> -->
### Vision-Language Model Benchmark Comparison
| Category | Benchmark | Metric | Skywork-R1V3-38B | QVQ-72B | InternVL-78B | QwenVL-72B | Claude 3.7 | GPT-4o |
|----------------|-------------------------|---------|-----------------:|--------:|-------------:|-----------:|-----------:|---------:|
| **General** | MMMU (val) | Acc. | 🏆 **76.0** | 70.3 | 72.2 | 70.3 | 75.0 | 70.7 |
| | EMMA (mini-cot) | Acc. | 40.3 | 32.0 | 38.3 | 39.3 | **56.5** | 36.0 |
| | MMMU-pro | Acc. | 🏆 **55.4** | 46.9* | 48.6 | 51.1 | 50.0 | 54.5 |
| | MMK12 | Acc. | 🏆 **78.5** | 62.7* | 67.4* | 70.5* | 55.3 | 49.9 |
| | MMstar | Acc. | 70.6 | 60.8 | **72.5** | 70.8 | 68.8 | 65.1 |
| | MMBench-en-1.1 | Acc. | 85.7 | 72.6* | 87.7 | **88.0** | 82.0 | 84.3 |
| | HallusionBench | Acc. | 🏆 **61.3** | 55.3* | 59.1 | 55.2 | 58.3 | 56.2 |
| **Mathematics**| MathVista (mini) | Acc. | 🏆 **77.1** | 71.4 | 72.2 | 74.8 | 66.8 | 62.9 |
| | MathVerse (vision-only) | Acc. | 🏆 **59.6** | 45.1 | 51.0 | 57.6 | 49.9* | 49.9 |
| | MathVision | Acc. | 52.6 | 35.9 | 43.1 | 38.1 | **58.6** | 31.2 |
| | WeMath (strict) | Acc. | 🏆 **56.5** | 37.7 | 46.1 | 50.6 | 48.9* | 50.6 |
| **Logic** | VisuLogic | Acc. | 🏆 **28.5** | 23.5* | 27.7 | 26.2 | 25.9 | 26.3 |
| | LogicVista | Acc. | 59.7 | 53.8 | 55.9 | 57.1 | 60.6* | **64.4** |
| | MME-reasoning | Acc. | 🏆 **42.8** | 35.2 | 32.1 | 34.1 | 34.1 | 30.2 |
| **Physics** | PhyX (mc-text-minimal) | Acc. | 🏆 **52.8** | 35.2* | 40.5 | 44.8 | 41.6 | 43.8 |
| | SeePhys | Acc. | 31.5 | 22.5 | 19.0* | 24.2 | **34.6** | 21.9 |

🏆 marks benchmarks where Skywork-R1V3 is the top performer.
[*] indicates results from our own evaluation framework.
## 4. Usage
For the detailed inference code and evaluation scripts, please refer to our [GitHub](https://github.com/SkyworkAI/Skywork-R1V).
### Run the Inference Script
**Hugging Face Transformers inference**
```python
import argparse

import torch
from transformers import AutoModel, AutoTokenizer

# `load_image` and `split_model` are provided in the Skywork-R1V GitHub repo.
from utils import load_image, split_model


def main():
    parser = argparse.ArgumentParser(description="Run inference with the Skywork-R1V model.")
    parser.add_argument('--model_path', type=str, default='Skywork/Skywork-R1V3-38B', help="Path to the model.")
    parser.add_argument('--image_paths', type=str, nargs='+', required=True, help="Path(s) to the image(s).")
    parser.add_argument('--question', type=str, required=True, help="Question to ask the model.")
    args = parser.parse_args()

    device_map = split_model(args.model_path)
    model = AutoModel.from_pretrained(
        args.model_path,
        torch_dtype=torch.bfloat16,
        load_in_8bit=False,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True,
        device_map=device_map
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True, use_fast=False)

    pixel_values = [load_image(img_path, max_num=12).to(torch.bfloat16).cuda() for img_path in args.image_paths]
    if len(pixel_values) > 1:
        num_patches_list = [img.size(0) for img in pixel_values]
        pixel_values = torch.cat(pixel_values, dim=0)
    else:
        pixel_values = pixel_values[0]
        num_patches_list = None

    # One "<image>" placeholder per input image, followed by the question.
    prompt = "<image>\n" * len(args.image_paths) + args.question
    generation_config = dict(max_new_tokens=64000, do_sample=True, temperature=0.6, top_p=0.95, repetition_penalty=1.05)
    response = model.chat(tokenizer, pixel_values, prompt, generation_config, num_patches_list=num_patches_list)
    print(f'User: {args.question}\nAssistant: {response}')


if __name__ == '__main__':
    main()
```
**vLLM inference**
```shell
python -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --max-model-len 32768 --limit-mm-per-prompt "image=20" --tensor-parallel-size $N_GPU --dtype auto --trust-remote-code
```
---
## 5. Citation
If you use Skywork-R1V in your research, please cite:
```bibtex
@misc{shen2025skyworkr1v3technicalreport,
title={Skywork-R1V3 Technical Report},
author={Wei Shen and Jiangbo Pei and Yi Peng and Xuchen Song and Yang Liu and Jian Peng and Haofeng Sun and Yunzhuo Hao and Peiyu Wang and Jianhao Zhang and Yahui Zhou},
year={2025},
eprint={2507.06167},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.06167},
}
```
## 6. License
This project is released under the MIT License. It uses [InternVL3-38B](https://huggingface.co/OpenGVLab/InternVL3-38B) as the base model, which is also licensed under the MIT License.