Skywork
/

SkyworkVL-38B

+---
+license: apache-2.0
+language:
+  - en
+base_model:
+  - OpenGVLab/InternVL2_5-38B
+pipeline_tag: vision-language-model
+---
+# SkyworkVL-38B: Multimodal Understanding with Bag of Tricks
+<p align="center">
+  <img src="assets/skyworkvl_logo.png" alt="SkyworkVL Logo" width="60%">
+</p>
+<p align="center">
+  <a href="https://github.com/YourGitHub/SkyworkVL-38B" target="_blank">🌐 Github</a> · 👋 <a href="https://www.skyworkvl.ai" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/yourdiscord" target="_blank">Discord</a>
+</p>
+---
+This repository contains Diffusers-format model weights for **SkyworkVL-38B**, an advanced vision-language model built on the base model [OpenGVLab/InternVL2_5-38B](https://huggingface.co/OpenGVLab/InternVL2_5-38B).
+## Introduction
+**SkyworkVL-38B** is a state-of-the-art VLM model trained on 2 million high-quality caption and QA data samples. Leveraging innovative techniques across multiple training stages, our model delivers superior performance on a range of vision-language tasks including multi-disciplinary question answering and scientific chart analysis.
+## 🔑 Key Features
+### 1. Multi-Resolution Processing
+- **Innovative Image Tiling:** Images are processed at multiple resolutions. For each resolution, we apply Closest Aspect Ratio Matching to partition the image into tiles. Finally, the original image is resized into a tile and appended to the final representation—ensuring comprehensive image understanding.
+### 2. Multi-Stage Supervised Fine-Tuning (SFT)
+- **Stage 1:** Fine-tuning on the full dataset.
+- **Stage 2:** Refinement using a curated subset of 200K high-scoring samples filtered by GPT-4 evaluations.
+- **Stage 3:** Further fine-tuning using mispredicted samples from Stage 2 alongside a small amount of generic domain data.
+### 3. High-Quality Chain-of-Thought (CoT) Fine-Tuning
+- **Enhanced Reasoning:** Integrates high-quality CoT data including self-collected multimodal Chinese Gaokao data with detailed analysis to boost the model’s reasoning capability.
+### 4. GRPO + Rule-Based Reward Training
+- **Performance Boost:** Utilizes GRPO and rule-based reward training to further refine output quality and overall performance.
+## Model Introduction
+| Model Name              | Base Model                | Parameters | Download Link                                               |
+| ----------------------- | ------------------------- | ---------- | ----------------------------------------------------------- |
+| SkyworkVL-38B (Current) | OpenGVLab/InternVL2_5-38B | 38B        | 🤗 [Download](https://huggingface.co/YourRepo/SkyworkVL-38B) |
+## Performance
+| Metric                      | MathVista (testmini) | MMMU (val)      | AI2D (BBox)     | OCRBench      | MME            | **RealWorldQA** | **HallusionBench** |
+| --------------------------- | -------------------- | --------------- | --------------- | ------------- | -------------- | --------------- | ------------------ |
+| Internvl2.5-38B (官方)      | 71.9                 | 63.9            | 87.6            | 842           | 2455           | 73.5            | 56.8               |
+| **SkyworkVL-38B (Current)** | **74.4 (+2.5)**      | **64.0 (+0.1)** | **88.4 (+0.8)** | **854 (+12)** | **2479 (+24)** | **76.9 (+3.4)** | **58.9 (+2.1)**    |
+*The performance improvements above demonstrate notable gains in multi-disciplinary question answering, object detection (BBox), and scientific chart analysis among other benchmarks.*
+## Usage
+Please refer to the [Guide](https://github.com/YourGitHub/SkyworkVL-38B) for detailed instructions on inference and integration.
+## Citation
+```BibTeX
+@misc{SkyworkVL38B,
+  author = {Skywork-AI},
+  title = {SkyworkVL-38B: Multimodal Understanding with Bag of Tricks},
+  year = {2025},
+  publisher = {Huggingface},
+  journal = {Huggingface repository},
+  howpublished = {\url{https://huggingface.co/YourRepo/SkyworkVL-38B}}
+}
+```