---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVL2_5-2B
pipeline_tag: image-text-to-text
---

# SkyworkVL-2B: Multimodal Understanding with Bag of Tricks

---

## Introduction

**SkyworkVL-2B** is a vision-language model (VLM) trained on 2 million high-quality caption and QA samples. Leveraging innovative techniques across multiple training stages, it delivers strong performance on a range of vision-language tasks.

## 🔑 Key Features

### 1. Multi-Resolution Processing

- Images are processed at multiple resolutions. At each resolution (from high to low), Closest Aspect Ratio Matching partitions the image into tiles. Finally, the original image is resized into a single tile and appended to the final representation, preserving a global view for comprehensive image understanding.
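The tiling step above can be sketched as follows. This is a minimal illustration, not the model's exact preprocessing: the helper names, the 448-pixel tile size, and the tile-count bounds are assumptions.

```python
def closest_aspect_ratio(width, height, min_tiles=1, max_tiles=12):
    """Choose the tiling grid (cols, rows) whose aspect ratio best matches the image.

    Hypothetical sketch: the real preprocessing may use different tile bounds.
    """
    target = width / height
    # Enumerate every grid whose tile count falls within the allowed range.
    candidates = sorted(
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if min_tiles <= c * r <= max_tiles
    )
    # Pick the grid whose cols/rows ratio is closest to the image's aspect ratio.
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))


def tile_boxes(width, height, tile_size=448):
    """Return crop boxes for each grid cell, plus one thumbnail of the whole image."""
    cols, rows = closest_aspect_ratio(width, height)
    # Boxes assume the image has been resized to (cols*tile_size, rows*tile_size).
    boxes = [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]
    boxes.append((0, 0, tile_size, tile_size))  # resized whole image as a global tile
    return boxes
```

At inference time, the image would be resized to the matched grid, cropped with these boxes, and the appended thumbnail tile gives the model the global view described above.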

### 2. Multi-Stage Supervised Fine-Tuning (SFT)

- **Stage 1:** Fine-tuning on the full dataset.
- **Stage 2:** Refinement on a curated subset of 200K high-scoring samples filtered by GPT-4 evaluations.
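The stage-2 data selection can be sketched as a simple top-k filter. The helper name and the `score` field (a numeric judge rating, e.g. from GPT-4) are illustrative assumptions:

```python
def select_stage2_subset(samples, k=200_000):
    """Keep the k highest-scoring samples for stage-2 refinement.

    Hypothetical sketch: assumes each sample dict carries a numeric "score"
    assigned by an external judge such as GPT-4.
    """
    return sorted(samples, key=lambda s: s["score"], reverse=True)[:k]
```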

### 3. High-Quality Chain-of-Thought (CoT) Fine-Tuning

- Fine-tuning on 40K high-quality CoT samples, including self-collected multimodal Chinese Gaokao data with detailed analyses, to boost the model's reasoning capability.

## Model Overview

| Model Name             | Base Model               | Parameters | Download Link                                              |
| ---------------------- | ------------------------ | ---------- | ---------------------------------------------------------- |
| SkyworkVL-2B (Current) | OpenGVLab/InternVL2_5-2B | 2B         | 🤗 [Download](https://huggingface.co/Skywork/SkyworkVL-2B) |

## Performance

| Model          | MathVista (testmini) | MMMU (val) | AI2D     | OCRBench | MME        | **RealWorldQA** | **HallusionBench** |
| -------------- | -------------------- | ---------- | -------- | -------- | ---------- | --------------- | ------------------ |
| Qwen2-VL-2B    | 47.8                 | 42.2       | 74.7     | 797      | 1899.1     | 60.7            | 42.4               |
| InternVL2.5-2B | 51.3                 | 43.6       | 74.9     | 804      | **2138.2** | 60.1            | 42.6               |
| SkyworkVL-2B   | **62.8**             | **44.1**   | **76.7** | **817**  | 1937       | **64.8**        | **44.3**           |

*The results above show notable gains in multi-disciplinary question answering, object detection (BBox), and scientific chart analysis, among other benchmarks.*

## Usage

Please refer to the [Guide](https://github.com/YourGitHub/SkyworkVL-2B) for detailed instructions on inference and integration.
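Since SkyworkVL-2B is built on InternVL2_5-2B, inference presumably follows the InternVL-style `model.chat` interface. The snippet below is a minimal single-tile sketch under that assumption — the exact preprocessing (e.g. the multi-resolution tiling described above) and arguments may differ, so treat the linked guide as authoritative:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "Skywork/SkyworkVL-2B"
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Minimal single-tile preprocessing (ImageNet normalization, 448x448), standing in
# for the full multi-resolution tiling pipeline.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe the image in detail."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512, do_sample=False))
print(response)
```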

## Citation

```bibtex
@misc{SkyworkVL,
  author       = {Skywork-VL Team},
  title        = {SkyworkVL-2B: Multimodal Understanding with Bag of Tricks},
  year         = {2025},
  publisher    = {Huggingface},
  journal      = {Huggingface repository},
  howpublished = {\url{https://huggingface.co/YourRepo/SkyworkVL-2B}}
}
```