---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVL2_5-2B
pipeline_tag: image-text-to-text
---

# SkyworkVL-2B: Multimodal Understanding with Bag of Tricks

---

## Introduction

**SkyworkVL-2B** is a vision-language model (VLM) trained on 2 million high-quality caption and QA samples. Leveraging a bag of training tricks across multiple stages, the model delivers strong performance on a range of vision-language tasks.

## 🔑 Key Features

### 1. Multi-Resolution Processing

- Images are processed at multiple resolutions. For each resolution (from high to low), we apply Closest Aspect Ratio Matching to partition the image into tiles. Finally, the original image is resized into a single thumbnail tile and appended to the final representation, preserving both fine-grained detail and global image understanding.

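The tiling step described above can be sketched as follows. This is an illustrative sketch of InternVL-style dynamic tiling, which the description suggests; the function names, the 448-pixel tile size, and the `max_tiles` budget are assumptions, not the released preprocessing code.

```python
# Illustrative sketch of Closest Aspect Ratio Matching: pick the (cols, rows)
# tile grid whose aspect ratio is closest to the input image's, then append a
# thumbnail of the full image as one extra tile. Names and the max_tiles
# budget are hypothetical, not the released code.

def closest_aspect_ratio_grid(width, height, max_tiles=12):
    """Choose a (cols, rows) tile grid whose aspect ratio best matches the image."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):  # keep cols*rows <= max_tiles
            diff = abs(cols / rows - target)
            # Prefer the closest ratio; break ties toward more tiles (more detail).
            if diff < best_diff or (diff == best_diff and cols * rows > best[0] * best[1]):
                best, best_diff = (cols, rows), diff
    return best

def tile_image(width, height, tile=448, max_tiles=12):
    """Return the tile boxes for the matched grid, plus one thumbnail tile."""
    cols, rows = closest_aspect_ratio_grid(width, height, max_tiles)
    # The image is resized to (cols*tile, rows*tile) and cut into tile x tile patches.
    boxes = [(c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
             for r in range(rows) for c in range(cols)]
    boxes.append(("thumbnail", 0, 0, tile, tile))  # full image resized to one tile
    return boxes
```

For a 1600x800 image with a budget of 12 tiles, the matched grid is 4x2 (aspect ratio 2.0), yielding 8 detail tiles plus the thumbnail.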
### 2. Multi-Stage Supervised Fine-Tuning (SFT)

- **Stage 1:** Fine-tuning on the full dataset.
- **Stage 2:** Refinement on a curated subset of 200K high-scoring samples filtered by GPT-4 evaluations.

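The Stage 2 curation step amounts to a score-based filter over the SFT data. The sketch below is a hypothetical illustration: the `score` field stands in for the GPT-4 quality evaluations, and the helper name is not from the team's actual pipeline.

```python
# Hypothetical sketch of the Stage 2 data curation: rank SFT samples by a
# quality score (standing in for the GPT-4 evaluations) and keep the top 200K.
# The `score` field and function name are illustrative, not the released code.

def select_top_samples(samples, k=200_000):
    """Return the k highest-scoring samples for the Stage 2 refinement pass."""
    return sorted(samples, key=lambda s: s["score"], reverse=True)[:k]
```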
### 3. High-Quality Chain-of-Thought (CoT) Fine-Tuning

- Fine-tuning on 40K high-quality CoT samples, including self-collected multimodal Chinese Gaokao (college entrance exam) data with detailed analyses, to boost the model's reasoning capability.

## Model Introduction

| Model Name             | Base Model               | Parameters | Download Link                                              |
| ---------------------- | ------------------------ | ---------- | ---------------------------------------------------------- |
| SkyworkVL-2B (Current) | OpenGVLab/InternVL2_5-2B | 2B         | 🤗 [Download](https://huggingface.co/Skywork/SkyworkVL-2B) |

## Performance

| Model          | MathVista (testmini) | MMMU (val) | AI2D     | OCRBench | MME        | RealWorldQA | HallusionBench |
| -------------- | -------------------- | ---------- | -------- | -------- | ---------- | ----------- | -------------- |
| Qwen2-VL-2B    | 47.8                 | 42.2       | 74.7     | 797      | 1899.1     | 60.7        | 42.4           |
| InternVL2.5-2B | 51.3                 | 43.6       | 74.9     | 804      | **2138.2** | 60.1        | 42.6           |
| SkyworkVL-2B   | **62.8**             | **44.1**   | **76.7** | **817**  | 1937       | **64.8**    | **44.3**       |

*The results above show notable gains in multi-disciplinary question answering, object detection (BBox), and scientific chart analysis, among other benchmarks.*

## Usage

Please refer to the [Guide](https://github.com/YourGitHub/SkyworkVL-2B) for detailed instructions on inference and integration.

## Citation

```bibtex
@misc{SkyworkVL,
  author       = {Skywork-VL Team},
  title        = {SkyworkVL-2B: Multimodal Understanding with Bag of Tricks},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/YourRepo/SkyworkVL-2B}}
}
```