Skywork/SkyworkVL-38B

@@ -32,7 +32,7 @@ This repository contains Diffusers-format model weights for **SkyworkVL-38B**, a
 ### 1. Multi-Resolution Processing
-- **Innovative Image Tiling:** Images are processed at multiple resolutions. For each resolution, we apply Closest Aspect Ratio Matching to partition the image into tiles. Finally, the original image is resized into a tile and appended to the final representation—ensuring comprehensive image understanding.
 ### 2. Multi-Stage Supervised Fine-Tuning (SFT)
@@ -42,11 +42,11 @@ This repository contains Diffusers-format model weights for **SkyworkVL-38B**, a
 ### 3. High-Quality Chain-of-Thought (CoT) Fine-Tuning
-- **Enhanced Reasoning:** Integrates high-quality CoT data including self-collected multimodal Chinese Gaokao data with detailed analysis to boost the model’s reasoning capability.
 ### 4. GRPO + Rule-Based Reward Training
-- **Performance Boost:** Utilizes GRPO and rule-based reward training to further refine output quality and overall performance.
 ## Model Introduction
@@ -58,8 +58,8 @@ This repository contains Diffusers-format model weights for **SkyworkVL-38B**, a
 | Metric                      | MathVista (testmini) | MMMU (val)      | AI2D (BBox)     | OCRBench      | MME            | **RealWorldQA** | **HallusionBench** |
 | --------------------------- | -------------------- | --------------- | --------------- | ------------- | -------------- | --------------- | ------------------ |
-| InternVL2.5-38B (official)  | 71.9                 | 63.9            | 87.6            | 842           | 2455           | 73.5            | 56.8               |
-| **SkyworkVL-38B (Current)** | **74.4 (+2.5)**      | **64.0 (+0.1)** | **88.4 (+0.8)** | **854 (+12)** | **2479 (+24)** | **76.9 (+3.4)** | **58.9 (+2.1)**    |
 *The performance improvements above demonstrate notable gains in multi-disciplinary question answering, object detection (BBox), and scientific chart analysis among other benchmarks.*

 ### 1. Multi-Resolution Processing
+- Images are processed at multiple resolutions. For each resolution, we apply Closest Aspect Ratio Matching to partition the image into tiles. Finally, the original image is resized into a tile and appended to the final representation—ensuring comprehensive image understanding.
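
The tiling step described in this bullet can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: the tile size `TILE = 448`, the per-resolution tile budgets, the tie-break rule, and the function names `closest_grid`, `tile_boxes`, and `plan_tiles` are all assumptions introduced here. It only plans the crops with plain arithmetic; the actual resizing and cropping would be done by an image library.

```python
from typing import Dict, List, Tuple

TILE = 448  # assumed tile side in pixels; the README does not state one

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def closest_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio is closest to the image's."""
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest ratio wins; among equal ratios prefer the grid with more tiles.
    # The tie-break is an assumption made for this sketch.
    return min(candidates, key=lambda g: (abs(target - g[0] / g[1]), -g[0] * g[1]))


def tile_boxes(cols: int, rows: int) -> List[Box]:
    """Crop boxes inside an image already resized to (cols*TILE, rows*TILE)."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


def plan_tiles(width: int, height: int, budgets: Tuple[int, ...] = (4, 12)) -> List[Dict]:
    """One tiling per resolution level, then a single global tile at the end."""
    plan = []
    for max_tiles in budgets:  # each budget stands for one resolution level
        cols, rows = closest_grid(width, height, max_tiles)
        plan.append({
            "grid": (cols, rows),
            "resize_to": (cols * TILE, rows * TILE),
            "boxes": tile_boxes(cols, rows),
        })
    # Final step from the text: the whole image resized into one tile.
    plan.append({"grid": (1, 1), "resize_to": (TILE, TILE), "boxes": tile_boxes(1, 1)})
    return plan
```

For an 800x400 image with budgets `(2, 8)`, this plan uses a 2x1 grid at the first level, a 4x2 grid at the second, and one global tile.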
 ### 2. Multi-Stage Supervised Fine-Tuning (SFT)
 ### 3. High-Quality Chain-of-Thought (CoT) Fine-Tuning
+- Integrates high-quality CoT data including self-collected multimodal Chinese Gaokao data with detailed analysis to boost the model’s reasoning capability.
 ### 4. GRPO + Rule-Based Reward Training
+- Utilizes GRPO and rule-based reward training to further refine output quality and overall performance.
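
As a rough illustration of how a rule-based reward feeds GRPO, the sketch below scores a group of sampled responses with a deterministic rule and converts the scores into group-relative advantages (each reward minus the group mean, divided by the group standard deviation). The README does not describe the actual reward rules or hyperparameters, so the `\boxed{...}` answer convention, the exact-match rule, the use of the population standard deviation, and the names `extract_answer`, `rule_reward`, and `group_advantages` are assumptions made for this example.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional


def extract_answer(text: str) -> Optional[str]:
    # Hypothetical rule: the final answer is wrapped in \boxed{...}.
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1).strip() if match else None


def rule_reward(response: str, reference: str) -> float:
    # 1.0 for an exact match with the reference answer, otherwise 0.0.
    return 1.0 if extract_answer(response) == reference.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize rewards within one group of responses to the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(responses: List[str], reference: str) -> List[float]:
    # Reward every sampled response, then compute group-relative advantages.
    return group_advantages([rule_reward(r, reference) for r in responses])
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.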
 ## Model Introduction
 | Metric                      | MathVista (testmini) | MMMU (val)      | AI2D (BBox)     | OCRBench      | MME            | **RealWorldQA** | **HallusionBench** |
 | --------------------------- | -------------------- | --------------- | --------------- | ------------- | -------------- | --------------- | ------------------ |
+| InternVL2.5-38B             | 71.9                 | 63.9            | 87.6            | 842           | 2455           | 73.5            | 56.8               |
+| SkyworkVL-38B               | **74.4**             | **64.0**        | **88.4**        | **854**       | **2479**       | **76.9**        | **58.9**           |
 *The performance improvements above demonstrate notable gains in multi-disciplinary question answering, object detection (BBox), and scientific chart analysis among other benchmarks.*