---
license: apache-2.0
datasets:
- OpenDataArena/MMFineReason-1.8M
language:
- en
pipeline_tag: visual-question-answering
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---

<div align="center">
<h1>MMFineReason</h1>
<p><strong>Closing the Multimodal Reasoning Gap via Open Data-Centric Methods</strong></p>
</div>

<div align="center">

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2601.21821)
[![Homepage](https://img.shields.io/badge/Homepage-MMFineReason-blue)](https://mmfinereason.github.io/)
[![Collections](https://img.shields.io/badge/🤗-Collections-yellow)](https://huggingface.co/collections/OpenDataArena/mmfinereason)

</div>

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/model_compare.png" width="100%" alt="Model Performance Comparison">
<figcaption><em>Average score across mathematical reasoning and multimodal understanding benchmarks.</em></figcaption>
</figure>

---

This repository provides **MMFineReason-4B**; detailed dataset information is available at [OpenDataArena/MMFineReason-1.8M](https://huggingface.co/datasets/OpenDataArena/MMFineReason-1.8M).

## 📖 Overview

**MMFineReason** is a large-scale, high-quality multimodal reasoning dataset comprising **1.8M samples** and **5.1B solution tokens**, featuring detailed reasoning annotations distilled from **Qwen3-VL-235B-A22B-Thinking**.

### 🎯 Key Highlights

- **1.8M High-Quality Samples** with **5.1B Solution Tokens**
- **Long-Form CoT**: Average reasoning length of **2,910 tokens** (2.7× HoneyBee, 4.3× OpenMMReasoner)
- **100% Caption Coverage**: Dense visual descriptions averaging 609 tokens
- **Multi-Domain**: Mathematics (79.4%), Science (13.8%), Puzzle/Game (4.6%), General/OCR (2.2%)
- **State-of-the-Art**: Models trained on this dataset achieve SOTA performance in their size class

## 🧠 Model Training

Based on the MMFineReason dataset, we train a family of multimodal reasoning models at 2B / 4B / 8B scales, all initialized from the corresponding Qwen3-VL-Instruct backbones and fine-tuned with a unified data-centric training recipe.

Each MMFineReason model is trained in two stages:

- **Supervised Fine-Tuning (SFT)** on MMFineReason-1.8M-SFT, leveraging long-form, visually grounded Chain-of-Thought (CoT) annotations with an average length of 2,910 tokens.

- **Reinforcement Learning (RL)** using GSPO, applied on MMFineReason-1.8M-RL to further improve reasoning reliability and generalization.
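The RL stage above can be sketched as a group-relative, sequence-level clipped surrogate. The snippet below is a minimal illustration assuming the standard GSPO formulation (length-normalized sequence importance ratios combined with GRPO-style group-normalized advantages); the clip range and all numbers are illustrative, not the hyperparameters or rewards used in training.

```python
import math

def gspo_objective(logp_new, logp_old, lengths, rewards, clip_eps=0.2):
    """Sketch of a GSPO-style surrogate for one prompt's sampled response group.

    logp_new / logp_old: total log-probability of each sampled response under
    the current and behavior policies; lengths: token counts per response;
    rewards: scalar rewards. clip_eps is illustrative only.
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / n) ** 0.5
    # Group-relative (GRPO-style) normalized advantages
    adv = [(r - mean_r) / (std_r + 1e-8) for r in rewards]
    total = 0.0
    for lp_new, lp_old, length, a in zip(logp_new, logp_old, lengths, adv):
        # Sequence-level, length-normalized importance ratio:
        # s_i = (pi_new(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
        s = math.exp((lp_new - lp_old) / length)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # PPO-style clipped surrogate, applied per sequence rather than per token
        total += min(s * a, s_clipped * a)
    return total / n

# Fully on-policy case: every ratio is 1, so the objective reduces to the
# mean normalized advantage, which is ~0 by construction.
print(gspo_objective([-50.0, -60.0], [-50.0, -60.0], [100, 120], [1.0, 0.0]))
```

Clipping at the sequence level (rather than per token) is the key difference from a token-level PPO/GRPO update: one off-policy response contributes a single bounded ratio instead of many.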

---

## 📊 Model Performance

### Main Results

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_main_results.png" width="100%" alt="Main Benchmark Results">
<figcaption><em>Comparison of MMFineReason models with state-of-the-art models.</em></figcaption>
</figure>

MMFineReason-4B surpasses Qwen3-VL-8B-Thinking (73.9 vs. 72.5), while MMFineReason-8B (MFR-8B) outperforms the larger Qwen3-VL-30B-A3B-Thinking (75.7 vs. 74.5) and exceeds Gemini-2.5-Flash. On mathematical benchmarks, MFR-8B achieves 83.4% on DynaMath (vs. Qwen3-VL-32B-Thinking's 82.0%) and 67.1% on MathVision, outperforming HoneyBee-8B and OMR-7B by 23-30 points. Despite minimal chart training data, MFR-8B generalizes well to CharXiv (90.8%) and RealWorldQA (75.6%).

### SFT vs. RL Training Analysis

<figure align="center">
<img src="https://raw.githubusercontent.com/mmfinereason/mmfinereason.github.io/main/static/images/table_sft_rl_results.png" width="100%" alt="SFT vs RL Results">
<figcaption><em>Results comparing MFR-SFT and MFR-Thinking models against base Qwen3-VL variants.</em></figcaption>
</figure>

SFT drives major gains in mathematical reasoning (e.g., MathVision: 53.9% → 67.6% for 8B). RL enhances generalization on understanding benchmarks (e.g., AI2D: 78.5% → 82.5% for 2B) while showing variance on math benchmarks.

## 🏆 Model Zoo

| Model | Parameters | Avg Score | HuggingFace |
|-------|------------|-----------|-------------|
| MMFineReason-2B | 2B | 65.3 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-2B) |
| MMFineReason-4B | 4B | 73.9 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-4B) |
| MMFineReason-8B | 8B | 75.7 | [🤗 Link](https://huggingface.co/OpenDataArena/MMFineReason-8B) |
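
## 🚀 Quick Start

A minimal inference sketch for the checkpoints above. The class names follow the generic `transformers` image-text chat API (`AutoProcessor` / `AutoModelForImageTextToText`) rather than anything stated in this card, and the image URL and question in the usage note are placeholders; consult the Qwen3-VL documentation for the officially recommended pipeline.

```python
def build_messages(image_url: str, question: str) -> list:
    """Single-turn VQA conversation in the standard chat-template format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

def generate_answer(image_url: str, question: str,
                    model_id: str = "OpenDataArena/MMFineReason-4B") -> str:
    # Imported lazily so build_messages stays dependency-free.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(image_url, question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # Generous budget: the reasoning traces average ~2,910 tokens.
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("https://example.com/problem.png", "What is the area of the shaded region?")` downloads the checkpoint on first use; a CUDA device is strongly recommended given the long reasoning traces.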

---

## 📚 Citation

```bibtex
@article{lin2026mmfinereason,
  title={MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods},
  author={Lin, Honglin and Liu, Zheng and Zhu, Yun and Qin, Chonghan and Lin, Juekai and Shang, Xiaoran and He, Conghui and Zhang, Wentao and Wu, Lijun},
  journal={arXiv preprint arXiv:2601.21821},
  year={2026},
  url={https://mmfinereason.github.io/}
}
```

---

## 📄 License

This model is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0). Individual source datasets may have their own licenses.

---

## 🤝 Acknowledgments

We thank the creators of FineVision, MMR1, BMMR, Euclid30K, GameQA-140K, LLaVA-CoT, WeMath, ViRL39K, and others. We also thank the Qwen team for the powerful Qwen3-VL series models.