Update model card metadata and add links to paper/code
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,13 +1,16 @@
 ---
-license: mit
 language:
 - en
 - zh
-
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
 # Innovator-VL-8B-Thinking
 
+[[Paper](https://huggingface.co/papers/2601.19325)] [[Project Page](https://innovatorlm.github.io/Innovator-VL)] [[GitHub](https://github.com/InnovatorLM/Innovator-VL)] [[Demo](https://huggingface.co/spaces/InnovatorLab/Innovator-VL)]
+
 ## Introduction
 
 **Innovator-VL-8B-Thinking** is a multimodal reasoning-oriented large
@@ -41,16 +44,18 @@ for reasoning-intensive multimodal scenarios.
 ### Explicit Multimodal Reasoning
 
 Innovator-VL-8B-Thinking is trained to explicitly generate structured
-reasoning traces, enabling the model to:
-deduction grounded in visual evidence
-
-contexts
+reasoning traces, enabling the model to:
+- Perform multi-step logical deduction grounded in visual evidence
+- Solve complex mathematical and scientific problems
+- Maintain reasoning consistency across long contexts
 
 ### Reinforcement Learning for Long-Horizon Reasoning
 
 The model is further optimized using reinforcement learning to
-improve:
-
+improve:
+- Reasoning correctness
+- Output consistency
+- Token efficiency in long chain-of-thought generation
 
 Sequence-level optimization enables strong accuracy while significantly
 reducing unnecessary reasoning tokens.
@@ -58,15 +63,16 @@ reducing unnecessary reasoning tokens.
 ### Scientific Reasoning Performance
 
 Compared to instruction-only models, Innovator-VL-8B-Thinking
-demonstrates substantial gains on:
-
-
+demonstrates substantial gains on:
+- Multimodal mathematical reasoning benchmarks
+- Scientific reasoning and domain-specific QA
+- Tasks requiring precise step-by-step analysis
 
 ------------------------------------------------------------------------
 
 ## Model Architecture
 
-<img src="assets/innovator_vl_architecture.png" width="600"/>
+<img src="https://huggingface.co/InnovatorLab/Innovator-VL-8B-Thinking/resolve/main/assets/innovator_vl_architecture.png" width="600"/>
 
 - **Vision Encoder**: RICE-ViT (region-aware visual representation)
 - **Projector**: PatchMerger for visual token compression
@@ -103,12 +109,12 @@ stage.
 
 ## Usage Recommendations
 
-This model is recommended for:
-
-
+This model is recommended for:
+- Multimodal mathematical reasoning
+- Scientific problem solving requiring explicit reasoning
+- Evaluation settings emphasizing chain-of-thought quality
 
-For general instruction-following or latency-sensitive applications, the
-Instruct version is recommended.
+For general instruction-following or latency-sensitive applications, the Instruct version is recommended.
 
 ------------------------------------------------------------------------
 
@@ -154,7 +160,9 @@ messages = [
                 "type": "image",
                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
             },
-            {"type": "text", "text": f"{THINKING_PROMPT}
+            {"type": "text", "text": f"{THINKING_PROMPT}
+
+{question}"},
         ],
     }
]