---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
---
# Innovator-VL-8B-Thinking
[Paper](https://huggingface.co/papers/2601.19325) | [Code](https://github.com/InnovatorLM/Innovator-VL)
## Introduction
**Innovator-VL-8B-Thinking** is a multimodal reasoning-oriented large
language model designed for complex scientific problem solving. Built
upon Innovator-VL-8B-Instruct, this model is further optimized for
explicit multi-step reasoning, long-horizon chain-of-thought generation,
and token-efficient scientific analysis.
The model is particularly suitable for scientific tasks that require
structured reasoning over visual and textual evidence, such as
mathematics, chemistry, materials science, and multimodal scientific
benchmarks.
------------------------------------------------------------------------
## Model Overview
- **Model Type**: Vision-Language Reasoning Model
- **Parameter Size**: 8B
- **Base Language Model**: Qwen3-8B-Base
- **Vision Encoder**: RICE-ViT
- **Projector**: PatchMerger
The model supports native-resolution multi-image inputs and is optimized
for reasoning-intensive multimodal scenarios.
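Multi-image inputs use the same chat-message schema as the single-image inference example later in this card. A minimal sketch (the image paths and the question text are placeholders, not files shipped with the model):

```python
# Sketch of a multi-image chat message. The "image" entries are
# placeholder paths; the schema mirrors the single-image inference
# example later in this card. The processor handles each image at
# (near-)native resolution, so no manual resizing is needed.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "figure_panel_a.png"},
            {"type": "image", "image": "figure_panel_b.png"},
            {
                "type": "text",
                "text": "Compare the trends shown in the two panels.",
            },
        ],
    }
]

image_items = [c for c in messages[0]["content"] if c["type"] == "image"]
print(len(image_items))  # → 2
```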
------------------------------------------------------------------------
## Key Characteristics
### Explicit Multimodal Reasoning
Innovator-VL-8B-Thinking is trained to explicitly generate structured
reasoning traces, enabling the model to:
- Perform multi-step logical deduction grounded in visual evidence
- Solve complex mathematical and scientific problems
- Maintain reasoning consistency across long contexts
### Reinforcement Learning for Long-Horizon Reasoning
The model is further optimized using reinforcement learning to improve:
- Reasoning correctness
- Output consistency
- Token efficiency in long chain-of-thought generation
Sequence-level optimization enables strong accuracy while significantly
reducing unnecessary reasoning tokens.
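The card does not publish the exact reward, but a sequence-level objective that trades correctness against chain-of-thought length can be sketched as below. The reward shape, `target_budget`, and `length_penalty` are illustrative assumptions, not the released training code:

```python
def sequence_reward(is_correct: bool, num_reasoning_tokens: int,
                    target_budget: int = 512,
                    length_penalty: float = 0.001) -> float:
    """Illustrative sequence-level reward: correctness dominates, with a
    mild penalty on reasoning tokens beyond a budget. All constants here
    are assumptions for illustration."""
    reward = 1.0 if is_correct else 0.0
    overage = max(0, num_reasoning_tokens - target_budget)
    return reward - length_penalty * overage

# Under this shaping, a correct short trace outranks a correct verbose
# one, and any correct trace outranks an incorrect one.
short_correct = sequence_reward(True, 300)   # → 1.0
long_correct = sequence_reward(True, 900)    # → 0.612
wrong = sequence_reward(False, 100)          # → 0.0
```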
### Scientific Reasoning Performance
Compared to instruction-only models, Innovator-VL-8B-Thinking
demonstrates substantial gains on:
- Multimodal mathematical reasoning benchmarks
- Scientific reasoning and domain-specific QA
- Tasks requiring precise step-by-step analysis
------------------------------------------------------------------------
## Model Architecture
<img src="assets/innovator_vl_architecture.png" width="600"/>
- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters
The architecture is shared with the Instruct variant, while the
optimization objective and training strategy differ at the post-training
stage.
------------------------------------------------------------------------
## Training Pipeline
### Multimodal Pre-training
- Vision-language alignment with LLaVA-1.5 (558K)
- Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)
### Instruction Initialization
- Initialized from Innovator-VL-8B-Instruct
- Supervised fine-tuning with multimodal instruction and reasoning
data
### Reinforcement Learning
- Trained with Innovator-VL-RL-172K
- Optimized using Group Sequence Policy Optimization (GSPO)
- Reward design jointly considers reasoning structure and answer
correctness
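GSPO replaces token-level importance ratios with a length-normalized, sequence-level ratio inside a clipped surrogate objective. A minimal numerical sketch, following the published GSPO formulation with the standard group-normalized advantage (this is not the project's actual training code):

```python
import math

def gspo_sequence_loss(logp_new, logp_old, rewards, eps=0.2):
    """Illustrative GSPO step for one group of sampled responses.
    logp_new / logp_old: per-response lists of token log-probs under the
    current and behavior policies; rewards: one scalar per response."""
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    objective = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level, length-normalized importance ratio.
        ratio = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        objective += min(ratio * adv, clipped * adv)
    return -objective / n  # loss = negative clipped objective

# Sanity check: with an unchanged policy every ratio is 1 and the
# group-normalized advantages sum to zero, so the loss is zero.
identical = [[-1.0, -2.0], [-1.5, -0.5]]
loss = gspo_sequence_loss(identical, identical, [1.0, 0.0])
```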
------------------------------------------------------------------------
## Usage Recommendations
This model is recommended for:
- Multimodal mathematical reasoning
- Scientific problem solving requiring explicit reasoning
- Evaluation settings emphasizing chain-of-thought quality
For general instruction-following or latency-sensitive applications, the
Instruct version is recommended.
------------------------------------------------------------------------
## Inference Example (Thinking Prompt)
Below is a minimal example to run multimodal inference (image + text)
with a thinking-style prompt.
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Thinking"

THINKING_PROMPT = (
    "Think and solve the following question step by step. "
    "Please put your thinking and analysis procedure within <think></think>. "
    "Put ONLY your final answer within <answer></answer>."
)

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

question = "Describe this image."
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": f"{THINKING_PROMPT}\n\n{question}"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the model's device (works on both CPU and GPU)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```
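Given the prompt above, outputs are expected to wrap the reasoning trace in `<think></think>` and the final answer in `<answer></answer>`. A small helper to split the two (a convenience sketch, not part of the released library):

```python
import re

def split_thinking_output(text: str):
    """Extract the reasoning trace and final answer from a response
    formatted per THINKING_PROMPT. Returns (thinking, answer), with
    None for any tag that is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

# Example on a hypothetical model response:
sample = "<think>Area = 3 * 4 = 12.</think><answer>12</answer>"
thinking, answer = split_thinking_output(sample)
print(answer)  # → 12
```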
------------------------------------------------------------------------
## Citation
```bibtex
@article{wen2026innovator,
title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
journal={arXiv preprint arXiv:2601.19325},
year={2026}
}
``` |