---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
---

# Innovator-VL-8B-Thinking

[Paper](https://huggingface.co/papers/2601.19325) | [Code](https://github.com/InnovatorLM/Innovator-VL)

## Introduction

**Innovator-VL-8B-Thinking** is a multimodal reasoning-oriented large
language model designed for complex scientific problem solving. Built
upon Innovator-VL-8B-Instruct, this model is further optimized for
explicit multi-step reasoning, long-horizon chain-of-thought generation,
and token-efficient scientific analysis.

The model is particularly suitable for scientific tasks that require
structured reasoning over visual and textual evidence, such as
mathematics, chemistry, materials science, and multimodal scientific
benchmarks.

------------------------------------------------------------------------

## Model Overview

-   **Model Type**: Vision-Language Reasoning Model
-   **Parameter Size**: 8B
-   **Base Language Model**: Qwen3-8B-Base
-   **Vision Encoder**: RICE-ViT
-   **Projector**: PatchMerger

The model supports native-resolution multi-image inputs and is optimized
for reasoning-intensive multimodal scenarios.
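
As a sketch of what multi-image input looks like, the same `messages` schema used in the inference example later in this card simply carries several image entries in one user turn. The image URLs below are placeholders, not real assets:

```python
# Hypothetical multi-image request in the chat format used by the
# inference example in this card. URLs are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/figure_a.png"},
            {"type": "image", "image": "https://example.com/figure_b.png"},
            {
                "type": "text",
                "text": "Compare the two figures and explain the difference.",
            },
        ],
    }
]

# Count the image entries that the vision preprocessor would pick up.
num_images = sum(
    1
    for turn in messages
    for item in turn["content"]
    if item["type"] == "image"
)
print(num_images)  # → 2
```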

------------------------------------------------------------------------

## Key Characteristics

### Explicit Multimodal Reasoning

Innovator-VL-8B-Thinking is trained to explicitly generate structured
reasoning traces, enabling the model to:

-   Perform multi-step logical deduction grounded in visual evidence
-   Solve complex mathematical and scientific problems
-   Maintain reasoning consistency across long contexts

### Reinforcement Learning for Long-Horizon Reasoning

The model is further optimized using reinforcement learning to improve:

-   Reasoning correctness
-   Output consistency
-   Token efficiency in long chain-of-thought generation

Sequence-level optimization enables strong accuracy while significantly
reducing unnecessary reasoning tokens.
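
As a rough illustration of what "token efficiency" means here, the share of a response spent inside the reasoning trace can be compared across outputs. This is a back-of-the-envelope sketch, not part of the training pipeline; it uses whitespace tokens as a crude stand-in for model tokens, and real accounting should use the model's own tokenizer.

```python
import re

def reasoning_token_share(text):
    """Approximate the fraction of (whitespace-delimited) tokens spent
    inside the <think>...</think> block. A crude proxy only."""
    total = len(text.split())
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not think or total == 0:
        return 0.0
    return len(think.group(1).split()) / total

# A verbose trace spends most of its tokens on reasoning;
# a concise one spends far fewer.
verbose = "<think>" + "step " * 90 + "</think> <answer>4</answer>"
concise = "<think>two plus two is four</think> <answer>4</answer>"
print(reasoning_token_share(verbose) > reasoning_token_share(concise))
```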

### Scientific Reasoning Performance

Compared to instruction-only models, Innovator-VL-8B-Thinking
demonstrates substantial gains on:

-   Multimodal mathematical reasoning benchmarks
-   Scientific reasoning and domain-specific QA
-   Tasks requiring precise step-by-step analysis

------------------------------------------------------------------------

## Model Architecture

<img src="assets/innovator_vl_architecture.png" width="600"/>

-   **Vision Encoder**: RICE-ViT (region-aware visual representation)
-   **Projector**: PatchMerger for visual token compression
-   **Language Model**: Qwen3-8B-Base
-   **Model Size**: 8B parameters

The architecture is shared with the Instruct variant, while the
optimization objective and training strategy differ at the post-training
stage.

------------------------------------------------------------------------

## Training Pipeline

### Multimodal Pre-training

-   Vision-language alignment with LLaVA-1.5 (558K)
-   Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)

### Instruction Initialization

-   Initialized from Innovator-VL-8B-Instruct
-   Supervised fine-tuning with multimodal instruction and reasoning
    data

### Reinforcement Learning

-   Trained with Innovator-VL-RL-172K
-   Optimized using Group Sequence Policy Optimization (GSPO)
-   Reward design jointly considers reasoning structure and answer
    correctness

------------------------------------------------------------------------

## Usage Recommendations

This model is recommended for:

-   Multimodal mathematical reasoning
-   Scientific problem solving requiring explicit reasoning
-   Evaluation settings emphasizing chain-of-thought quality

For general instruction-following or latency-sensitive applications, the
Instruct version is recommended.

------------------------------------------------------------------------

## Inference Example (Thinking Prompt)

Below is a minimal example to run multimodal inference (image + text)
with a thinking-style prompt.

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Thinking"

THINKING_PROMPT = (
    "Think and solve the following question step by step. "
    "Please put your thinking and analysis procedure within <think></think>. "
    "Put ONLY your final answer within <answer></answer>."
)

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

question = "Describe this image."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": f"{THINKING_PROMPT}\n\n{question}"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```
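
Since the thinking prompt asks the model to wrap its reasoning in `<think></think>` and its final answer in `<answer></answer>`, the decoded output can be split with a small helper. This is a minimal sketch: the tag names follow the prompt above, and real model outputs may occasionally omit or malform the tags, so both fields are returned as `None` when missing.

```python
import re

def split_thinking_output(text):
    """Extract the reasoning trace and final answer from a
    <think>...</think><answer>...</answer> formatted response.
    Returns (thinking, answer); either may be None if its tag is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

sample = "<think>2 + 2 is basic arithmetic.</think><answer>4</answer>"
thinking, answer = split_thinking_output(sample)
print(answer)  # → 4
```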
------------------------------------------------------------------------

## Citation 
```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```