---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
---

# Innovator-VL-8B-Instruct

## Model Summary

**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning. The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.

Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance through high-quality instruction tuning alone, without additional continued pretraining on scientific text.

---

## Model Architecture

<img src="assets/innovator_vl_architecture.png" width="600"/>

- **Vision Encoder**: RICE-ViT (region-aware visual representation)  
- **Projector**: PatchMerger for visual token compression  
- **Language Model**: Qwen3-8B-Base  
- **Model Size**: 8B parameters  

The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
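To illustrate how a projector like PatchMerger compresses visual tokens before they enter the language model, here is a minimal, self-contained sketch. It assumes a common design (merging each 2×2 block of patch embeddings along the channel dimension, then projecting to the LM hidden size); the dimensions and merge factor below are hypothetical and the actual Innovator-VL projector may differ.

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    """Illustrative 2x2 patch merger (hypothetical; not the exact
    Innovator-VL implementation). Concatenates each 2x2 block of
    visual tokens channel-wise, then projects to the LM width,
    reducing the token count by 4x."""

    def __init__(self, vision_dim: int, lm_dim: int, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim * merge * merge),
            nn.Linear(vision_dim * merge * merge, lm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, vision_dim) grid of patch embeddings
        b, h, w, d = x.shape
        m = self.merge
        # Group the grid into (H/m) x (W/m) blocks of m*m patches each
        x = x.reshape(b, h // m, m, w // m, m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (h // m) * (w // m), m * m * d
        )
        # Project merged blocks to the language-model hidden size
        return self.proj(x)  # (batch, H*W / m^2, lm_dim)

# Example: a 16x16 grid of 1024-d patch tokens -> 64 tokens of width 4096
tokens = torch.randn(1, 16, 16, 1024)
merged = PatchMergerSketch(vision_dim=1024, lm_dim=4096)(tokens)
print(tuple(merged.shape))  # (1, 64, 4096)
```

The 4× reduction in visual tokens is what makes native-resolution, multi-image inputs affordable in the LM context window.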

---

## Training Overview

- **Multimodal Alignment**: LLaVA-1.5 (558K)
- **Mid-training**: LLaVA-OneVision-1.5 (85M)
- **Instruction Tuning**: High-quality multimodal and scientific instruction data (~46M)

No additional continued pretraining on scientific text is applied.

---

## Intended Use

- Scientific image understanding and question answering
- Multimodal reasoning and analysis
- Interpretation of scientific figures, charts, and experimental results
- General-purpose vision-language instruction following

---

## Inference Example

Below is a minimal example to run multimodal inference (image + text) with `transformers`.

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Instruct"

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the model's device (handles device_map="auto" placement)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```

---

## Limitations

- The Instruct version is not explicitly optimized for long-chain reasoning efficiency.
- For tasks requiring structured or token-efficient reasoning, a dedicated Thinking or RL-aligned model is recommended.

---

## Citation

```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```