---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
---
# Innovator-VL-8B-Instruct
## Model Summary
**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning. The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance through high-quality instruction tuning, without additional continued pretraining on scientific text.
---
## Model Architecture
<img src="assets/innovator_vl_architecture.png" width="600"/>
- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters
The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
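Since the model accepts multi-image inputs, a single user turn can carry several image entries in the same chat format used in the inference example below. A minimal sketch of such a request (the file paths are placeholders, not files shipped with the model):

```python
# Hypothetical two-image comparison request; image paths are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "figure_a.png"},
            {"type": "image", "image": "figure_b.png"},
            {"type": "text", "text": "Compare the trends shown in these two figures."},
        ],
    }
]

# Each image entry becomes one visual input; the text part carries the instruction.
num_images = sum(1 for part in messages[0]["content"] if part["type"] == "image")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the single-image example.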
---
## Training Overview
- **Multimodal Alignment**: LLaVA-1.5 (558K)
- **Mid-training**: LLaVA-OneVision-1.5 (85M)
- **Instruction Tuning**: High-quality multimodal and scientific instruction data (~46M)
No additional continued pretraining on scientific text is applied.
---
## Intended Use
- Scientific image understanding and question answering
- Multimodal reasoning and analysis
- Interpretation of scientific figures, charts, and experimental results
- General-purpose vision-language instruction following
---
## Inference Example
Below is a minimal example to run multimodal inference (image + text) with `transformers`.
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info
model_path = "InnovatorLab/Innovator-VL-8B-Instruct"
# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
# Load processor
processor = AutoProcessor.from_pretrained(
model_path,
trust_remote_code=True,
)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
# Move inputs to the same device as the model
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)
print(output_text)
```
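The slicing step above (`out_ids[len(in_ids):]`) is needed because `model.generate` returns the prompt tokens followed by the newly generated tokens; trimming removes the echoed prompt before decoding. The same logic on plain token-ID lists, with made-up IDs for illustration:

```python
# Toy token IDs: the prompt is echoed at the start of each generated sequence.
prompt_ids = [[101, 7592, 2088]]
generated_ids = [[101, 7592, 2088, 2023, 2003, 102]]

# Keep only the tokens produced after the prompt, per sequence in the batch.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated_ids)]
# trimmed == [[2023, 2003, 102]]
```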
---
## Limitations
- The Instruct version is not explicitly optimized for long-chain reasoning efficiency.
- For tasks requiring structured or token-efficient reasoning, a dedicated Thinking or RL-aligned model is recommended.
---
## Citation
```bibtex
@article{wen2026innovator,
title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
journal={arXiv preprint arXiv:2601.19325},
year={2026}
}
```