---
language:
- en
- zh
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

# Innovator-VL-8B-Instruct

[**Paper**](https://huggingface.co/papers/2601.19325) | [**Project Page**](https://innovatorlm.github.io/Innovator-VL) | [**GitHub**](https://github.com/InnovatorLM/Innovator-VL)

## Model Summary

**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning. It combines strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.

Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance through high-quality instruction tuning, without continued pretraining on additional scientific text.

---

## Model Architecture

<img src="assets/innovator_vl_architecture.png" width="600"/>

- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters

The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
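
As a rough illustration of multi-image input, the sketch below builds a single-turn chat message that carries several images followed by one text prompt. It reuses the message layout shown in the Inference Example section. The helper name `build_multi_image_messages` and the file paths are hypothetical, and whether the model's processor accepts several `image` entries in exactly this form is an assumption based on that layout, not something confirmed here. If it does, the result would be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as `messages` in that example.

```python
from typing import Any


def build_multi_image_messages(image_sources: list[str], prompt: str) -> list[dict[str, Any]]:
    """Build one user turn containing several images followed by a text prompt.

    Hypothetical helper: it only assembles plain Python dicts and does not
    call the model or the processor.
    """
    if not image_sources:
        raise ValueError("at least one image is required")
    # One {"type": "image"} entry per image, in the order given.
    content: list[dict[str, Any]] = [
        {"type": "image", "image": src} for src in image_sources
    ]
    # The text instruction comes last, after all images.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_multi_image_messages(
    ["figure_a.png", "figure_b.png"],  # hypothetical local paths
    "Compare the trends shown in these two figures.",
)
print(len(messages[0]["content"]))  # 3 (two images plus one text entry)
```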

---

## Training Overview

- **Multimodal Alignment**: LLaVA-1.5 (558K)
- **Mid-training**: LLaVA-OneVision-1.5 (85M)
- **Instruction Tuning**: High-quality multimodal and scientific instruction data (~46M)

No continued pretraining on additional scientific text is applied.

---

## Intended Use

- Scientific image understanding and question answering
- Multimodal reasoning and analysis
- Interpretation of scientific figures, charts, and experimental results
- General-purpose vision-language instruction following

---

## Inference Example

Below is a minimal example of multimodal inference (image + text) with `transformers`. It also imports `process_vision_info` from the `qwen_vl_utils` module, which is provided by the `qwen-vl-utils` package.

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Instruct"

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference: render the chat template as a prompt string
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Extract the image (and video) inputs referenced in the messages
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the device that holds the model (works on CPU and GPU)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```
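
The trimming step in the example above relies on `generate` returning, for decoder-only models, each sequence with the prompt tokens first and the newly generated tokens after them. The sketch below isolates that slicing logic on plain Python lists so it can be checked without loading the model. The helper name `trim_prompt_tokens` and the token IDs are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence.

    Each output is assumed to start with its own prompt, so slicing off
    len(prompt) tokens leaves only the newly generated ones.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs standing in for real tensors (values are made up).
prompt_ids = [[101, 7, 8], [101, 9, 10]]
full_outputs = [[101, 7, 8, 42, 43], [101, 9, 10, 55, 56]]

print(trim_prompt_tokens(prompt_ids, full_outputs))  # [[42, 43], [55, 56]]
```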

---

## Limitations

- The Instruct version is not explicitly optimized for efficient long-chain reasoning.
- For tasks requiring structured or token-efficient reasoning, a dedicated Thinking or RL-aligned model is recommended.

---

## Citation

```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Bird, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```