---
license: mit
language:
  - en
  - zh
pipeline_tag: text-generation
---

# Innovator-VL-8B-Thinking

## Introduction

**Innovator-VL-8B-Thinking** is a multimodal, reasoning-oriented large language model designed for complex scientific problem solving. Built upon Innovator-VL-8B-Instruct, this model is further optimized for explicit multi-step reasoning, long-horizon chain-of-thought generation, and token-efficient scientific analysis.

The model is particularly suitable for scientific tasks that require structured reasoning over visual and textual evidence, such as mathematics, chemistry, materials science, and multimodal scientific benchmarks.

------------------------------------------------------------------------

## Model Overview

- **Model Type**: Vision-Language Reasoning Model
- **Parameter Size**: 8B
- **Base Language Model**: Qwen3-8B-Base
- **Vision Encoder**: RICE-ViT
- **Projector**: PatchMerger

The model supports native-resolution, multi-image inputs and is optimized for reasoning-intensive multimodal scenarios.

------------------------------------------------------------------------

## Key Characteristics

### Explicit Multimodal Reasoning

Innovator-VL-8B-Thinking is trained to generate explicit, structured reasoning traces, enabling the model to:

- Perform multi-step logical deduction grounded in visual evidence
- Solve complex mathematical and scientific problems
- Maintain reasoning consistency across long contexts

### Reinforcement Learning for Long-Horizon Reasoning

The model is further optimized with reinforcement learning to improve:

- Reasoning correctness
- Output consistency
- Token efficiency in long chain-of-thought generation

Sequence-level optimization enables strong accuracy while significantly reducing unnecessary reasoning tokens.
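As an illustration of sequence-level, token-efficiency-aware optimization, the sketch below computes a toy correctness-minus-length reward and group-relative advantages. The reward shape, token budget, and constants are illustrative assumptions for exposition only, not the released GSPO training code or its actual reward design.

```python
from statistics import mean, pstdev


def sequence_reward(correct: bool, n_tokens: int,
                    budget: int = 512, lam: float = 0.1) -> float:
    """Toy sequence-level reward: 1.0 for a correct answer, minus a
    penalty for tokens spent beyond a budget (constants are illustrative)."""
    over = max(0, n_tokens - budget)
    return (1.0 if correct else 0.0) - lam * over / budget


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and std of its sampling group (GRPO/GSPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Four sampled chains-of-thought for one prompt: correct-and-concise,
# correct-but-verbose, wrong-and-short, wrong-and-long.
rewards = [
    sequence_reward(True, 300),    # correct, under budget
    sequence_reward(True, 1024),   # correct but verbose
    sequence_reward(False, 200),   # wrong, short
    sequence_reward(False, 800),   # wrong and long
]
advs = group_advantages(rewards)
print(advs)  # the concise correct answer receives the largest advantage
```

Under such a reward, a correct but verbose trace earns less than a concise correct one, which is the intuition behind reducing unnecessary reasoning tokens while preserving accuracy.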
### Scientific Reasoning Performance

Compared to instruction-only models, Innovator-VL-8B-Thinking demonstrates substantial gains on:

- Multimodal mathematical reasoning benchmarks
- Scientific reasoning and domain-specific QA
- Tasks requiring precise step-by-step analysis

------------------------------------------------------------------------

## Model Architecture

- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters

The architecture is shared with the Instruct variant, while the optimization objective and training strategy differ at the post-training stage.

------------------------------------------------------------------------

## Training Pipeline

### Multimodal Pre-training

- Vision-language alignment with LLaVA-1.5 (558K)
- Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)

### Instruction Initialization

- Initialized from Innovator-VL-8B-Instruct
- Supervised fine-tuning with multimodal instruction and reasoning data

### Reinforcement Learning

- Trained with Innovator-VL-RL-172K
- Optimized using Group Sequence Policy Optimization (GSPO)
- Reward design jointly considers reasoning structure and answer correctness

------------------------------------------------------------------------

## Usage Recommendations

This model is recommended for:

- Multimodal mathematical reasoning
- Scientific problem solving requiring explicit reasoning
- Evaluation settings emphasizing chain-of-thought quality

For general instruction-following or latency-sensitive applications, the Instruct version is recommended.

------------------------------------------------------------------------

## Inference Example (Thinking Prompt)

Below is a minimal example to run multimodal inference (image + text) with a thinking-style prompt.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Thinking"

THINKING_PROMPT = (
    "Think and solve the following question step by step. "
    "Please put your thinking and analysis procedure within . "
    "Put ONLY your final answer within ."
)

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

question = "Describe this image."
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": f"{THINKING_PROMPT}\n\n{question}"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the model's device (works on CPU or GPU)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```

------------------------------------------------------------------------

## Citation

```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```