---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
---

# Innovator-VL-8B-Thinking

## Introduction

**Innovator-VL-8B-Thinking** is a reasoning-oriented multimodal large language model designed for complex scientific problem solving. Built upon Innovator-VL-8B-Instruct, it is further optimized for explicit multi-step reasoning, long-horizon chain-of-thought generation, and token-efficient scientific analysis.

The model is particularly well suited to scientific tasks that require structured reasoning over visual and textual evidence, such as mathematics, chemistry, materials science, and multimodal scientific benchmarks.
---

## Model Overview

- **Model Type**: Vision-Language Reasoning Model
- **Parameter Size**: 8B
- **Base Language Model**: Qwen3-8B-Base
- **Vision Encoder**: RICE-ViT
- **Projector**: PatchMerger

The model supports native-resolution, multi-image inputs and is optimized for reasoning-intensive multimodal scenarios.
---

## Key Characteristics

### Explicit Multimodal Reasoning

Innovator-VL-8B-Thinking is trained to explicitly generate structured reasoning traces, enabling the model to:

- Perform multi-step logical deduction grounded in visual evidence
- Solve complex mathematical and scientific problems
- Maintain reasoning consistency across long contexts

### Reinforcement Learning for Long-Horizon Reasoning

The model is further optimized with reinforcement learning to improve:

- Reasoning correctness
- Output consistency
- Token efficiency in long chain-of-thought generation

Sequence-level optimization preserves strong accuracy while significantly reducing unnecessary reasoning tokens.

### Scientific Reasoning Performance

Compared to instruction-only models, Innovator-VL-8B-Thinking demonstrates substantial gains on:

- Multimodal mathematical reasoning benchmarks
- Scientific reasoning and domain-specific QA
- Tasks requiring precise step-by-step analysis
---

## Model Architecture

<img src="assets/innovator_vl_architecture.png" width="600"/>

- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters

The architecture is shared with the Instruct variant; only the optimization objective and training strategy differ at the post-training stage.
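This card does not detail the projector, but PatchMerger-style modules in comparable vision-language models compress visual tokens by merging each 2×2 neighborhood of ViT outputs before projecting into the language model's hidden size. A minimal PyTorch sketch with illustrative (not actual) dimensions:

```python
import torch
import torch.nn as nn

class PatchMergerSketch(nn.Module):
    """Illustrative PatchMerger-style projector: concatenates each 2x2
    neighborhood of visual tokens and projects it to the LLM hidden size,
    reducing the visual token count 4x. Dimensions are assumptions."""
    def __init__(self, vit_dim=1024, llm_dim=4096, merge=2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Sequential(
            nn.LayerNorm(vit_dim * merge * merge),
            nn.Linear(vit_dim * merge * merge, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x, h, w):
        # x: (batch, h*w, vit_dim) grid of visual tokens
        b, _, d = x.shape
        x = x.view(b, h // self.merge, self.merge, w // self.merge, self.merge, d)
        # Group each 2x2 neighborhood, then flatten it into one token
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (h // self.merge) * (w // self.merge), -1
        )
        return self.proj(x)

tokens = torch.randn(1, 16 * 16, 1024)          # 256 visual tokens
merged = PatchMergerSketch()(tokens, 16, 16)
print(merged.shape)                              # 4x fewer tokens
```

The 4× reduction is what makes native-resolution, multi-image inputs affordable in the language model's context window.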
---

## Training Pipeline

### Multimodal Pre-training

- Vision-language alignment with LLaVA-1.5 (558K)
- Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)

### Instruction Initialization

- Initialized from Innovator-VL-8B-Instruct
- Supervised fine-tuning with multimodal instruction and reasoning data

### Reinforcement Learning

- Trained on Innovator-VL-RL-172K
- Optimized with Group Sequence Policy Optimization (GSPO)
- Reward design jointly considers reasoning structure and answer correctness
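The core idea behind GSPO-style training is to apply a clipped policy-gradient objective at the sequence level, using length-normalized importance ratios and group-normalized rewards. A toy sketch of such an objective follows; it illustrates the general technique only and is not the training code used for this model:

```python
import torch

def gspo_style_loss(logp_new, logp_old, lengths, rewards, clip_eps=0.2):
    """Sketch of a sequence-level clipped RL objective.
    logp_new/logp_old: summed token log-probs per sampled response, (G,)
    lengths: response lengths, (G,); rewards: scalar reward per response, (G,)."""
    # Group-normalized advantages: compare each response to its group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalized, sequence-level importance ratio (the key departure
    # from token-level PPO-style objectives)
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()

# Toy group of 4 sampled responses for one prompt
logp_old = torch.tensor([-40.0, -55.0, -38.0, -60.0])
logp_new = torch.tensor([-39.0, -56.0, -36.0, -61.0])
lengths = torch.tensor([20.0, 28.0, 19.0, 30.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])   # e.g., answer correctness
loss = gspo_style_loss(logp_new, logp_old, lengths, rewards)
print(float(loss))
```

Because the advantage is shared across a whole response, shorter correct responses are rewarded as much as longer ones, which is one route to the token efficiency described above.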
---

## Usage Recommendations

This model is recommended for:

- Multimodal mathematical reasoning
- Scientific problem solving requiring explicit reasoning
- Evaluation settings emphasizing chain-of-thought quality

For general instruction following or latency-sensitive applications, the Instruct version is recommended.
---

## Inference Example (Thinking Prompt)

Below is a minimal example of multimodal inference (image + text) with a thinking-style prompt.
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Thinking"

THINKING_PROMPT = (
    "Think and solve the following question step by step. "
    "Please put your thinking and analysis procedure within <think></think>. "
    "Put ONLY your final answer within <answer></answer>."
)

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

question = "Describe this image."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": f"{THINKING_PROMPT}\n\n{question}"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the same device as the model
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output
generated_ids = model.generate(**inputs, max_new_tokens=1024)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

print(output_text)
```
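Because the thinking prompt asks the model to wrap its reasoning in `<think></think>` and its final answer in `<answer></answer>`, the decoded text can be split into the two parts with a small helper. This is a sketch; the fallback behavior when a tag is missing is an assumption:

```python
import re

def parse_thinking_output(text):
    """Split a generation into (reasoning trace, final answer) using the
    <think></think> / <answer></answer> tags requested by the thinking
    prompt. Falls back to empty reasoning / raw text if a tag is absent."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else text.strip(),
    )

sample = "<think>2 + 2 = 4</think><answer>4</answer>"
print(parse_thinking_output(sample))  # ('2 + 2 = 4', '4')
```

For evaluation settings that score only the final answer, discard the first element of the returned tuple.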
---

## Citation

```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```