---
license: mit
language:
  - en
  - zh
pipeline_tag: text-generation
---

# Innovator-VL-8B-Thinking

## Introduction

**Innovator-VL-8B-Thinking** is a multimodal, reasoning-oriented large language model designed for complex scientific problem solving. Built upon Innovator-VL-8B-Instruct, this model is further optimized for explicit multi-step reasoning, long-horizon chain-of-thought generation, and token-efficient scientific analysis.

The model is particularly suitable for scientific tasks that require structured reasoning over visual and textual evidence, such as mathematics, chemistry, materials science, and multimodal scientific benchmarks.

------------------------------------------------------------------------

## Model Overview

- **Model Type**: Vision-Language Reasoning Model
- **Parameter Size**: 8B
- **Base Language Model**: Qwen3-8B-Base
- **Vision Encoder**: RICE-ViT
- **Projector**: PatchMerger

The model supports native-resolution, multi-image inputs and is optimized for reasoning-intensive multimodal scenarios.

------------------------------------------------------------------------

## Key Characteristics

### Explicit Multimodal Reasoning

Innovator-VL-8B-Thinking is trained to generate explicit, structured reasoning traces, enabling the model to:

- Perform multi-step logical deduction grounded in visual evidence
- Solve complex mathematical and scientific problems
- Maintain reasoning consistency across long contexts

### Reinforcement Learning for Long-Horizon Reasoning

The model is further optimized with reinforcement learning to improve:

- Reasoning correctness
- Output consistency
- Token efficiency in long chain-of-thought generation

Sequence-level optimization enables strong accuracy while significantly reducing unnecessary reasoning tokens.
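As an illustration of sequence-level, token-efficiency-aware optimization, the sketch below computes a toy correctness-minus-length reward and group-relative advantages. The reward shape, token budget, and constants are illustrative assumptions for exposition only, not the released GSPO training code or its actual reward design.

```python
from statistics import mean, pstdev


def sequence_reward(correct: bool, n_tokens: int,
                    budget: int = 512, lam: float = 0.1) -> float:
    """Toy sequence-level reward: 1.0 for a correct answer, minus a
    penalty for tokens spent beyond a budget (constants are illustrative)."""
    over = max(0, n_tokens - budget)
    return (1.0 if correct else 0.0) - lam * over / budget


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and std of its sampling group (GRPO/GSPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Four sampled chains-of-thought for one prompt: correct-and-concise,
# correct-but-verbose, wrong-and-short, wrong-and-long.
rewards = [
    sequence_reward(True, 300),    # correct, under budget
    sequence_reward(True, 1024),   # correct but verbose
    sequence_reward(False, 200),   # wrong, short
    sequence_reward(False, 800),   # wrong and long
]
advs = group_advantages(rewards)
print(advs)  # the concise correct answer receives the largest advantage
```

Under such a reward, a correct but verbose trace earns less than a concise correct one, which is the intuition behind reducing unnecessary reasoning tokens while preserving accuracy.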
### Scientific Reasoning Performance

Compared to instruction-only models, Innovator-VL-8B-Thinking demonstrates substantial gains on:

- Multimodal mathematical reasoning benchmarks
- Scientific reasoning and domain-specific QA
- Tasks requiring precise step-by-step analysis

------------------------------------------------------------------------

## Model Architecture

- **Vision Encoder**: RICE-ViT (region-aware visual representation)
- **Projector**: PatchMerger for visual token compression
- **Language Model**: Qwen3-8B-Base
- **Model Size**: 8B parameters

The architecture is shared with the Instruct variant, while the optimization objective and training strategy differ at the post-training stage.

------------------------------------------------------------------------

## Training Pipeline

### Multimodal Pre-training

- Vision-language alignment with LLaVA-1.5 (558K)
- Full-parameter mid-training using LLaVA-OneVision-1.5 (85M)

### Instruction Initialization

- Initialized from Innovator-VL-8B-Instruct
- Supervised fine-tuning with multimodal instruction and reasoning data

### Reinforcement Learning

- Trained with Innovator-VL-RL-172K
- Optimized using Group Sequence Policy Optimization (GSPO)
- Reward design jointly considers reasoning structure and answer correctness

------------------------------------------------------------------------

## Usage Recommendations

This model is recommended for:

- Multimodal mathematical reasoning
- Scientific problem solving requiring explicit reasoning
- Evaluation settings emphasizing chain-of-thought quality

For general instruction-following or latency-sensitive applications, the Instruct version is recommended.

------------------------------------------------------------------------

## Inference Example (Thinking Prompt)

Below is a minimal example to run multimodal inference (image + text) with a thinking-style prompt.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "InnovatorLab/Innovator-VL-8B-Thinking"

THINKING_PROMPT = (
    "Think and solve the following question step by step. "
    "Please put your thinking and analysis procedure within . "
    "Put ONLY your final answer within ."
)

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)

question = "Describe this image."
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": f"{THINKING_PROMPT}\n\n{question}"},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the model's device (works on CPU or GPU)
inputs = inputs.to(model.device)

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```

------------------------------------------------------------------------

## Citation

```bibtex
@article{wen2026innovator,
  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
  author={Wen, Zichen and Yang, Boxue and Chen, Shuang and Zhang, Yaojie and Han, Yuhang and Ke, Junlong and Wang, Cong and others},
  journal={arXiv preprint arXiv:2601.19325},
  year={2026}
}
```