LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning

LaViT is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.


πŸ“– Overview

LaViT (Latent Visual Thoughts) addresses a critical Perception Gap in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.

To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
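The card does not spell out the gating schedule; as a rough illustration only (the function name, linear schedule, and parameters below are all assumptions, not the paper's implementation), a curriculum sensory gate can be thought of as a probability of exposing the raw visual input that is annealed over training, forcing the student to reconstruct the teacher's visual thoughts rather than shortcut through language priors:

```python
def sensory_gate_prob(step: int, total_steps: int,
                      start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical curriculum gate: probability of keeping the raw visual
    input visible to the student at a given training step.

    Early on the gate stays open (the student sees the image while learning
    to reconstruct the teacher's visual semantics); it then closes linearly,
    so later in training the student must rely on its own latent visual
    thoughts instead of shortcutting.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


# Example schedule over a (hypothetical) 100-step curriculum:
# step 0 -> gate fully open (1.0), step 100 -> fully closed (0.0)
probs = [sensory_gate_prob(s, 100) for s in (0, 50, 100)]
```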

Key Features

  • 🎯 Visual Grounding: Answers grounded in the relevant image regions rather than language priors
  • 🧠 Multi-modal Reasoning: Improved performance on complex reasoning tasks
  • πŸ“Š Efficient: Compact 3B model that outperforms larger open-source variants
  • πŸš€ State-of-the-art: Achieves up to +16.9% gains on complex reasoning tasks

πŸ“„ Paper

Title: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung

Paper Link: arXiv:2601.10129

Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

πŸš€ Usage

Installation

pip install transformers torch pillow

Basic Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor. AutoModelForImageTextToText covers
# Qwen2.5-VL-style architectures; a recent transformers release is required.
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained("Svard/LaViT-3B", device_map="auto")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt so the processor inserts the image tokens
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is in this image?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate a response, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Advanced Usage with Visual Reasoning

For tasks requiring explicit visual reasoning, you can insert `<lvr>` (Latent Visual Reasoning) tokens to mark latent reasoning steps in the prompt:

prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
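For longer step lists, a small helper can assemble prompts in the same pattern. This helper (`build_lvr_prompt` is our own name, not part of the model's API) simply interleaves the `<lvr>` token between reasoning steps exactly as in the example above:

```python
def build_lvr_prompt(instruction: str, steps: list[str], token: str = "<lvr>") -> str:
    """Interleave latent-visual-reasoning tokens between reasoning steps,
    mirroring the prompt pattern shown above: each step is preceded by the
    token, and a trailing token closes the prompt."""
    parts = [instruction] + [f"{token} {s}" for s in steps] + [token]
    return " ".join(parts)


prompt = build_lvr_prompt(
    "Analyze this image step by step:",
    ["What objects are present?", "What are their spatial relationships?"],
)
```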

πŸ“Š Performance

LaViT-3B achieves significant improvements on various benchmarks:

  • MMVP: Enhanced performance on multi-modal visual perception tasks
  • BLINK: Improved results on visual reasoning benchmarks
  • Visual Grounding: Up to +16.9% gains on complex reasoning tasks

πŸ—οΈ Model Architecture

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Parameters: 3B
  • Training Method: Visual thought trajectory supervision
  • Key Innovation: Latent visual thought alignment with curriculum sensory gating
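The alignment objective itself is not reproduced on this card. As a rough, dependency-free illustration of what aligning latent visual trajectories (rather than static embeddings) could look like, the sketch below scores a student's per-step latent vectors against the teacher's with a cosine distance averaged over the trajectory; the function name and the cosine form are assumptions for illustration, not the paper's actual loss:

```python
import math


def trajectory_alignment_loss(student: list[list[float]],
                              teacher: list[list[float]]) -> float:
    """Mean (1 - cosine similarity) over corresponding positions of two
    latent visual trajectories. 0.0 means the trajectories point the same
    way at every step; larger values mean divergent visual thoughts."""
    assert len(student) == len(teacher), "trajectories must align step-wise"
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        ns = math.sqrt(sum(a * a for a in s)) or 1.0  # guard zero vectors
        nt = math.sqrt(sum(b * b for b in t)) or 1.0
        total += 1.0 - dot / (ns * nt)
    return total / max(len(student), 1)
```

A perfectly aligned pair of trajectories scores 0.0, while step vectors pointing in opposite directions score 2.0 per step; a trajectory-level objective like this penalizes *where* the student looks over time, not just what it finally embeds.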

πŸ“ Citation

If you find this model useful in your research, please cite:

@misc{wu2026lavitaligninglatentvisual,
      title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning}, 
      author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
      year={2026},
      eprint={2601.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10129}, 
}

πŸ“„ License

This model is licensed under the Apache-2.0 License.

πŸ™ Acknowledgments

This model is built upon Qwen2.5-VL and inspired by the LVR (Latent Visual Reasoning) framework. We thank the open-source community for their valuable contributions.
