LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning
LaViT is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.
Overview
LaViT (Latent Visual Thoughts) addresses a critical Perception Gap in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.
To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
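The card does not spell out how the curriculum sensory gating schedule works. As a purely illustrative sketch (the function names and the cosine schedule are our assumptions, not the paper's formulation), a gate that starts fully open and gradually closes, forcing the student to rely on its own reconstructed visual thoughts, could look like:

```python
import math

def sensory_gate(step: int, total_steps: int) -> float:
    """Curriculum gate: 1.0 at the start of training (teacher visual signal
    fully available) decaying to 0.0 on a cosine schedule (student must rely
    on its own latent visual thoughts). Illustrative only."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def gated_mix(teacher_feat, student_feat, step, total_steps):
    """Blend teacher and student visual features under the current gate."""
    g = sensory_gate(step, total_steps)
    return [g * t + (1.0 - g) * s for t, s in zip(teacher_feat, student_feat)]
```

Closing the gate over training is one common way to prevent the shortcut of copying the teacher's signal verbatim; the actual mechanism in LaViT may differ.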
Key Features
- Visual Grounding: Significantly enhanced visual grounding capabilities
- Multi-modal Reasoning: Improved performance on complex reasoning tasks
- Efficient: Compact 3B model that outperforms larger open-source variants
- State-of-the-art: Achieves up to +16.9% gains on complex reasoning tasks
Paper
Title: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Authors: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
Paper Link: arXiv:2601.10129
Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
Usage
Installation

```bash
pip install transformers torch pillow
```
Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor. The base model is Qwen2.5-VL, a vision-language
# model, so the image-text-to-text auto class is used rather than
# AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained("Svard/LaViT-3B")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare prompt
prompt = "What is in this image?"

# Process inputs
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Generate, then decode only the newly generated tokens
# (outputs include the prompt tokens as a prefix)
outputs = model.generate(**inputs, max_new_tokens=512)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(new_tokens, skip_special_tokens=True)
print(response)
```
Advanced Usage with Visual Reasoning
For tasks requiring visual reasoning, you can use the <lvr> (Latent Visual Reasoning) tokens:

```python
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
```
Performance
LaViT-3B achieves significant improvements on various benchmarks:
- MMVP: Enhanced performance on multi-modal visual perception tasks
- BLINK: Improved results on visual reasoning benchmarks
- Visual Grounding: Up to +16.9% gains on complex reasoning tasks
Model Architecture
- Base Model: Qwen2.5-VL-3B-Instruct
- Parameters: 3B
- Training Method: Visual thought trajectory supervision
- Key Innovation: Latent visual thought alignment with curriculum sensory gating
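To make "visual thought trajectory supervision" concrete, here is a deliberately simplified toy objective (our construction, not the paper's loss): a mean-squared-error alignment between the student's and teacher's attention maps at each reasoning step. The real objective operates on latent visual tokens inside the model; this sketch only conveys the trajectory-matching idea.

```python
def attention_alignment_loss(student_traj, teacher_traj):
    """Toy trajectory-alignment objective: mean squared error between the
    student's and teacher's attention maps, averaged over every cell of
    every step in the trajectory. Simplified stand-in for the paper's
    visual-thought supervision."""
    assert len(student_traj) == len(teacher_traj), "trajectories must align step-for-step"
    total, count = 0.0, 0
    for s_map, t_map in zip(student_traj, teacher_traj):
        for s, t in zip(s_map, t_map):
            total += (s - t) ** 2
            count += 1
    return total / count
```

A loss of zero means the student attends exactly where the teacher does at every step, which is the grounding property the Perception Gap analysis says plain text distillation fails to deliver.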
Citation
If you find this model useful in your research, please cite:

```bibtex
@misc{wu2026lavitaligninglatentvisual,
  title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
  author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
  year={2026},
  eprint={2601.10129},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.10129},
}
```
License
This model is licensed under the Apache-2.0 License.
Acknowledgments
This model is built upon Qwen2.5-VL and inspired by the LVR (Latent Visual Reasoning) framework. We thank the open-source community for their valuable contributions.
Related Links
- Paper: arXiv:2601.10129
- Code Repository: GitHub
- Base Model: Qwen2.5-VL-3B-Instruct