---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- vision-language
- multimodal
- reasoning
- visual-grounding
- computer-vision
---
# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning
**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.
[📄 Paper](https://arxiv.org/abs/2601.10129) [🤗 Model](https://huggingface.co/Svard/LaViT-3B) [💻 Code](https://github.com/Svardfox/LaViT)
## 📖 Overview
**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.
To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
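No training code accompanies this card, so the sketch below is only one plausible reading of the objective described above. Every tensor name (`student_attn`, `teacher_attn`, `student_vis`, `teacher_vis`) and the linear gating schedule are illustrative assumptions, not the authors' implementation:
```python
import torch
import torch.nn.functional as F

def visual_thought_alignment_loss(student_attn, teacher_attn, student_vis, teacher_vis):
    """Hypothetical sketch: align the student's per-step visual attention
    trajectory and reconstructed visual semantics with the teacher's.
    Assumed shapes: attn (steps, patches), vis (steps, hidden)."""
    # Match attention distributions over image patches at every reasoning step
    attn_loss = F.kl_div(
        student_attn.log_softmax(dim=-1),
        teacher_attn.softmax(dim=-1),
        reduction="batchmean",
    )
    # Pull the student's reconstructed visual semantics toward the teacher's
    sem_loss = 1.0 - F.cosine_similarity(student_vis, teacher_vis, dim=-1).mean()
    return attn_loss + sem_loss

def sensory_gate(visual_tokens, step, total_steps):
    """Made-up linear curriculum: mask the raw visual input with growing
    probability so the student cannot shortcut past its latent visual thoughts."""
    p_mask = min(1.0, step / total_steps)
    if torch.rand(()).item() < p_mask:
        return torch.zeros_like(visual_tokens)  # sensory channel closed
    return visual_tokens

# Toy shapes: 4 reasoning steps, 256 image patches, 2048-dim hidden states
loss = visual_thought_alignment_loss(
    torch.randn(4, 256), torch.randn(4, 256),
    torch.randn(4, 2048), torch.randn(4, 2048),
)
print(loss)
```
In this reading, the KL term supervises where the student looks at each latent step, while the gate gradually removes the raw pixels so grounded attention must come from the aligned latent thoughts rather than a visual shortcut.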
### Key Features
- 🎯 **Visual Grounding**: Attends to the answer-relevant image regions instead of falling back on language priors
- 🧠 **Multi-modal Reasoning**: Improved performance on complex reasoning tasks
- 📊 **Efficient**: Compact 3B model that outperforms larger open-source variants and proprietary models like GPT-4o
- 🚀 **State-of-the-art**: Up to +16.9% gains on complex reasoning tasks
## 📄 Paper
**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung
**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)
**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
## 🚀 Usage
### Installation
```bash
pip install transformers torch pillow accelerate
```
### Basic Usage
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor (LaViT keeps the Qwen2.5-VL architecture, so the
# image-text-to-text auto class resolves to the right model type)
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained(
    "Svard/LaViT-3B", torch_dtype="auto", device_map="auto"
)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Qwen2.5-VL expects chat-template formatted input with an image placeholder
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is in this image?"},
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process inputs
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# Generate response, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```
### Advanced Usage with Visual Reasoning
For tasks requiring visual reasoning, you can use the LVR (Latent Visual Reasoning) tokens in the prompt:
```python
prompt = "Analyze this image step by step: What objects are present? What are their spatial relationships?"
```
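To run this end to end, the prompt slots into the same chat-template flow as Basic Usage; the snippet below reuses the `processor`, `model`, and `image` objects defined there:
```python
# Continues from Basic Usage: reuses `processor`, `model`, `image`, and `prompt`
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},  # the step-by-step prompt above
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```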
## 📊 Performance
LaViT-3B achieves significant improvements on various benchmarks:
- **MMVP**: Enhanced fine-grained visual perception
- **BLINK**: Improved results on core visual perception and reasoning tasks
- **Complex Reasoning**: Up to +16.9% gains
## 🏗️ Model Architecture
- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Parameters**: 3B
- **Training Method**: Visual thought trajectory supervision
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating
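Because the checkpoint keeps the Qwen2.5-VL layout, it can also be loaded through the architecture-specific class rather than the auto class; this assumes the Hub config declares the stock Qwen2.5-VL architecture:
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Svard/LaViT-3B",
    torch_dtype="auto",   # keep the precision stored in the checkpoint
    device_map="auto",    # shard across available devices (requires accelerate)
)
```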
## 📝 Citation
If you find this model useful in your research, please cite:
```bibtex
@misc{wu2026lavitaligninglatentvisual,
      title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
      author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
      year={2026},
      eprint={2601.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10129},
}
```
## 📄 License
This model is licensed under the Apache-2.0 License.
## 🙏 Acknowledgments
This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions.
## 🔗 Related Links
- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT)
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)