---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- vision-language
- multimodal
- reasoning
- visual-grounding
- computer-vision
pipeline_tag: visual-question-answering
---

# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning

<div align="center">

**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.

[Paper](https://arxiv.org/abs/2601.10129) | [Model](https://huggingface.co/Svard/LaViT-3B)

</div>

## Overview

**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.

To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
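
This card does not spell the gating mechanism out further. As a rough, hypothetical illustration only, the gate can be read as a training-time schedule that gradually swaps the teacher's visual features for the student's own reconstructions; the linear schedule and tensor names below are assumptions, not the authors' implementation:

```python
import torch

def sensory_gate(step: int, total_steps: int) -> float:
    # Hypothetical linear curriculum: the gate starts fully open (teacher
    # visual features visible) and closes over training, so the student
    # cannot shortcut around forming its own visual thoughts.
    return max(0.0, 1.0 - step / total_steps)

def gated_visual_state(teacher_feats: torch.Tensor,
                       student_feats: torch.Tensor,
                       step: int, total_steps: int) -> torch.Tensor:
    # Convex mix of teacher and student latent visual states; the schedule
    # controls how much ground-truth perception the student still receives.
    g = sensory_gate(step, total_steps)
    return g * teacher_feats + (1.0 - g) * student_feats
```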

### Key Features

- **Visual Grounding**: Significantly enhanced visual grounding capabilities
- **Multi-modal Reasoning**: Improved performance on complex reasoning tasks
- **Efficient**: A compact 3B model that outperforms larger open-source variants
- **State-of-the-art**: Achieves up to +16.9% gains on complex reasoning tasks

## Paper

**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung

**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)

**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

## Usage

### Installation

```bash
pip install transformers torch pillow
```

### Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor. LaViT-3B follows the Qwen2.5-VL architecture,
# so it loads through the image-text-to-text auto class rather than
# AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained("Svard/LaViT-3B", torch_dtype="auto")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt so the processor inserts the image tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process inputs
inputs = processor(images=image, text=text, return_tensors="pt")

# Generate a response, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### Advanced Usage with Visual Reasoning

For tasks requiring visual reasoning, you can use the `<lvr>` (Latent Visual Reasoning) tokens:

```python
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
```
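
If you are following the chat-template flow from Basic Usage, this prompt simply becomes the text part of the user turn; pairing it this way is an assumption based on the snippets above, not a documented recipe:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},  # the <lvr>-annotated prompt above
        ],
    }
]
```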

## Performance

LaViT-3B achieves significant improvements on various benchmarks:

- **MMVP**: Enhanced performance on multi-modal visual perception tasks
- **BLINK**: Improved results on visual reasoning benchmarks
- **Visual Grounding**: Up to +16.9% gains on complex reasoning tasks

## Model Architecture

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Parameters**: 3B
- **Training Method**: Visual thought trajectory supervision (see the illustrative sketch after this list)
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating
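
The trajectory supervision itself is not specified in this card. One plausible reading, sketched below, aligns the student's per-step attention over image patches with the teacher's via a KL term; the loss choice, shapes, and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def trajectory_alignment_loss(student_attn: torch.Tensor,
                              teacher_attn: torch.Tensor) -> torch.Tensor:
    # student_attn, teacher_attn: (num_steps, num_patches) unnormalized
    # attention scores over image patches, one row per latent reasoning step.
    # Matching them step by step aligns the attention trajectory itself,
    # not just a static summary embedding.
    log_s = F.log_softmax(student_attn, dim=-1)
    t = F.softmax(teacher_attn, dim=-1)
    return F.kl_div(log_s, t, reduction="batchmean")
```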

## Citation

If you find this model useful in your research, please cite:

```bibtex
@misc{wu2026lavitaligninglatentvisual,
      title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
      author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
      year={2026},
      eprint={2601.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10129},
}
```

## License

This model is licensed under the Apache-2.0 License.

## Acknowledgments

This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions.

## Related Links

- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT)
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)