|
|
--- |
|
|
base_model: Qwen/Qwen2.5-VL-3B-Instruct |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- reasoning |
|
|
- visual-grounding |
|
|
- computer-vision |
|
|
--- |
|
|
|
|
|
# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning |
|
|
|
|
|
<div align="center">
|
|
|
|
|
**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning. |
|
|
|
|
|
[Paper](https://arxiv.org/abs/2601.10129)


[Model](https://huggingface.co/Svard/LaViT-3B)


[Code](https://github.com/Svardfox/LaViT)
|
|
|
|
|
</div> |
|
|
|
|
|
## Overview
|
|
|
|
|
**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. |
|
|
|
|
|
To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. |
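
The full training objective is specified in the paper and the GitHub repository. Purely as an illustration of the idea, the following PyTorch sketch shows one plausible form of the two alignment terms; the tensor shapes, the cosine/KL choices, and the function name are assumptions made here, not the released implementation.

```python
import torch
import torch.nn.functional as F


def latent_visual_alignment_loss(student_states: torch.Tensor,
                                 teacher_states: torch.Tensor,
                                 student_attn: torch.Tensor,
                                 teacher_attn: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only, not the paper's exact objective.

    student_states / teacher_states: (batch, steps, hidden) latent visual
        semantics produced before any text is generated.
    student_attn / teacher_attn: (batch, steps, num_patches) attention
        logits over image patches at each latent reasoning step.
    """
    # Align visual semantics: per-step cosine distance, averaged.
    semantic_loss = (1.0 - F.cosine_similarity(
        student_states, teacher_states, dim=-1)).mean()

    # Align attention trajectories: KL from student to teacher patch
    # distributions at each step.
    attn_loss = F.kl_div(
        F.log_softmax(student_attn, dim=-1),
        F.softmax(teacher_attn, dim=-1),
        reduction="batchmean",
    )
    return semantic_loss + attn_loss
```

In a distillation setup, a loss of this kind would presumably be added to the usual next-token cross-entropy rather than replace it.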
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Visual Grounding**: Significantly enhanced visual grounding capabilities


- **Multi-modal Reasoning**: Improved performance on complex reasoning tasks


- **Efficient**: Compact 3B model that outperforms larger open-source variants


- **State-of-the-art**: Achieves up to +16.9% gains on complex reasoning tasks
|
|
|
|
|
## Paper
|
|
|
|
|
**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning |
|
|
|
|
|
**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung |
|
|
|
|
|
**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129) |
|
|
|
|
|
**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o. |
|
|
|
|
|
## Usage
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch pillow |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Svard/LaViT-3B", torch_dtype="auto", device_map="auto"
)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt so the processor inserts the image tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Process inputs
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
|
|
``` |
|
|
|
|
|
### Advanced Usage with Visual Reasoning |
|
|
|
|
|
For tasks requiring visual reasoning, you can use the `<lvr>` (Latent Visual Reasoning) tokens: |
|
|
|
|
|
```python |
|
|
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>" |
|
|
``` |
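
The `<lvr>` markers are handled by the model's tokenizer. If you assemble such prompts programmatically, the snippet below is a generic transformers sanity check (not LaViT-specific) that `<lvr>` is registered as a single token rather than being split into subwords:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
tokenizer = processor.tokenizer

# A registered special/added token tokenizes to exactly one piece;
# an unregistered string would fall apart into ordinary subword tokens.
print(tokenizer.tokenize("<lvr>"))
print(tokenizer.convert_tokens_to_ids("<lvr>"))
```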
|
|
|
|
|
## Performance
|
|
|
|
|
LaViT-3B achieves significant improvements on various benchmarks: |
|
|
|
|
|
- **MMVP**: Enhanced performance on multi-modal visual perception tasks |
|
|
- **BLINK**: Improved results on visual reasoning benchmarks |
|
|
- **Visual Grounding**: Up to +16.9% gains on complex reasoning tasks |
|
|
|
|
|
## Model Architecture
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-3B-Instruct |
|
|
- **Parameters**: 3B |
|
|
- **Training Method**: Visual thought trajectory supervision |
|
|
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating (see the sketch below)
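
The gating mechanism itself is described in the paper, and its exact schedule is not reproduced here. As a rough sketch of what curriculum sensory gating could look like, the following assumes a linear curriculum over training steps and a per-sample Bernoulli mask on the visual embeddings; both are assumptions made for illustration only.

```python
import torch


def curriculum_sensory_gate(image_embeds: torch.Tensor,
                            step: int, total_steps: int) -> torch.Tensor:
    """Illustrative sketch only: progressively gate the raw visual input so
    the student must rely on its reconstructed latent visual thoughts
    instead of shortcutting to the pixels. The linear ramp and per-sample
    Bernoulli mask are assumptions, not the paper's schedule."""
    gate_prob = min(1.0, step / max(1, total_steps))  # assumed linear curriculum
    keep = (torch.rand(image_embeds.size(0), device=image_embeds.device)
            >= gate_prob).to(image_embeds.dtype)
    # image_embeds assumed to be (batch, num_patches, hidden)
    return image_embeds * keep.view(-1, 1, 1)
```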
|
|
|
|
|
## Citation
|
|
|
|
|
If you find this model useful in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{wu2026lavitaligninglatentvisual, |
|
|
title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning}, |
|
|
author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung}, |
|
|
year={2026}, |
|
|
eprint={2601.10129}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2601.10129}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## License
|
|
|
|
|
This model is licensed under the Apache-2.0 License. |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions. |
|
|
|
|
|
## Related Links
|
|
|
|
|
- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129) |
|
|
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT) |
|
|
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |