|
|
--- |
|
|
base_model: Qwen/Qwen2.5-VL-3B-Instruct |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
library_name: transformers |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- reasoning |
|
|
- visual-grounding |
|
|
- computer-vision |
|
|
--- |
|
|
|
|
|
# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning |
|
|
|
|
|
<div align="center">
|
|
|
|
|
**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning. |
|
|
|
|
|
[Paper](https://arxiv.org/abs/2601.10129)


[Model](https://huggingface.co/Svard/LaViT-3B)


[Code](https://github.com/Svardfox/LaViT)
|
|
|
|
|
</div> |
|
|
|
|
|
## Overview
|
|
|
|
|
**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. |
|
|
|
|
|
To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. |
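
The full training objective is specified in the paper and the GitHub repository. Purely as an illustration of the idea, the following PyTorch sketch shows one plausible form of the two alignment terms; the tensor shapes, the cosine/KL choices, and the function name are assumptions made here, not the released implementation.

```python
import torch
import torch.nn.functional as F


def latent_visual_alignment_loss(student_states: torch.Tensor,
                                 teacher_states: torch.Tensor,
                                 student_attn: torch.Tensor,
                                 teacher_attn: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch only, not the paper's exact objective.

    student_states / teacher_states: (batch, steps, hidden) latent visual
        semantics produced before any text is generated.
    student_attn / teacher_attn: (batch, steps, num_patches) attention
        logits over image patches at each latent reasoning step.
    """
    # Align visual semantics: per-step cosine distance, averaged.
    semantic_loss = (1.0 - F.cosine_similarity(
        student_states, teacher_states, dim=-1)).mean()

    # Align attention trajectories: KL from student to teacher patch
    # distributions at each step.
    attn_loss = F.kl_div(
        F.log_softmax(student_attn, dim=-1),
        F.softmax(teacher_attn, dim=-1),
        reduction="batchmean",
    )
    return semantic_loss + attn_loss
```

In a distillation setup, a loss of this kind would presumably be added to the usual next-token cross-entropy rather than replace it.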
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Visual Grounding**: Significantly enhanced visual grounding capabilities


- **Multi-modal Reasoning**: Improved performance on complex reasoning tasks


- **Efficient**: Compact 3B model that outperforms larger open-source variants


- **State-of-the-art**: Achieves up to +16.9% gains on complex reasoning tasks
|
|
|
|
|
## Paper
|
|
|
|
|
**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning |
|
|
|
|
|
**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung |
|
|
|
|
|
**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129) |
|
|
|
|
|
**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o. |
|
|
|
|
|
## Usage
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch pillow |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Svard/LaViT-3B", torch_dtype="auto", device_map="auto"
)

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt so the processor inserts the image tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Process inputs
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
|
|
``` |
|
|
|
|
|
### Advanced Usage with Visual Reasoning |
|
|
|
|
|
For tasks requiring visual reasoning, you can use the `<lvr>` (Latent Visual Reasoning) tokens: |
|
|
|
|
|
```python |
|
|
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>" |
|
|
``` |
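
The `<lvr>` markers are handled by the model's tokenizer. If you assemble such prompts programmatically, the snippet below is a generic transformers sanity check (not LaViT-specific) that `<lvr>` is registered as a single token rather than being split into subwords:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
tokenizer = processor.tokenizer

# A registered special/added token tokenizes to exactly one piece;
# an unregistered string would fall apart into ordinary subword tokens.
print(tokenizer.tokenize("<lvr>"))
print(tokenizer.convert_tokens_to_ids("<lvr>"))
```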
|
|
|
|
|
## Performance
|
|
|
|
|
LaViT-3B achieves significant improvements on various benchmarks: |
|
|
|
|
|
- **MMVP**: Enhanced performance on multi-modal visual perception tasks |
|
|
- **BLINK**: Improved results on visual reasoning benchmarks |
|
|
- **Visual Grounding**: Up to +16.9% gains on complex reasoning tasks |
|
|
|
|
|
## Model Architecture
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-3B-Instruct |
|
|
- **Parameters**: 3B |
|
|
- **Training Method**: Visual thought trajectory supervision |
|
|
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating (see the sketch below)
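
The gating mechanism itself is described in the paper, and its exact schedule is not reproduced here. As a rough sketch of what curriculum sensory gating could look like, the following assumes a linear curriculum over training steps and a per-sample Bernoulli mask on the visual embeddings; both are assumptions made for illustration only.

```python
import torch


def curriculum_sensory_gate(image_embeds: torch.Tensor,
                            step: int, total_steps: int) -> torch.Tensor:
    """Illustrative sketch only: progressively gate the raw visual input so
    the student must rely on its reconstructed latent visual thoughts
    instead of shortcutting to the pixels. The linear ramp and per-sample
    Bernoulli mask are assumptions, not the paper's schedule."""
    gate_prob = min(1.0, step / max(1, total_steps))  # assumed linear curriculum
    keep = (torch.rand(image_embeds.size(0), device=image_embeds.device)
            >= gate_prob).to(image_embeds.dtype)
    # image_embeds assumed to be (batch, num_patches, hidden)
    return image_embeds * keep.view(-1, 1, 1)
```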
|
|
|
|
|
## Citation
|
|
|
|
|
If you find this model useful in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{wu2026lavitaligninglatentvisual, |
|
|
title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning}, |
|
|
author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung}, |
|
|
year={2026}, |
|
|
eprint={2601.10129}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2601.10129}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## License
|
|
|
|
|
This model is licensed under the Apache-2.0 License. |
|
|
|
|
|
## Acknowledgments
|
|
|
|
|
This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions. |
|
|
|
|
|
## Related Links
|
|
|
|
|
- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129) |
|
|
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT) |
|
|
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |