---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- vision-language
- multimodal
- reasoning
- visual-grounding
- computer-vision
pipeline_tag: visual-question-answering
---

# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning

<div align="center">

**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.

[Paper](https://arxiv.org/abs/2601.10129) | [Model](https://huggingface.co/Svard/LaViT-3B)

</div>

## Overview

**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.

To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
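
This card does not spell the gating mechanism out further. As a rough, hypothetical illustration only, the gate can be read as a training-time schedule that gradually swaps the teacher's visual features for the student's own reconstructions; the linear schedule and tensor names below are assumptions, not the authors' implementation:

```python
import torch

def sensory_gate(step: int, total_steps: int) -> float:
    # Hypothetical linear curriculum: the gate starts fully open (teacher
    # visual features visible) and closes over training, so the student
    # cannot shortcut around forming its own visual thoughts.
    return max(0.0, 1.0 - step / total_steps)

def gated_visual_state(teacher_feats: torch.Tensor,
                       student_feats: torch.Tensor,
                       step: int, total_steps: int) -> torch.Tensor:
    # Convex mix of teacher and student latent visual states; the schedule
    # controls how much ground-truth perception the student still receives.
    g = sensory_gate(step, total_steps)
    return g * teacher_feats + (1.0 - g) * student_feats
```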

### Key Features

- **Visual Grounding**: Significantly enhanced visual grounding capabilities
- **Multi-modal Reasoning**: Improved performance on complex reasoning tasks
- **Efficient**: A compact 3B model that outperforms larger open-source variants
- **State-of-the-art**: Achieves up to +16.9% gains on complex reasoning tasks

## Paper

**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung

**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)

**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

## Usage

### Installation

```bash
pip install transformers torch pillow
```

### Basic Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests

# Load model and processor. LaViT-3B follows the Qwen2.5-VL architecture,
# so it loads through the image-text-to-text auto class rather than
# AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = AutoModelForImageTextToText.from_pretrained("Svard/LaViT-3B", torch_dtype="auto")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt so the processor inserts the image tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process inputs
inputs = processor(images=image, text=text, return_tensors="pt")

# Generate a response, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

### Advanced Usage with Visual Reasoning

For tasks requiring visual reasoning, you can use the `<lvr>` (Latent Visual Reasoning) tokens:

```python
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
```
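
If you are following the chat-template flow from Basic Usage, this prompt simply becomes the text part of the user turn; pairing it this way is an assumption based on the snippets above, not a documented recipe:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},  # the <lvr>-annotated prompt above
        ],
    }
]
```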

## Performance

LaViT-3B achieves significant improvements on various benchmarks:

- **MMVP**: Enhanced performance on multi-modal visual perception tasks
- **BLINK**: Improved results on visual reasoning benchmarks
- **Visual Grounding**: Up to +16.9% gains on complex reasoning tasks

## Model Architecture

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Parameters**: 3B
- **Training Method**: Visual thought trajectory supervision (see the illustrative sketch after this list)
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating
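
The trajectory supervision itself is not specified in this card. One plausible reading, sketched below, aligns the student's per-step attention over image patches with the teacher's via a KL term; the loss choice, shapes, and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def trajectory_alignment_loss(student_attn: torch.Tensor,
                              teacher_attn: torch.Tensor) -> torch.Tensor:
    # student_attn, teacher_attn: (num_steps, num_patches) unnormalized
    # attention scores over image patches, one row per latent reasoning step.
    # Matching them step by step aligns the attention trajectory itself,
    # not just a static summary embedding.
    log_s = F.log_softmax(student_attn, dim=-1)
    t = F.softmax(teacher_attn, dim=-1)
    return F.kl_div(log_s, t, reduction="batchmean")
```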

## Citation

If you find this model useful in your research, please cite:

```bibtex
@misc{wu2026lavitaligninglatentvisual,
      title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
      author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
      year={2026},
      eprint={2601.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.10129},
}
```

## License

This model is licensed under the Apache-2.0 License.

## Acknowledgments

This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions.

## Related Links

- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT)
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)