Svard committed on
Commit 9b25270 · verified · 1 Parent(s): 3f4a69a

Upload README.md with huggingface_hub

Files changed (1): README.md (+133, -3)

README.md CHANGED

---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- vision-language
- multimodal
- reasoning
- visual-grounding
- computer-vision
pipeline_tag: visual-question-answering
---

# LaViT-3B: Aligning Latent Visual Thoughts for Multi-modal Reasoning

<div align="center">

**LaViT** is a vision-language model that aligns latent visual thoughts for enhanced multi-modal reasoning.

[![Paper](https://img.shields.io/badge/Paper-arXiv:2601.10129-b31b1b.svg)](https://arxiv.org/abs/2601.10129)
[![Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-yellow.svg)](https://huggingface.co/Svard/LaViT-3B)

</div>

## 📖 Overview

**LaViT** (Latent Visual Thoughts) addresses a critical **Perception Gap** in multimodal latent reasoning: student models often mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception.

To bridge this gap, LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning.
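
To give a concrete sense of what aligning latent visual thoughts could look like as a training signal, here is a minimal PyTorch sketch. It is illustrative only and not the released training code: the function name, the MSE/KL formulation, and the `gate_warmup_steps` schedule are assumptions made for exposition.

```python
import torch.nn.functional as F

def lavit_style_loss(student_visual, teacher_visual,
                     student_attn, teacher_attn,
                     text_logits, text_labels,
                     step, gate_warmup_steps=10_000):
    """Illustrative composite objective: align the student's latent visual
    semantics and attention trajectory with the teacher's, on top of the
    usual next-token loss, with a curriculum gate on the visual terms."""
    # 1) Reconstruct the teacher's visual semantics (MSE over latent states).
    semantic_loss = F.mse_loss(student_visual, teacher_visual)

    # 2) Match the teacher's attention trajectory over image regions
    #    (KL divergence between per-step attention distributions).
    attn_loss = F.kl_div(student_attn.clamp_min(1e-8).log(), teacher_attn,
                         reduction="batchmean")

    # 3) Standard next-token cross-entropy on the answer text.
    text_loss = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                                ignore_index=-100)

    # 4) Curriculum gate: phase the visual-alignment terms in gradually so the
    #    student cannot shortcut straight to imitating the teacher's text.
    gate = min(1.0, step / gate_warmup_steps)
    return text_loss + gate * (semantic_loss + attn_loss)
```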

### Key Features

- 🎯 **Visual Grounding**: Significantly enhanced visual grounding capabilities
- 🧠 **Multi-modal Reasoning**: Improved performance on complex reasoning tasks
- 📊 **Efficient**: Compact 3B model that outperforms larger open-source variants
- 🚀 **State-of-the-art**: Achieves up to +16.9% gains on complex reasoning tasks

## 📄 Paper

**Title**: LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

**Authors**: Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung

**Paper Link**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)

**Abstract**: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.

## 🚀 Usage

### Installation

```bash
pip install transformers torch pillow
```
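
Note: Qwen2.5-VL-based checkpoints need a reasonably recent `transformers` release; if loading fails with an unrecognized model type, upgrading with `pip install -U transformers` should resolve it.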

### Basic Usage

LaViT-3B builds on Qwen2.5-VL-3B-Instruct, so inputs are prepared with the processor's chat template:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load model and processor (LaViT-3B uses the Qwen2.5-VL architecture)
processor = AutoProcessor.from_pretrained("Svard/LaViT-3B")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Svard/LaViT-3B")

# Load image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the prompt with the chat template so the image tokens are inserted
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process inputs
inputs = processor(images=[image], text=[text], return_tensors="pt")

# Generate response (decode only the newly generated tokens)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
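
On a GPU, the model can also be loaded in its native precision and placed on the available device(s) with the standard `transformers` loading options (note that `device_map="auto"` additionally requires the `accelerate` package, which is not in the install command above):

```python
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Svard/LaViT-3B",
    torch_dtype="auto",   # keep the checkpoint's native precision (e.g. bfloat16)
    device_map="auto",    # place weights on the available GPU(s); requires `accelerate`
)
```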

### Advanced Usage with Visual Reasoning

For tasks requiring visual reasoning, you can use the `<lvr>` (Latent Visual Reasoning) tokens:

```python
prompt = "Analyze this image step by step: <lvr> What objects are present? <lvr> What are their spatial relationships? <lvr>"
```
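
To actually run such a prompt, it goes through the same chat-template flow as the Basic Usage example; the sketch below assumes the `model`, `processor`, `image`, and `prompt` objects defined above are already in scope.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},  # the <lvr> prompt defined above
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], text=[text], return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```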

## 📊 Performance

LaViT-3B achieves significant improvements on various benchmarks:

- **MMVP**: Enhanced performance on multi-modal visual perception tasks
- **BLINK**: Improved results on visual reasoning benchmarks
- **Visual Grounding**: Up to +16.9% gains on complex reasoning tasks

## 🏗️ Model Architecture

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Parameters**: 3B
- **Training Method**: Visual thought trajectory supervision
- **Key Innovation**: Latent visual thought alignment with curriculum sensory gating

## 📝 Citation

If you find this model useful in your research, please cite:

```bibtex
@misc{wu2026lavitaligninglatentvisual,
  title={LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning},
  author={Linquan Wu and Tianxiang Jiang and Yifei Dong and Haoyu Yang and Fengji Zhang and Shichaang Meng and Ai Xuan and Linqi Song and Jacky Keung},
  year={2026},
  eprint={2601.10129},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.10129},
}
```

## 📄 License

This model is licensed under the Apache-2.0 License.

## 🙏 Acknowledgments

This model is built upon [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL) and inspired by the [LVR (Latent Visual Reasoning)](https://github.com/VincentLeebang/lvr) framework. We thank the open-source community for their valuable contributions.

## 🔗 Related Links

- **Paper**: [arXiv:2601.10129](https://arxiv.org/abs/2601.10129)
- **Code Repository**: [GitHub](https://github.com/Svardfox/LaViT)
- **Base Model**: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)