HongyuanTao committed on
Commit
4d9f42b
·
verified ·
1 Parent(s): 4611700

Update README.md

Files changed (1)
  1. README.md +246 -17
README.md CHANGED
@@ -13,32 +13,261 @@ pipeline_tag: image-text-to-text
13
 
14
  <div align="center">
15
 
16
- # InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input VLMs
 
17
 
18
- <a href="https://arxiv.org/abs/YOUR_ARXIV_ID"><img src="https://img.shields.io/badge/Paper-ArXiv-b31b1b.svg" alt="Paper"></a>
19
- <a href="https://github.com/YOUR_USERNAME/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Code-black" alt="Code"></a>
20
- <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"></a>
21

22
  </div>
23
 
24
- ## 📖 Introduction
25
 
26
- **InfiniteVL** is a linear-complexity Vision-Language Model (VLM) developed by **Huazhong University of Science and Technology (HUST)** and **Horizon Robotics**.
27
 
28
- Traditional Transformer-based VLMs suffer from quadratic computational complexity ($O(N^2)$) and growing KV-cache memory usage. **InfiniteVL** solves this by synergizing **Sliding Window Attention (SWA)** with **Gated DeltaNet**, enabling **unlimited input tokens** and **real-time streaming**.
29
 
30
- ### Key Features
31
- * **🚀 Linear Complexity ($O(N)$):** Reduces per-token latency by **3.6×** compared to Qwen2.5-VL-3B.
32
- * **📉 Constant Memory:** Maintains a fixed GPU memory usage (~9GB) regardless of sequence length.
33
- * **⚡ Real-Time Streaming:** Sustains a stable **24 FPS** throughput for long video understanding on a single RTX 4090.
34
- * **🧠 Hybrid Architecture:** 75% Gated DeltaNet (Global Context) + 25% SWA (Local Detail).
35
 
36
- ![Performance Comparison](teaser.png)
37
 
38
- ## 🛠️ Requirements
39
 
40
- To use InfiniteVL, you need to install the linear attention kernels.
41
 
 
42
  ```bash
43
- pip install transformers torch
44
- pip install fla # Flash Linear Attention
13
 
14
  <div align="center">
15
 
16
+ <!-- Place your logo here; delete this line if you have no logo -->
17
+ <img src="assets/Logo.png" width="500" alt="InfiniteVL Logo">
18
 
19
+ <hr>
 
 
20
 
21
+ ### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
22
+
23
+ <!-- Author list -->
24
+ Hongyuan Tao<sup>1</sup>,
25
+ [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
26
+ [Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
27
+ Haoran Yin<sup>2</sup>,
28
+ [Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
29
+ [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
30
+ [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>
31
+
32
+ <!-- Affiliations -->
33
+ <sup>1</sup>Huazhong University of Science and Technology,
34
+ <sup>2</sup>Horizon Robotics
35
+
36
+ <!-- Footnote / corresponding author information -->
37
+ (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
38
+
39
+ <!-- Buttons / badges go here -->
40
+ <br>
41
+ <a href="https://arxiv.org/abs/2502.xxxxx"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
42
+ <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
43
+
44
+ </div>
45
+
46
+ ## Introduction
47
+
48
+ **InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.
49
+
50
+
51
+ By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception and **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.
52
+
53
+ <div align="center">
54
+ <img src="assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
55
  </div>
56
 
57
+ ### Key Highlights
58
+ * 🚀 **High Efficiency:** Achieves a **>3.6×** inference speedup with a constant memory footprint, compared to FlashAttention-2-accelerated Transformers.
+ * ⚡ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
+ * 🧠 **Unlimited Context:** Retains context over extremely long sequences (tested beyond 500K tokens) without out-of-memory (OOM) errors.
+ * 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) in comprehensive evaluations.
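
To make the constant-memory claim concrete, the sketch below compares how a full-attention KV cache grows with sequence length against a sliding-window cache, which stops growing once the window is full. All dimensions (`layers`, `kv_heads`, `head_dim`, `WINDOW`) are illustrative assumptions, not InfiniteVL's actual configuration, and the fixed-size Gated DeltaNet state is not modeled here.

```python
def kv_cache_bytes(cached_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `cached_tokens` tokens."""
    # The factor 2 covers keys and values; bytes_per_value=2 corresponds to bfloat16.
    return 2 * cached_tokens * layers * kv_heads * head_dim * bytes_per_value


def full_attention_cache(seq_len, **dims):
    """Full attention caches every past token, so memory grows linearly."""
    return kv_cache_bytes(seq_len, **dims)


def sliding_window_cache(seq_len, window, **dims):
    """Sliding-window attention caches at most `window` tokens."""
    return kv_cache_bytes(min(seq_len, window), **dims)


# Hypothetical dimensions, chosen only for illustration.
DIMS = dict(layers=36, kv_heads=2, head_dim=128)
WINDOW = 8192

for n in (8_000, 64_000, 512_000):
    full = full_attention_cache(n, **DIMS) / 2**20
    swa = sliding_window_cache(n, WINDOW, **DIMS) / 2**20
    print(f"{n:>7} tokens: full = {full:8.1f} MiB, sliding window = {swa:6.1f} MiB")
```

Under these assumptions the full-attention column keeps growing with the input, while the sliding-window column is flat once the sequence exceeds the window.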
62
+
63
+ ## Model Zoo
64
+
65
+ We release two versions of InfiniteVL-4B to cater to different application scenarios.
66
+
67
+ | Model | Stage | Description | Training Context Length | Download |
+ | :--- | :---: | :--- | :---: | :---: |
+ | **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint taken directly after instruction SFT. It delivers the **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
+ | **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned on only a **small amount** of long-sequence multimodal data. It activates length generalization for streaming scenarios, though its potential on extreme context lengths is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |
71
 
 
72
 
73
+ > **💡 Recommendations:**
74
+ >
75
+ > * **For Long-Context Inference:** Please use the **Stage 3** model. It enables stable streaming inference and avoids memory explosion.
76
+ > * **For Training / Fine-tuning:** We strongly recommend using the **Stage 2** model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.
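
The recommendations above can be captured in a small lookup. This is only a sketch: the repo IDs come from the Model Zoo table, while `pick_checkpoint` and its use-case labels are hypothetical names introduced for illustration.

```python
# Repo IDs are taken from the Model Zoo table above.
CHECKPOINTS = {
    "stage2": "hustvl/InfiniteVL",
    "stage3": "hustvl/InfiniteVL-LongSFT",
}


def pick_checkpoint(use_case):
    """Map a use case to a repo ID, following the recommendations above.

    The `use_case` labels are hypothetical and exist only in this sketch.
    """
    if use_case in ("long_context", "streaming"):
        return CHECKPOINTS["stage3"]
    if use_case in ("general", "finetune"):
        return CHECKPOINTS["stage2"]
    raise ValueError(f"unknown use case: {use_case!r}")


print(pick_checkpoint("streaming"))
```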
77
 
78
+ ## Getting Started
 
 
 
 
79
 
80
+ ### 🛠️ Environment Setup
81
 
82
+ We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.
83
 
84
+ **1. Create and activate a virtual environment:**
85
+ ```bash
+ conda create -n infinitevl python=3.11 -y
+ conda activate infinitevl
+ ```
89
+ **2. Install dependencies:**
90
 
91
+ The core dependencies are listed below:
92
  ```bash
+ # --- Core Deep Learning ---
+ torch==2.6.0
+ torchvision==0.21.0
+ torchaudio==2.6.0
+ transformers==4.57.0
+ accelerate==1.8.1
+
+ # --- Vision & Multimodal ---
+ qwen-vl-utils==0.0.11
+ decord==0.6.0
+ opencv-python==4.11.0.86
+ pillow==10.4.0
+ timm==1.0.22
+ einops==0.8.1
+
+ # --- Linear Attention & Kernels (Critical) ---
+ # Note: These often require specific CUDA environments to build
+ flash-attn==2.7.4.post1
+ flash-linear-attention==0.4.0
+ fla-core==0.4.0
+ causal-conv1d==1.5.0.post5
+ triton==3.2.0
+ ```
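
After installing, it can help to confirm that the environment matches the pins. The following is a minimal sketch using the standard-library `importlib.metadata`; `EXPECTED` holds only a subset of the versions listed above, and `check_versions` is a hypothetical helper, not part of InfiniteVL.

```python
from importlib import metadata

# A subset of the pins listed above; extend as needed.
EXPECTED = {
    "torch": "2.6.0",
    "transformers": "4.57.0",
    "flash-linear-attention": "0.4.0",
    "triton": "3.2.0",
}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Local build tags such as "+cu121" are ignored when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_versions(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means every checked package matches its pin.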
116
+
117
+ ### Using 🤗 Transformers to Chat
118
+
119
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and processor
+ model_path = "hustvl/InfiniteVL"  # Stage 2 repo ID from the Model Zoo; replace with your own repo ID or local path
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Prepare inputs
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+             },
+             {"type": "text", "text": "Describe this image."},
+         ],
+     }
+ ]
+
+ # Process inputs: render the chat template, then load the referenced images/videos
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate, then strip the prompt tokens from each output sequence
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text[0])
+ ```
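
The trimming step in the snippet above is needed because `generate` returns each prompt followed by the newly generated tokens. The sketch below shows the same slicing on plain Python lists; the token IDs are made up and `trim_prompt` is a hypothetical helper used only for illustration.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each output starts with its own prompt.
input_ids = [[101, 7, 8], [101, 9]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 55]]

print(trim_prompt(input_ids, generated_ids))  # [[42, 43], [55]]
```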
169
+ <details>
170
+ <summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>
171
+
172
+ InfiniteVL supports inputting multiple images in a single turn for comparison or storytelling.
173
+
174
+ ```python
+ # Reuses `model`, `processor`, and `process_vision_info` from the chat example above.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+             },
+             {
+                 "type": "image",
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+             },
+             {"type": "text", "text": "What are the similarities between these two images?"},
+         ],
+     }
+ ]
+
+ # Process
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
+ ```
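
Since the image, multi-image, and video examples differ only in their `messages` payload, the payload can be built by a small function. This is a sketch: `build_user_message` is a hypothetical helper, not part of `transformers` or `qwen_vl_utils`, and the file paths are placeholders.

```python
def build_user_message(text, images=(), videos=()):
    """Build a single-turn `messages` list in the format used above.

    Hypothetical helper for illustration; not part of any library.
    """
    content = [{"type": "image", "image": src} for src in images]
    content += [{"type": "video", "video": src} for src in videos]
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


messages = build_user_message(
    "What are the similarities between these two images?",
    images=["file:///path/to/a.jpg", "file:///path/to/b.jpg"],  # placeholder paths
)
print([item["type"] for item in messages[0]["content"]])
```

The resulting `messages` list can then go through the same process and generate steps as before.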
210
+
211
+ </details>
212
+ <details>
213
+ <summary><strong>🎥 Video Inference (Click to expand)</strong></summary>
214
+
215
+ ```python
+ # Reuses `model`, `processor`, and `process_vision_info` from the chat example above.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "file:///path/to/video.mp4",
+                 "max_pixels": 360 * 420,  # upper bound on pixels per sampled frame
+                 "fps": 1.0,  # sample one frame per second
+             },
+             {"type": "text", "text": "Describe this video."},
+         ],
+     }
+ ]
+
+ # Process
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Generate
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
+ ```
+
+ </details>
+
250
+ ## 🎥 Advanced Usage (CUDA Graph)
251
+
252
+ Please refer to the guide on the [GitHub page](https://github.com/hustvl/InfiniteVL).
253
+
254
+ ## Citation
255
+
256
+ If you find InfiniteVL useful for your research or applications, please consider citing our paper:
257
+
258
+ ```bibtex
+ @article{tao2025infinitevl,
+   title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
+   author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
+   journal={arXiv preprint},
+   year={2025}
+ }
+ ```
266
+
267
+ ## Acknowledgement
268
+
269
+ InfiniteVL builds on the work of the open-source community. We would like to express our gratitude to:
270
+
271
+ * **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
272
+ * **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and kernel implementations (FLA).
273
+ * **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.