---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- image-text-to-text
- linear-attention
- gated-deltanet
- infinitevl
- multimodal
---
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/Logo.png" width="500" alt="InfiniteVL Logo">
<hr>
### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Hongyuan Tao<sup>1</sup>,
[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
Haoran Yin<sup>2</sup>,
[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
[Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>
<sup>1</sup>Huazhong University of Science and Technology,
<sup>2</sup>Horizon Robotics
(✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
<br>
<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
</div>
## Introduction
**InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.
By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception and **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
</div>
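To make the division of labor concrete, here is a toy, purely illustrative sketch (not the actual InfiniteVL implementation): a sliding-window mask captures the local-attention half, while a fixed-size recurrent state update captures the linear-attention half. The window size, dimensions, and the simplified (ungated) delta-rule update are illustrative assumptions; Gated DeltaNet's real update additionally includes learned gating.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask restricted to a local window (the SWA half of the hybrid)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)      # token i attends to [i-window+1, i]

def delta_rule_step(S, q, k, v, beta):
    """One simplified (ungated) delta-rule update: the linear-attention half.

    S is a fixed-size (d_k x d_v) state, so memory stays constant no matter
    how long the stream gets. Gated DeltaNet adds a learned decay gate on S.
    """
    S = S + beta * torch.outer(k, v - k @ S)  # write the prediction error at key k
    return S, q @ S                           # read out for the current query

print(sliding_window_mask(6, 3).int())
```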
### ✨ Key Highlights
* 🚀 **High Efficiency:** Achieves a **>3.6×** inference speedup and a constant memory footprint compared to FlashAttention-2-accelerated Transformers (a simple way to check the memory behavior yourself is sketched after this list).
* ⚡ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested beyond 500K tokens) without OOM errors.
* 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.
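The constant-memory claim is straightforward to sanity-check on your own hardware. Below is a minimal, hypothetical helper (not part of the release) that reports peak CUDA memory around a `generate` call; with a model and inputs prepared as in *Getting Started* below, the peak should stay roughly flat as the input stream grows:

```python
import torch

def peak_generate_memory_mib(model, inputs, max_new_tokens: int = 64) -> float:
    """Return peak CUDA memory (MiB) observed during one generate() call."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    return torch.cuda.max_memory_allocated() / 2**20
```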
## Model Zoo
We release two versions of InfiniteVL-4B to cater to different application scenarios.
| Model | Stage | Description | Training Context Length | Download |
| :--- | :---: | :--- | :---: | :---: |
| **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint taken directly after instruction SFT. It delivers **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust general knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
| **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned with only a **small amount** of long-sequence multimodal data. It activates length generalization for streaming scenarios, though its full potential at extreme context lengths is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |
> **💡 Recommendations:**
>
> * **For Long-Context Inference:** Use the **Stage 3** model. It enables stable streaming inference and avoids runaway memory growth.
> * **For Training / Fine-tuning:** We strongly recommend starting from the **Stage 2** model. Because it retains the strongest general capabilities and has not yet shifted toward the long-context data distribution, it is the best foundation for adapting to new tasks or domains.
## Getting Started
### 🛠️ Environment Setup
We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.
**1. Create and activate a virtual environment:**
```bash
conda create -n infinitevl python=3.11 -y
conda activate infinitevl
```
**2. Install dependencies:**
The core dependencies are listed below:
```bash
# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1
# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1
# --- Linear Attention & Kernels (Critical) ---
# Note: These often require specific CUDA environments to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0
```
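One possible way to install these (a sketch; exact pins and build order depend on your CUDA setup) is shown below. The kernel packages are built last against your local toolchain:

```bash
# Sketch: install the pinned packages with pip (adjust index URLs for your CUDA version)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
pip install transformers==4.57.0 accelerate==1.8.1 qwen-vl-utils==0.0.11 \
    decord==0.6.0 opencv-python==4.11.0.86 pillow==10.4.0 timm==1.0.22 einops==0.8.1
# Kernel packages compile against the local CUDA toolchain; flash-attn in
# particular usually needs --no-build-isolation so it can see torch at build time.
pip install triton==3.2.0 causal-conv1d==1.5.0.post5
pip install flash-attn==2.7.4.post1 --no-build-isolation
pip install flash-linear-attention==0.4.0 fla-core==0.4.0
```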
### Using 🤗 Transformers to Chat
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load Model
model_path = "hustvl/InfiniteVL" # Replace with your HF repo ID
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare Inputs
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
<details>
<summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>
InfiniteVL supports passing multiple images in a single turn, e.g., for comparison or storytelling.
```python
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "What are the similarities between these two images?"},
],
}
]
# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
</details>
<details>
<summary><strong>🎥 Video Inference (Click to expand)</strong></summary>
```python
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
</details>
## 🔥 Advanced Usage (CUDA Graph)
Please refer to the guide on the [GitHub page](https://github.com/hustvl/InfiniteVL).
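The repository's guide covers the InfiniteVL-specific setup. As general background only, the standard PyTorch pattern for CUDA-graph replay looks like the following generic sketch (the toy `Linear` model and shapes are placeholders, not InfiniteVL code): capture one fixed-shape step, then replay it with new data copied into static buffers.

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
static_input = torch.randn(1, 16, device="cuda")

# Warm up on a side stream so lazy kernel compilation happens before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one fixed-shape forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: overwrite the static input in place, then launch the whole graph
static_input.copy_(torch.randn(1, 16, device="cuda"))
g.replay()
print(static_output)
```

Replaying a captured graph skips per-kernel launch overhead, which is what makes it attractive for the small, fixed-shape steps of autoregressive decoding.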
## Citation
If you find InfiniteVL useful for your research or applications, please consider citing our paper:
```bibtex
@article{tao2025infinitevl,
title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2512.08829},
year={2025}
}
```
## Acknowledgement
InfiniteVL is built upon the giants of the open-source community. We would like to express our gratitude to:
* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models. |