---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- image-text-to-text
- linear-attention
- gated-deltanet
- infinitevl
- multimodal
---

<div align="center">

<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/Logo.png" width="500" alt="InfiniteVL Logo">

<hr>

### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao<sup>1</sup>,
[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
Haoran Yin<sup>2</sup>,
[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
[Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

<sup>1</sup>Huazhong University of Science and Technology,
<sup>2</sup>Horizon Robotics

(✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>

<br>
<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>

</div>

## Introduction

**InfiniteVL** is a linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks that traditional Transformers face when processing **unlimited multimodal streams**.

InfiniteVL combines **Sliding Window Attention (SWA)** for fine-grained local perception with **Gated DeltaNet** for efficient long-term memory. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
</div>
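
To make the constant-memory idea concrete, here is a minimal pure-Python sketch. It is not the InfiniteVL implementation or the flash-linear-attention kernels. It assumes the gated delta rule as it is commonly written for Gated DeltaNet, `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T`, and models the sliding window as a fixed-length queue. The function names, the scalar gates, and the toy dimensions are illustrative assumptions.

```python
from collections import deque


def gated_delta_step(state, k, v, alpha, beta):
    """Apply S <- alpha * S (I - beta k k^T) + beta v k^T to a d_v x d_k state."""
    new_state = []
    for i, row in enumerate(state):
        # (S k)_i: the value component the memory currently returns for key k.
        pred = sum(s * kj for s, kj in zip(row, k))
        new_state.append(
            [alpha * (s - beta * pred * kj) + beta * v[i] * kj for s, kj in zip(row, k)]
        )
    return new_state


def stream(tokens, d_k=2, d_v=2, window=4, alpha=0.9, beta=0.5):
    """Consume (key, value) pairs one at a time with fixed-size memory."""
    state = [[0.0] * d_k for _ in range(d_v)]  # long-term memory, never grows
    local = deque(maxlen=window)  # sliding window: only the latest `window` tokens
    for k, v in tokens:
        state = gated_delta_step(state, k, v, alpha, beta)
        local.append((k, v))
    return state, local


if __name__ == "__main__":
    # Hypothetical toy stream of 1000 tokens with 2-dimensional keys and values.
    tokens = [([1.0, 0.0], [float(t), 1.0]) for t in range(1000)]
    state, local = stream(tokens)
    print(len(state), len(state[0]), len(local))  # prints: 2 2 4
```

However long the stream is, the recurrent state keeps its `d_v x d_k` shape and the window holds at most `window` tokens. A standard Transformer KV cache, by contrast, grows with every token it has seen.
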

### ✨ Key Highlights
* 🚀 **High Efficiency:** Achieves **>3.6×** inference speedup and a constant memory footprint compared to FlashAttention-2 accelerated Transformers.
* ⚡ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
* 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a broad range of benchmarks.

## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

| Model | Stage | Description | Training Context Length | Download |
| :--- | :---: | :--- | :---: | :---: |
| **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint directly after Instruction SFT. It delivers the **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
| **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned using only a **small amount** of long-sequence multimodal data. It successfully activates length generalization for streaming scenarios, though its potential on extreme context lengths is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |

> **💡 Recommendations:**
>
> * **For Long-Context Inference:** Please use the **Stage 3** model. It enables stable streaming inference and avoids unbounded memory growth.
> * **For Training / Fine-tuning:** We strongly recommend using the **Stage 2** model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.
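
The recommendations above can be captured in a small lookup, sketched below. The repository IDs come from the Model Zoo table. The helper name `pick_checkpoint` and the use-case labels are hypothetical and not part of any InfiniteVL API.

```python
# Repo IDs are taken from the Model Zoo table; the labels are illustrative.
CHECKPOINTS = {
    "long_context": "hustvl/InfiniteVL-LongSFT",  # Stage 3: streaming / long inputs
    "fine_tuning": "hustvl/InfiniteVL",  # Stage 2: recommended starting point
    "general": "hustvl/InfiniteVL",  # Stage 2: standard benchmarks
}


def pick_checkpoint(use_case: str) -> str:
    """Return the Hugging Face repo ID recommended for a given use case."""
    try:
        return CHECKPOINTS[use_case]
    except KeyError:
        options = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {options}") from None


if __name__ == "__main__":
    print(pick_checkpoint("long_context"))  # prints: hustvl/InfiniteVL-LongSFT
```

The returned string can be passed as `model_path` to `from_pretrained`.
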

## Getting Started

### 🛠️ Environment Setup

We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.

**1. Create and activate a virtual environment:**
```bash
conda create -n infinitevl python=3.11 -y
conda activate infinitevl
```
**2. Install dependencies:**

The core dependencies are listed below:
```bash
# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1

# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1

# --- Linear Attention & Kernels (Critical) ---
# Note: These often require specific CUDA environments to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0
```
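
Because the kernel packages are version-sensitive, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the standard library's `importlib.metadata`. The helper name `find_mismatches` and the choice of which pins to check are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.57.0",
    "flash-attn": "2.7.4.post1",
    "flash-linear-attention": "0.4.0",
    "causal-conv1d": "1.5.0.post5",
    "triton": "3.2.0",
}


def find_mismatches(pins, get_version=version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Drop local build tags (a "+..." suffix) before comparing.
            found = get_version(name).split("+")[0]
        except PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every checked package matches its pin. The `get_version` parameter exists only so the check can be exercised without touching the real environment.
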

### Using 🤗 Transformers to Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor.
# trust_remote_code=True lets transformers run the model code hosted in the repo.
model_path = "hustvl/InfiniteVL"  # Stage 2; use "hustvl/InfiniteVL-LongSFT" for Stage 3
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare inputs: one user turn containing an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process inputs: render the chat template, load the visual data, then tokenize
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens from each output before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
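
The trimming step in the example is needed because, for decoder-only models, `generate` returns each prompt followed by the newly generated tokens. The sketch below isolates that logic on plain Python lists. The helper name `trim_prompt` and the token IDs are made up for illustration.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the start of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    prompt = [[101, 7, 8]]  # made-up token ids for one prompt
    generated = [[101, 7, 8, 42, 43]]  # the prompt followed by two new tokens
    print(trim_prompt(prompt, generated))  # prints: [[42, 43]]
```

Without this step, the decoded output would repeat the prompt text before the model's answer.
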

<details>
<summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>

InfiniteVL accepts multiple images in a single turn, for example for comparison or storytelling.

```python
# Reuses `model`, `processor`, and `process_vision_info` from the chat example above.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What are the similarities between these two images?"},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>
<details>
<summary><strong>🎥 Video Inference (Click to expand)</strong></summary>

```python
# Reuses `model`, `processor`, and `process_vision_info` from the chat example above.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
</details>

## 🎥 Advanced Usage (CUDA Graph)

Please refer to the guide in the [GitHub repository](https://github.com/hustvl/InfiniteVL).

## Citation

If you find InfiniteVL useful for your research or applications, please consider citing our paper:

```bibtex
@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint arXiv:2512.08829},
  year={2025}
}
```

## Acknowledgement

InfiniteVL stands on the shoulders of giants in the open-source community. We would like to express our gratitude to:

* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and optimized kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.