	---
	library_name: transformers
	license: apache-2.0
	pipeline_tag: image-text-to-text
	tags:
	- vision-language-model
	- image-text-to-text
	- linear-attention
	- gated-deltanet
	- infinitevl
	- multimodal
	---

	<div align="center">

	<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/Logo.png" width="500" alt="InfiniteVL Logo">

	<hr>

	### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

	Hongyuan Tao<sup>1</sup>,
	[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
	[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
	Haoran Yin<sup>2</sup>,
	[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
	[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
	[Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

	<sup>1</sup>Huazhong University of Science and Technology,
	<sup>2</sup>Horizon Robotics

	(✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>

	<br>
	<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
	<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>

	</div>

	## Introduction

	InfiniteVL is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing unlimited multimodal streams.


	By synergizing Sliding Window Attention (SWA) for fine-grained local perception and Gated DeltaNet for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.

	<div align="center">
	<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
	</div>
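
To make the two ingredients above concrete, here is a minimal pure-Python sketch. It is not InfiniteVL's implementation, which relies on fused kernels. It assumes the gated delta rule as published for Gated DeltaNet, `S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ`, and a causal window of `window` tokens. All function names and toy dimensions are illustrative.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` tokens)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def gated_delta_step(state, k, v, alpha, beta):
    """One recurrent update of a d_v x d_k memory matrix.

    state: list of d_v rows, each holding d_k floats
    alpha: decay gate in [0, 1]; beta: write strength in [0, 1]
    """
    # What the memory currently returns for key k: state @ k (length d_v).
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in state]
    # Erase the old value stored under k, apply the decay, then write the new value.
    return [
        [alpha * (state[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def read(state, q):
    """Query the memory: state @ q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]


# Toy usage: two writes under the same key; the second overwrites the first.
state = [[0.0, 0.0], [0.0, 0.0]]
state = gated_delta_step(state, k=[1.0, 0.0], v=[1.0, 2.0], alpha=1.0, beta=1.0)
state = gated_delta_step(state, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
print(read(state, [1.0, 0.0]))            # [5.0, 6.0]
print(sum(sliding_window_mask(6, 3)[5]))  # 3 keys visible to the last query
```

The state keeps its `d_v x d_k` shape however many tokens are processed, and each query sees at most `window` keys. Together these two properties are what make a constant memory footprint possible.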

	### ✨ Key Highlights
	* 🚀 High Efficiency: Achieves >3.6× inference speedup and constant memory footprint compared to FlashAttention-2 accelerated Transformers.
	* ⚡ Real-Time Streaming: Sustains a stable 24 FPS prefill speed on a single NVIDIA RTX 4090 for continuous video understanding.
	* 🧠 Unlimited Context: Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
	* 🏆 Strong Performance: Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a broad set of benchmarks.

	## Model Zoo

	We release two versions of InfiniteVL-4B to cater to different application scenarios.

	| Model | Stage | Description | Training Context Length | Download |
	| :--- | :---: | :--- | :---: | :---: |
	| InfiniteVL-4B | Stage 2 | Best Generalist / Base. The checkpoint directly after Instruction SFT. It delivers the peak foundational performance on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
	| InfiniteVL-4B-LongSFT | Stage 3 | Long-Context Adapted. Fine-tuned using only a small amount of long-sequence multimodal data. It successfully activates length generalization for streaming scenarios, though its full potential on extreme contexts is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |


	> 💡 Recommendations:
	>
	> * For Long-Context Inference: Please use the Stage 3 model. It enables stable streaming inference and avoids memory explosion.
	> * For Training / Fine-tuning: We strongly recommend using the Stage 2 model as your starting point. Since it maintains the strongest general capabilities and hasn't shifted towards the specific long-context distribution, it serves as the best foundation for adaptation to new tasks or domains.
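
The recommendations above can be captured as a small lookup. This is only an illustrative sketch: the repo IDs come from the Model Zoo table, while `pick_checkpoint` and the use-case labels are hypothetical names, not part of any InfiniteVL API.

```python
# Repo IDs are taken from the Model Zoo table; the labels are illustrative.
CHECKPOINTS = {
    "general": "hustvl/InfiniteVL",               # Stage 2: strongest general capabilities
    "finetune": "hustvl/InfiniteVL",              # Stage 2: recommended starting point
    "long_context": "hustvl/InfiniteVL-LongSFT",  # Stage 3: streaming / long inputs
}


def pick_checkpoint(use_case):
    """Return the Hugging Face repo ID recommended for a given use case."""
    try:
        return CHECKPOINTS[use_case]
    except KeyError:
        valid = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {valid}") from None


print(pick_checkpoint("long_context"))  # hustvl/InfiniteVL-LongSFT
```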

	## Getting Started

	### 🛠️ Environment Setup

	We recommend using Anaconda or Miniconda to manage the environment. The code is tested on Python 3.11 + PyTorch 2.6.0 + CUDA 12.1.

	1. Create and activate a virtual environment:
	```bash
	conda create -n infinitevl python=3.11 -y
	conda activate infinitevl
	```
	2. Install dependencies:

	The core dependencies are listed below:
	```bash
	# --- Core Deep Learning ---
	torch==2.6.0
	torchvision==0.21.0
	torchaudio==2.6.0
	transformers==4.57.0
	accelerate==1.8.1

	# --- Vision & Multimodal ---
	qwen-vl-utils==0.0.11
	decord==0.6.0
	opencv-python==4.11.0.86
	pillow==10.4.0
	timm==1.0.22
	einops==0.8.1

	# --- Linear Attention & Kernels (Critical) ---
	# Note: These often require specific CUDA environments to build
	flash-attn==2.7.4.post1
	flash-linear-attention==0.4.0
	fla-core==0.4.0
	causal-conv1d==1.5.0.post5
	triton==3.2.0
	```
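
Because the kernel packages are sensitive to version mismatches, it can help to compare the installed versions against the pins before loading the model. The following is a minimal sketch using only the standard library. `check_versions` and the `PINNED` subset are illustrative choices, not part of the InfiniteVL codebase.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.57.0",
    "flash-attn": "2.7.4.post1",
    "flash-linear-attention": "0.4.0",
    "causal-conv1d": "1.5.0.post5",
    "triton": "3.2.0",
}


def check_versions(pinned, lookup=version):
    """Return {package: (wanted, found, matches)}; `found` is None if not installed."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Ignore a local build suffix (the part after "+") when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "MISMATCH" if found else "missing"
        print(f"{name:<24} wanted {wanted:<12} found {found or '-':<16} {status}")
```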

	### Using 🤗 Transformers to Chat

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# Load Model
	model_path = "hustvl/InfiniteVL"  # Replace with your HF repo ID
	model = AutoModelForCausalLM.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)
	processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

	# Prepare Inputs
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {"type": "text", "text": "Describe this image."},
	        ],
	    }
	]

	# Process Inputs
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	).to(model.device)

	# Generate
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	# Drop the prompt tokens so only the newly generated tokens are decoded
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text[0])
	```
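
Since the process-and-generate steps are identical for image, multi-image, and video inputs, they can be wrapped in one helper. This is a convenience sketch built only from the calls shown above. `generate_reply` and its `vision_fn` parameter are hypothetical names introduced here; `vision_fn` defaults to `process_vision_info` from `qwen_vl_utils`.

```python
def generate_reply(model, processor, messages, max_new_tokens=128, vision_fn=None):
    """Run one chat turn and return the decoded reply for the first sample."""
    if vision_fn is None:
        # Imported lazily so the helper can be defined without the dependency.
        from qwen_vl_utils import process_vision_info as vision_fn

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = vision_fn(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated tokens are decoded.
    trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

With `model`, `processor`, and `messages` defined as above, `print(generate_reply(model, processor, messages))` would reproduce the example output.
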
	<details>
	<summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>

	InfiniteVL accepts multiple images in a single turn, for example for comparison or storytelling.

	```python
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {"type": "text", "text": "What are the similarities between these two images?"},
	        ],
	    }
	]

	# Process
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	).to(model.device)

	# Generate
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
	```

	</details>
	<details>
	<summary><strong>🎥 Video Inference (Click to expand)</strong></summary>

	```python
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "video",
	                "video": "file:///path/to/video.mp4",
	                "max_pixels": 360 * 420,
	                "fps": 1.0,
	            },
	            {"type": "text", "text": "Describe this video."},
	        ],
	    }
	]

	# Process
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	).to(model.device)

	# Generate
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
	```
	</details>
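
The three `messages` layouts above differ only in their `content` entries, so a small builder can cover all of them. This is a sketch under that assumption. `build_messages` is a hypothetical helper, and its defaults for `fps` and `max_pixels` simply mirror the video example.

```python
def build_messages(prompt, images=(), video=None, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn user message with optional images and/or one video."""
    content = [{"type": "image", "image": src} for src in images]
    if video is not None:
        content.append(
            {"type": "video", "video": video, "max_pixels": max_pixels, "fps": fps}
        )
    # The text prompt goes last, as in the examples above.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_messages("Describe this video.", video="file:///path/to/video.mp4")
print(messages[0]["content"][0]["type"])  # video
```

The result can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the snippets above.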

	## 🎥 Advanced Usage (CUDA Graph)

	Please refer to the guide in the [GitHub repository](https://github.com/hustvl/InfiniteVL).

	## Citation

	If you find InfiniteVL useful for your research or applications, please consider citing our paper:

	```bibtex
	@article{tao2025infinitevl,
	  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
	  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
	  journal={arXiv preprint arXiv:2512.08829},
	  year={2025}
	}
	```

	## Acknowledgement

	InfiniteVL stands on the shoulders of giants in the open-source community. We would like to express our gratitude to:

	* [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL): For providing a powerful vision-language codebase and vision encoder.
	* [Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention): For the efficient linear attention mechanism and its kernel implementations (FLA).
	* Open-Source Datasets: We sincerely thank the creators of the high-quality datasets used in our training, including FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video, and others. Their contributions are essential to the development of efficient multimodal models.