Update README.md

<div align="center">

<!-- Place your logo here; remove this line if you have no logo -->
<img src="assets/Logo.png" width="500" alt="InfiniteVL Logo">

<hr>

### InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

<!-- Author list -->
Hongyuan Tao<sup>1</sup>,
[Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>,
[Shaoyu Chen](https://scholar.google.com/citations?user=PIeNN2gAAAAJ&hl=en&oi=sra)<sup>2</sup>,
Haoran Yin<sup>2</sup>,
[Qian Zhang](https://scholar.google.com/citations?user=pCY-bikAAAAJ&hl=zh-CN)<sup>2</sup>,
[Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>,
[Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

<!-- Affiliations -->
<sup>1</sup>Huazhong University of Science and Technology,
<sup>2</sup>Horizon Robotics

<!-- Corresponding author -->
(✉️) Corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>

<!-- Badges -->
<br>
<a href="https://arxiv.org/abs/2502.xxxxx"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
<a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>

</div>

## Introduction

**InfiniteVL** is a novel linear-complexity Vision-Language Model (VLM) architecture designed to overcome the computational bottlenecks of traditional Transformers in processing **unlimited multimodal streams**.

By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception and **Gated DeltaNet** for efficient long-term memory, InfiniteVL achieves a "best of both worlds" balance. It delivers competitive performance on standard benchmarks (comparable to Qwen2.5-VL) while enabling constant-memory inference and high-throughput streaming.

<div align="center">
<img src="assets/image1_new_01.png" width="800" alt="InfiniteVL Logo">
</div>

### ✨ Key Highlights

* 🚀 **High Efficiency:** Achieves **>3.6×** inference speedup and a constant memory footprint compared to FlashAttention-2-accelerated Transformers (see the illustrative sketch after this list).
* ⚡ **Real-Time Streaming:** Sustains a stable **24 FPS** prefill speed on a single **NVIDIA RTX 4090** for continuous video understanding.
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested beyond 500K tokens) without OOM errors.
* 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive set of benchmarks.
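
The sketch below gives a rough intuition for the constant-memory claim. It is an illustration only, not InfiniteVL's actual update rule (Gated DeltaNet uses a gated delta-rule state update); it simply contrasts a fixed-size linear-attention state with a softmax-attention KV cache that grows with the number of processed tokens.

```python
# Illustrative sketch only -- NOT the InfiniteVL implementation.
import torch

d = 64                                # toy head dimension
state = torch.zeros(d, d)             # linear attention: fixed-size recurrent state
kv_cache = []                         # softmax attention: cache of all past (k, v) pairs

for t in range(10_000):               # a long toy token stream
    k, v = torch.randn(d), torch.randn(d)
    state = state + torch.outer(k, v)   # state size never grows -> constant memory
    kv_cache.append((k, v))             # cache grows linearly with t -> memory scales with length
```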

## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

| Model | Stage | Description | Training Context Length | Download |
| :--- | :---: | :--- | :---: | :---: |
| **InfiniteVL-4B** | **Stage 2** | **Best Generalist / Base.** The checkpoint taken directly after instruction SFT. It delivers the **peak foundational performance** on standard multimodal benchmarks (e.g., OCR, MMMU, MathVista) and preserves the most robust general knowledge. | 8K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL) |
| **InfiniteVL-4B-LongSFT** | **Stage 3** | **Long-Context Adapted.** Fine-tuned on only a **small amount** of long-sequence multimodal data. It activates length generalization for streaming scenarios, though its full potential on extreme contexts is not yet fully exploited. | 32K | [🤗 Hugging Face](https://huggingface.co/hustvl/InfiniteVL-LongSFT) |

> **💡 Recommendations:**
>
> * **For long-context inference:** Use the **Stage 3** model. It enables stable streaming inference and avoids unbounded memory growth (see the loading example below).
> * **For training / fine-tuning:** We strongly recommend starting from the **Stage 2** model. Because it retains the strongest general capabilities and has not shifted towards the long-context data distribution, it is the best foundation for adapting to new tasks or domains.
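
For example, the Stage 3 checkpoint is loaded exactly like the Stage 2 one in the quick-start code below, only with a different repository ID (a minimal sketch mirroring that code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Stage 3 (long-context) checkpoint; use "hustvl/InfiniteVL" for the Stage 2 generalist.
model_path = "hustvl/InfiniteVL-LongSFT"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
```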

## Getting Started

### 🛠️ Environment Setup

We recommend using **Anaconda** or **Miniconda** to manage the environment. The code is tested on **Python 3.11** + **PyTorch 2.6.0** + **CUDA 12.1**.

**1. Create and activate a virtual environment:**

```bash
conda create -n infinitevl python=3.11 -y
conda activate infinitevl
```

**2. Install the dependencies:**

The core dependencies are listed below; you can save them to a `requirements.txt` file and install them with `pip install -r requirements.txt`:

```bash
# --- Core Deep Learning ---
torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.57.0
accelerate==1.8.1

# --- Vision & Multimodal ---
qwen-vl-utils==0.0.11
decord==0.6.0
opencv-python==4.11.0.86
pillow==10.4.0
timm==1.0.22
einops==0.8.1

# --- Linear Attention & Kernels (Critical) ---
# Note: these packages often require a matching CUDA environment to build
flash-attn==2.7.4.post1
flash-linear-attention==0.4.0
fla-core==0.4.0
causal-conv1d==1.5.0.post5
triton==3.2.0
```

### Using 🤗 Transformers to Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load Model
model_path = "InfiniteVL/InfiniteVL-4B"  # Replace with your HF repo ID
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare Inputs
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process Inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
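
Optionally, you can stream the decoded text token by token with Transformers' `TextStreamer`. This is a minimal sketch that reuses the `model`, `processor`, and `inputs` objects from the snippet above and assumes the processor exposes its tokenizer as `processor.tokenizer` (as Qwen2.5-VL-style processors do):

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated, skipping the prompt itself.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=128, streamer=streamer)
```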

<details>
<summary><strong>🖼️ Multi-Image Inference (Click to expand)</strong></summary>

InfiniteVL supports inputting multiple images in a single turn for comparison or storytelling.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What are the similarities between these two images?"},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>

<details>
<summary><strong>🎥 Video Inference (Click to expand)</strong></summary>

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Process
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```

</details>

## ⚙️ Advanced Usage (CUDA Graphs)

Please refer to the guidelines on the [GitHub page](https://github.com/hustvl/InfiniteVL).
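
For orientation, the snippet below shows the generic PyTorch pattern for capturing a fixed-shape forward pass in a CUDA graph. It is an illustration of the mechanism only, using a hypothetical stand-in module, not InfiniteVL's actual integration, which is documented in the repository.

```python
import torch

# CUDA graphs require static shapes: the same pre-allocated tensors are reused on every replay.
device = "cuda"
net = torch.nn.Linear(4096, 4096).to(device).half()               # stand-in for one decode step
static_input = torch.zeros(1, 4096, device=device, dtype=torch.half)

# Warm up on a side stream before capture, as recommended by PyTorch.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        net(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one forward pass into a graph, then replay it with new data.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = net(static_input)

static_input.copy_(torch.randn(1, 4096, device=device, dtype=torch.half))
graph.replay()
print(static_output.shape)
```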

## Citation

If you find InfiniteVL useful for your research or applications, please consider citing our paper:

```bibtex
@article{tao2025infinitevl,
  title={InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models},
  author={Tao, Hongyuan and Liao, Bencheng and Chen, Shaoyu and Yin, Haoran and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
  journal={arXiv preprint},
  year={2025}
}
```

## Acknowledgement

InfiniteVL is built on the shoulders of giants in the open-source community. We would like to express our gratitude to:

* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.