Improve model card: Update arXiv link and add comprehensive details from GitHub
#1
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,14 +1,13 @@
 ---
-license: apache-2.0
 library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - vision-language-model
-- image-text-to-text
 - linear-attention
 - gated-deltanet
 - infinitevl
 - multimodal
-pipeline_tag: image-text-to-text
 ---
 
 <div align="center">

@@ -33,8 +32,9 @@ Haoran Yin<sup>2</sup>,
 (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
 
 <br>
-<a href="https://arxiv.org/abs/
+<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
 <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
+<a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
 
 </div>

@@ -55,6 +55,102 @@ By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
* 📈 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.

## News

* `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check [Model Zoo](#model-zoo).
* `Dec. 10th, 2025`: We release our paper on [arXiv](https://arxiv.org/abs/2512.08829).
## Table of Contents

* [Introduction](#introduction)
* [Key Highlights](#key-highlights)
* [News](#news)
* [Architecture](#architecture)
* [Training Strategy](#training-strategy)
* [Performance](#performance)
* [Model Zoo](#model-zoo)
* [Getting Started](#getting-started)
* [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
* [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
* [Contact](#contact)
* [Citation](#citation)
* [Acknowledgement](#acknowledgement)
## Architecture

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
</div>
<br>

**InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.

### Key Design Highlights

* **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave:
  * **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
  * **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.
* **Constant Memory Footprint**: Unlike traditional Transformers, where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams.
* **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds" (see the sketch after this list):
  * Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
  * Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
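
To make the hybrid layout concrete, here is a minimal PyTorch-style sketch of one such block. The class and layer interfaces are illustrative assumptions, not the released implementation; only the 1-SWA-plus-3-DeltaNet interleaving and the fixed-size recurrent state come from the description above.

```python
import torch.nn as nn

class HybridBlock(nn.Module):
    """Sketch of one hybrid block: 1 SWA layer followed by 3 Gated DeltaNet layers.

    `swa_layer` and the modules in `deltanet_layers` are hypothetical
    stand-ins for the real layers (e.g., from flash-linear-attention);
    only the 1:3 interleaving ratio is taken from the model card.
    """

    def __init__(self, swa_layer: nn.Module, deltanet_layers: nn.ModuleList):
        super().__init__()
        self.swa = swa_layer              # local, fine-grained perception
        self.deltanet = deltanet_layers   # global, linear-complexity memory

    def forward(self, hidden, states):
        # SWA attends only within a sliding window, so its cache is bounded.
        hidden = self.swa(hidden)
        # Each DeltaNet layer folds new tokens into a fixed-size state
        # (e.g., 16 x 128 x 256): memory never grows with stream length.
        new_states = []
        for layer, state in zip(self.deltanet, states):
            hidden, state = layer(hidden, state)
            new_states.append(state)
        return hidden, new_states
```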
## Training Strategy

To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
</div>
### Stage 1: Distillation Pretraining (Efficient Initialization)

* **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
* **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use a **layer-wise MSE loss** (to align internal states) and an **end-to-end KL divergence** (to align output logits); a sketch of this combined objective follows the list below.
* **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
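
As a rough illustration, a combined objective of this shape could look as follows in PyTorch; the loss weights and temperature `tau` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_hiddens, teacher_hiddens,
                 student_logits, teacher_logits,
                 mse_weight=1.0, kl_weight=1.0, tau=1.0):
    # Layer-wise MSE: align internal states of matched layers.
    mse = torch.stack([
        F.mse_loss(s, t.detach())
        for s, t in zip(student_hiddens, teacher_hiddens)
    ]).mean()
    # End-to-end KL divergence: align output distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return mse_weight * mse + kl_weight * kl
```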
### Stage 2: Instruction SFT (General Capabilities)

* **Goal:** Unlock strong instruction-following and reasoning capabilities.
* **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
* **Settings:** Image resolution increased to **1344×1344**; max context length set to 8,192.
* **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.
### Stage 3: Long-Sequence SFT (Context Extension)

* **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
* **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
* **Method:** **LoRA** fine-tuning with context length extended to **32,768** (see the sketch after this list).
* **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.
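
For orientation, here is a minimal sketch of LoRA fine-tuning with the Hugging Face `peft` library; the rank, alpha, dropout, target modules, and checkpoint path below are illustrative assumptions, as the card only states that LoRA is used with a 32,768-token context.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hyperparameters here are illustrative, not the released configuration.
base = AutoModelForCausalLM.from_pretrained("path/to/stage2-checkpoint")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train
# ...then run SFT with sequences packed up to 32,768 tokens.
```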
## Performance

### 🚀 Efficiency & Streaming

**InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models, whose cost grows linearly with history, InfiniteVL maintains **constant** computational cost and memory usage (see the back-of-the-envelope sketch below).

> **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
<br>
<em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
</div>
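
To see why the gap widens, here is a back-of-the-envelope comparison of a growing KV cache against a fixed recurrent state. All layer, head, and dimension counts are illustrative assumptions for a model of this scale; only the 16 × 128 × 256 state shape echoes the architecture notes above.

```python
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=4, head_dim=128, fp_bytes=2):
    # Transformer KV cache: keys + values for every past token -> O(N).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * fp_bytes

def deltanet_state_bytes(n_layers=27, n_heads=16, key_dim=128, value_dim=256, fp_bytes=2):
    # Gated DeltaNet: one fixed-size state per layer (16 x 128 x 256) -> O(1).
    return n_layers * n_heads * key_dim * value_dim * fp_bytes

for n in (8_192, 131_072, 524_288):  # up to the >500K-token regime
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / 2**30:6.2f} GiB"
          f" vs. fixed state {deltanet_state_bytes() / 2**30:.3f} GiB")
```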
### 📊 Multimodal Benchmarks

InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **hybrid architecture** and **high-quality training strategies**, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
<br>
<em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.</em>
</div>
<br>

**Key Takeaways:**

* **Best-in-Class Linear Model:** Significantly outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
* **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.
## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

@@ -243,9 +339,60 @@ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
```
</details>
## 🚀 Advanced Usage: CUDA Graph Acceleration

Unlike Transformer-based VLMs, where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization (a generic capture-and-replay sketch follows below).

This is the key technology behind our **24 FPS** real-time streaming performance.
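
As a generic illustration of the capture-and-replay pattern that fixed shapes and a constant-size state make possible, here is a minimal PyTorch CUDA Graphs sketch; the stand-in module and buffer shapes are placeholders, not InfiniteVL's actual API.

```python
import torch
import torch.nn as nn

# Stand-in module with fixed-shape input; not InfiniteVL's actual API.
model = nn.Linear(4096, 4096, device="cuda")
static_in = torch.zeros(256, 4096, device="cuda")   # reused input buffer

# Warm up once before capture so lazy initialization happens eagerly.
static_out = model(static_in)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)   # recorded into the graph, not executed

def step(new_chunk: torch.Tensor) -> torch.Tensor:
    # Steady state: refill the static buffer, then replay the whole graph
    # with a single launch -- no per-kernel Python overhead.
    static_in.copy_(new_chunk)
    graph.replay()
    return static_out.clone()
```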
### ⚡ Accelerated Streaming Inference

Because the memory state is constant-size, the entire streaming-prefill computation graph can be captured once and replayed for every incoming chunk, eliminating kernel launch overheads.

We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate this capability.

> **🎥 Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
>
> **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.
#### 1. Run the Simulation Demo

```bash
# Make sure you are in the project root
python examples/demo_streaming_inference.py \
    --model_path /path/to/InfiniteVL-4B \
    --video_path assets/demo.mp4 \
    --fps 30
```
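
Under the hood, a simulation of this kind boils down to a loop like the following; `load_streaming_model`, `stream_prefill`, and `generate_answer` are hypothetical placeholders sketching the flow, not the demo script's real entry points.

```python
import cv2  # pip install opencv-python

# Hypothetical sketch of the frame-by-frame streaming loop; function
# names are placeholders, not the real API of the demo script.
model = load_streaming_model("/path/to/InfiniteVL-4B")  # hypothetical loader
cap = cv2.VideoCapture("assets/demo.mp4")
state = None  # constant-size linear memory state, updated in place

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each frame is encoded to vision tokens and folded into the state;
    # memory does not grow, so the stream can run indefinitely.
    state = model.stream_prefill(frame, state)

print(model.generate_answer("What did the storefront sign say?", state))
cap.release()
```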
### ⚡ Accelerated Decode

In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.

> 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!
## Qualitative Analysis & Visualization

We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.

### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)

InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
</div>
### 2. Long-Term Streaming Understanding

The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.

The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details such as the "NBC Studios" text or the color of a pedestrian's bag without forgetting.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
</div>
## Contact

If you have any questions, please contact Hongyuan Tao via email (hongyuantao@hust.edu.cn).

## Citation

@@ -266,4 +413,4 @@ InfiniteVL is built upon the giants of the open-source community. We would like

* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.