Improve model card: Update arXiv link and add comprehensive details from GitHub

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +154 -7
README.md CHANGED
@@ -1,14 +1,13 @@
1
  ---
2
- license: apache-2.0
3
  library_name: transformers
 
 
4
  tags:
5
  - vision-language-model
6
- - image-text-to-text
7
  - linear-attention
8
  - gated-deltanet
9
  - infinitevl
10
  - multimodal
11
- pipeline_tag: image-text-to-text
12
  ---
13
 
14
  <div align="center">
@@ -33,8 +32,9 @@ Haoran Yin<sup>2</sup>,
33
  (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
34
 
35
  <br>
36
- <a href="https://arxiv.org/abs/2502.xxxxx"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
37
  <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
 
38
 
39
  </div>
40
 
@@ -55,6 +55,102 @@ By synergizing **Sliding Window Attention (SWA)** for fine-grained local percept
55
  * 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
56
  * 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.
57
 
58
  ## Model Zoo
59
 
60
  We release two versions of InfiniteVL-4B to cater to different application scenarios.
@@ -243,9 +339,60 @@ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
243
  ```
244
  </details>
245
 
246
- ## 🎥 Advanced Usage (Cuda Graph)
247
 
248
- Please refer to the guideline in the [github page](https://github.com/hustvl/InfiniteVL).
 
249
 
250
  ## Citation
251
 
@@ -266,4 +413,4 @@ InfiniteVL is built upon the giants of the open-source community. We would like
266
 
267
  * **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
268
  * **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
269
- * **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.
 
1
  ---
 
2
  library_name: transformers
3
+ license: apache-2.0
4
+ pipeline_tag: image-text-to-text
5
  tags:
6
  - vision-language-model
 
7
  - linear-attention
8
  - gated-deltanet
9
  - infinitevl
10
  - multimodal
 
11
  ---
12
 
13
  <div align="center">
 
32
  (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
33
 
34
  <br>
35
+ <a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
36
  <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
37
+ <a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
38
 
39
  </div>
40
 
 
55
  * 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
56
  * 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.
57
 
58
+ ## News
59
+ * `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check [Model Zoo](#model-zoo).
60
+ * `Dec. 10th, 2025`: We release our paper on [arXiv](https://arxiv.org/abs/2512.08829).
61
+
62
+ ## Table of Contents
63
+
64
+ * [Introduction](#introduction)
65
+ * [Key Highlights](#key-highlights)
66
+ * [News](#news)
67
+ * [Architecture](#architecture)
68
+ * [Training Strategy](#training-strategy)
69
+ * [Performance](#performance)
70
+ * [Model Zoo](#model-zoo)
71
+ * [Getting Started](#getting-started)
72
+ * [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
73
+ * [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
74
+ * [Contact](#contact)
75
+ * [Citation](#citation)
76
+ * [Acknowledgement](#acknowledgement)
77
+
78
+ ## Architecture
79
+
80
+ <div align="center">
81
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
82
+ </div>
83
+ <br>
84
+
85
+ **InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.
86
+
87
+ ### Key Design Highlights
88
+
89
+ * **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave:
90
+ * **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
91
+ * **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.
92
+
93
+ * **Constant Memory Footprint**: Unlike traditional Transformers where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams. A minimal sketch of this fixed-size recurrence is shown after this list.
94
+
95
+ * **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds":
96
+ * Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
97
+ * Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
98
+
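+ To make the constant-memory claim concrete, here is a minimal, self-contained sketch. It is illustrative only: the state shape matches the example above, but the simple decay-and-write update is a stand-in for the actual Gated DeltaNet recurrence (the released model uses the optimized FLA kernels).
+ 
+ ```python
+ import torch
+ 
+ # Fixed-size memory state (example shape from the text above); it never grows.
+ num_heads, head_dim, state_dim = 16, 128, 256
+ state = torch.zeros(num_heads, head_dim, state_dim)
+ 
+ def update(state, k, v, q, alpha=0.95, beta=0.05):
+     # Decay the old memory, write the new key/value outer product into it,
+     # then read out the current token's representation with the query.
+     state = alpha * state + beta * torch.einsum("hd,he->hde", k, v)
+     out = torch.einsum("hde,hd->he", state, q)
+     return state, out
+ 
+ for _ in range(10_000):  # stream as many tokens as you like
+     k = torch.randn(num_heads, head_dim)
+     q = torch.randn(num_heads, head_dim)
+     v = torch.randn(num_heads, state_dim)
+     state, out = update(state, k, v, q)
+ 
+ print(state.shape)  # torch.Size([16, 128, 256]) -- unchanged, regardless of stream length
+ ```
+ 
+ However many tokens are streamed, the state tensor keeps the same shape, which is why memory usage and per-token latency stay flat.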
99
+ ## Training Strategy
100
+
101
+ To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.
102
+
103
+ <div align="center">
104
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
105
+ </div>
106
+
107
+ ### Stage 1: Distillation Pretraining (Efficient Initialization)
108
+ * **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
109
+ * **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use **Layer-wise MSE Loss** (to align internal states) and **End-to-End KL Divergence** (to align output logits), as sketched below.
110
+ * **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
111
+
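+ The two objectives could be combined roughly as follows (a sketch only; the weighting, reduction, and layer mapping are our own illustrative choices, not the paper's exact recipe):
+ 
+ ```python
+ import torch.nn.functional as F
+ 
+ def distillation_loss(student_hiddens, teacher_hiddens,
+                       student_logits, teacher_logits,
+                       mse_weight=1.0, kl_weight=1.0):
+     # Layer-wise MSE: align each student layer's hidden states with the teacher's.
+     mse = sum(F.mse_loss(s, t) for s, t in zip(student_hiddens, teacher_hiddens))
+     mse = mse / max(len(student_hiddens), 1)
+     # End-to-end KL divergence: align the student's output distribution with the teacher's.
+     kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
+                   F.softmax(teacher_logits, dim=-1),
+                   reduction="batchmean")
+     return mse_weight * mse + kl_weight * kl
+ ```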
112
+ ### Stage 2: Instruction SFT (General Capabilities)
113
+ * **Goal:** Unlock strong instruction-following and reasoning capabilities.
114
+ * **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
115
+ * **Settings:** Image resolution increased to **1344×1344**; max context length set to 8,192 (see the illustrative snippet after this list).
116
+ * **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.
117
+
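+ At inference time, a comparable resolution ceiling can be requested through the processor's pixel budget. The snippet below is a hypothetical example that assumes the InfiniteVL processor follows the Qwen2.5-VL `max_pixels` convention; the model path is a placeholder.
+ 
+ ```python
+ from transformers import AutoProcessor
+ 
+ # Hypothetical: cap each image at roughly 1344x1344 pixels via the pixel budget.
+ processor = AutoProcessor.from_pretrained(
+     "/path/to/InfiniteVL-4B",
+     max_pixels=1344 * 1344,
+ )
+ ```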
118
+ ### Stage 3: Long-Sequence SFT (Context Extension)
119
+ * **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
120
+ * **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
121
+ * **Method:** **LoRA** fine-tuning with context length extended to **32,768** (an illustrative sketch follows this list).
122
+ * **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.
123
+
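+ For illustration, a LoRA setup of this kind might look like the following with the `peft` library (the rank, alpha, dropout, and target module names are hypothetical, not the values used to train InfiniteVL; the checkpoint path is a placeholder):
+ 
+ ```python
+ from transformers import AutoModelForCausalLM
+ from peft import LoraConfig, get_peft_model
+ 
+ # Hypothetical example: wrap the Stage 2 checkpoint with LoRA adapters.
+ model = AutoModelForCausalLM.from_pretrained("/path/to/stage2-checkpoint", trust_remote_code=True)
+ lora_config = LoraConfig(
+     r=16,                     # illustrative rank
+     lora_alpha=32,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # hypothetical module names
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()  # only the adapters are updated in Stage 3
+ ```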
124
+ ## Performance
125
+
126
+ ### 🚀 Efficiency & Streaming
127
+
128
+ **InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models where cost grows linearly with history, InfiniteVL maintains **constant** computational cost and memory usage.
129
+
130
+ > **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.
131
+
132
+ <div align="center">
133
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
134
+ <br>
135
+ <em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
136
+ </div>
137
+
138
+ ### 🏆 Multimodal Benchmarks
139
+
140
+ InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **Hybrid Architecture** and **high-quality training strategy**, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.
141
+
142
+ <div align="center">
143
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
144
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
145
+ <br>
146
+ <em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.</em>
147
+ </div>
148
+ <br>
149
+
150
+ **Key Takeaways:**
151
+ * **Best-in-Class Linear Model:** Outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
152
+ * **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.
153
+
154
  ## Model Zoo
155
 
156
  We release two versions of InfiniteVL-4B to cater to different application scenarios.
 
339
  ```
340
  </details>
341
 
342
+ ## 🚀 Advanced Usage: CUDA Graph Acceleration
343
+
344
+ Unlike Transformer-based VLMs where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization.
345
+
346
+ This is the key technology behind our **24 FPS** real-time streaming performance.
347
+
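+ As a rough illustration of why the fixed-size state makes this possible, the snippet below captures a toy step with a static state and a static input buffer into a CUDA Graph and then replays it. The `model_step` function and all shapes are placeholders, not the released streaming implementation, and a CUDA-capable GPU is required.
+ 
+ ```python
+ import torch
+ 
+ device = "cuda"
+ state = torch.zeros(16, 128, 256, device=device)   # constant-size memory state
+ frame = torch.zeros(1, 64, 256, device=device)     # static input buffer
+ 
+ def model_step(frame, state):
+     # Placeholder for one streaming-prefill step: fold the new frame into the state in place.
+     state.mul_(0.99).add_(0.01 * frame.sum())
+ 
+ # Warm up on a side stream (required before capture), then record the step once.
+ s = torch.cuda.Stream()
+ s.wait_stream(torch.cuda.current_stream())
+ with torch.cuda.stream(s):
+     for _ in range(3):
+         model_step(frame, state)
+ torch.cuda.current_stream().wait_stream(s)
+ 
+ graph = torch.cuda.CUDAGraph()
+ with torch.cuda.graph(graph):
+     model_step(frame, state)
+ 
+ # Replay: copy each incoming frame into the static buffer and relaunch the whole graph.
+ for _ in range(100):
+     frame.copy_(torch.randn(1, 64, 256, device=device))
+     graph.replay()
+ ```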
348
+ ### ⚡ Accelerated Streaming Inference
349
+
350
+ Because the memory state never changes size, each streaming-prefill step has static tensor shapes, so the whole step can be captured once as a CUDA Graph and replayed for every incoming frame with minimal kernel launch overhead.
351
+
352
+ We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate this capability.
353
+
354
+ > **🎥 Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame by frame. It treats the video as a continuous data stream, updating the global linear memory state on the fly without retraining.
355
+ >
356
+ > **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.
357
+
358
+ #### 1. Run the Simulation Demo
359
+ ```bash
360
+ # Make sure you are in the project root
361
+ python examples/demo_streaming_inference.py \
362
+ --model_path /path/to/InfiniteVL-4B \
363
+ --video_path assets/demo.mp4 \
364
+ --fps 30
365
+ ```
366
+
367
+ ### ⚡ Accelerated Decode
368
+
369
+ In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.
370
+
371
+ > 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!
372
+
373
+ ## Qualitative Analysis & Visualization
374
+
375
+ We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.
376
+
377
+ ### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)
378
+ InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.
379
+
380
+ <div align="center">
381
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
382
+ </div>
383
+
384
+ ### 2. Long-Term Streaming Understanding
385
+ The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.
386
+
387
+ The examples below show InfiniteVL processing a continuous street-view video stream. The model maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details such as the "NBC Studios" text or the color of a pedestrian's bag without forgetting.
388
+
389
+ <div align="center">
390
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
391
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
392
+ </div>
393
 
394
+ ## Contact
395
+ If you have any questions, please contact Hongyuan Tao via email (hongyuantao@hust.edu.cn).
396
 
397
  ## Citation
398
 
 
413
 
414
  * **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
415
  * **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
416
+ * **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.