nielsr (HF Staff) committed
Commit 874cdd1 · verified · 1 Parent(s): 4ed751c

Improve model card: Update arXiv link and add comprehensive details from GitHub

Hello team,

I've opened this PR to enhance the model card for the InfiniteVL model. The updates aim to provide more comprehensive and accurate information for users.

Key changes include:
- **Corrected arXiv paper link**: The previous placeholder `https://arxiv.org/abs/2502.xxxxx` has been updated to the correct link `https://arxiv.org/abs/2512.08829`, consistent with the paper info and GitHub repository.
- **Added Hugging Face badge**: A Hugging Face badge has been included in the header to improve navigation and visibility of the model on the Hub.
- **Enriched content from GitHub README**: Several crucial sections from the project's GitHub README have been integrated into the model card, including:
  - `News`
  - `Table of Contents` (updated to reflect the model card structure)
  - `Architecture` (including relevant images with absolute GitHub URLs)
  - `Training Strategy` (including relevant images with absolute GitHub URLs)
  - `Performance` (including relevant images with absolute GitHub URLs)
  - `Qualitative Analysis & Visualization` (including relevant images with absolute GitHub URLs)
  - `Contact`
- **Detailed Advanced Usage**: The brief "Advanced Usage (Cuda Graph)" section has been replaced with the more detailed explanation and code snippets from the GitHub README.
- **Refined Metadata Tags**: Removed the redundant `image-text-to-text` tag from the `tags` list, as it is already present in `pipeline_tag`.

All relative image paths from the GitHub README have been converted to absolute `https://github.com/hustvl/InfiniteVL/raw/main/assets/...` URLs.
These changes ensure the model card is more informative, accurate, and easier to navigate for anyone exploring InfiniteVL.

Files changed (1)
  1. README.md +154 -7
README.md CHANGED
@@ -1,14 +1,13 @@
  ---
- license: apache-2.0
  library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
  tags:
  - vision-language-model
- - image-text-to-text
  - linear-attention
  - gated-deltanet
  - infinitevl
  - multimodal
- pipeline_tag: image-text-to-text
  ---

  <div align="center">
@@ -33,8 +32,9 @@ Haoran Yin<sup>2</sup>,
  (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>

  <br>
- <a href="https://arxiv.org/abs/2502.xxxxx"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
+ <a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
  <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
+ <a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>

  </div>

@@ -55,6 +55,102 @@ By synergizing **Sliding Window Attention (SWA)** for fine-grained local percept
  * 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
  * 🏆 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) on comprehensive aspects.

+ ## News
+ * `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check [Model Zoo](#model-zoo).
+ * `Dec. 10th, 2025`: We release our paper on [Arxiv](https://arxiv.org/abs/2512.08829).
+
+ ## Table of Contents
+
+ * [Introduction](#introduction)
+ * [Key Highlights](#key-highlights)
+ * [News](#news)
+ * [Architecture](#architecture)
+ * [Training Strategy](#training-strategy)
+ * [Performance](#performance)
+ * [Model Zoo](#model-zoo)
+ * [Getting Started](#getting-started)
+ * [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
+ * [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
+ * [Contact](#contact)
+ * [Citation](#citation)
+ * [Acknowledgement](#acknowledgement)
+
+ ## Architecture
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
+ </div>
+ <br>
+
+ **InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.
+
+ ### Key Design Highlights
+
+ * **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave:
+   * **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
+   * **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.
+
+ * **Constant Memory Footprint**: Unlike traditional Transformers where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams.
+
+ * **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds":
+   * Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
+   * Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
+
+ ## Training Strategy
+
+ To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
+ </div>
+
+ ### Stage 1: Distillation Pretraining (Efficient Initialization)
+ * **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
+ * **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use **Layer-wise MSE Loss** (to align internal states) and **End-to-End KL Divergence** (to align output logits).
+ * **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
+
+ ### Stage 2: Instruction SFT (General Capabilities)
+ * **Goal:** Unlock strong instruction-following and reasoning capabilities.
+ * **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
+ * **Settings:** Image resolution increased to **1344×1344**; max context length set to 8,192.
+ * **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.
+
+ ### Stage 3: Long-Sequence SFT (Context Extension)
+ * **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
+ * **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
+ * **Method:** **LoRA** fine-tuning with context length extended to **32,768**.
+ * **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.
+
+ ## Performance
+
+ ### 🚀 Efficiency & Streaming
+
+ **InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models where cost grows linearly with history, InfiniteVL maintains **constant** computational cost and memory usage.
+
+ > **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
+ <br>
+ <em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
+ </div>
+
+ ### 🏆 Multimodal Benchmarks
+
+ InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **Hybrid Architecture** and **High-quality training strategies**, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
+ <br>
+ <em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, reasoning-centric multimodal benchmarks.</em>
+ </div>
+ <br>
+
+ **Key Takeaways:**
+ * **Best-in-Class Linear Model:** Significantly outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
+ * **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.
+
  ## Model Zoo

  We release two versions of InfiniteVL-4B to cater to different application scenarios.
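
The added Architecture section claims a constant memory footprint: 9 hybrid blocks, each interleaving 1 Sliding Window Attention layer with 3 Gated DeltaNet layers whose state has a fixed size. A back-of-the-envelope sketch of that cache behavior, with toy head dimensions and a hypothetical window length (not the InfiniteVL implementation), looks like this:

```python
from math import prod

NUM_BLOCKS = 9                                    # 9 hybrid blocks, per the Architecture section
LAYERS_PER_BLOCK = ("swa", "deltanet", "deltanet", "deltanet")  # 1 SWA + 3 Gated DeltaNet
STATE_SHAPE = (16, 128, 256)                      # example fixed-size memory state quoted above
WINDOW = 1024                                     # hypothetical sliding-window length
HEADS, HEAD_DIM = 16, 128                         # toy attention dimensions

def cache_elements(num_tokens: int) -> dict:
    """Elements cached after num_tokens of input: SWA layers keep at most WINDOW tokens
    of K/V, while each Gated DeltaNet layer keeps one constant-size state tensor."""
    swa_layers = NUM_BLOCKS * LAYERS_PER_BLOCK.count("swa")
    linear_layers = NUM_BLOCKS * LAYERS_PER_BLOCK.count("deltanet")
    swa_kv = swa_layers * 2 * min(num_tokens, WINDOW) * HEADS * HEAD_DIM   # bounded by the window
    linear_state = linear_layers * prod(STATE_SHAPE)                       # independent of length
    return {"swa_kv": swa_kv, "deltanet_state": linear_state}

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> {cache_elements(n)}")
# The SWA share saturates once num_tokens exceeds WINDOW, and the DeltaNet share never grows;
# a full-attention KV cache would instead scale as 2 * num_tokens * HEADS * HEAD_DIM per layer.
```
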
@@ -243,9 +339,60 @@ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
  ```
  </details>

- ## 🎥 Advanced Usage (Cuda Graph)
+ ## 🚀 Advanced Usage: CUDA Graph Acceleration
+
+ Unlike Transformer-based VLMs where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization.
+
+ This is the key technology behind our **24 FPS** real-time streaming performance.
+
+ ### ⚡ Accelerated Streaming Inference
+
+ Unlike Transformer-based VLMs where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for streaming prefill, eliminating kernel launch overheads.
+
+ We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate this capability.
+
+ > **🎥 Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
+ >
+ > **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.
+
+ #### 1. Run the Simulation Demo
+ ```bash
+ # Make sure you are in the project root
+ python examples/demo_streaming_inference.py \
+     --model_path /path/to/InfiniteVL-4B \
+     --video_path assets/demo.mp4 \
+     --fps 30
+ ```
+
+ ### ⚡ Accelerated Decode
+
+ In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.
+
+ > 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!
+
+ ## Qualitative Analysis & Visualization
+
+ We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.
+
+ ### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)
+ InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
+ </div>
+
+ ### 2. Long-Term Streaming Understanding
+ The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.
+
+ The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details like "NBC Studios" text or the color of a pedestrian's bag without forgetting.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
+ <img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
+ </div>

- Please refer to the guideline in the [github page](https://github.com/hustvl/InfiniteVL).
+ ## Contact
+ If you have any questions, please contact Hongyuan Tao via email (hongyuantao@hust.edu.cn).

  ## Citation

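
The CUDA Graph argument above rests on the memory state having a fixed shape, so every streaming step launches an identical kernel sequence that can be recorded once and replayed. A minimal, generic PyTorch capture-and-replay sketch of that pattern (a stand-in module with static buffers, not the `infinitevl/infinitevl_streaming` implementation):

```python
import torch

assert torch.cuda.is_available()

step = torch.nn.Linear(4096, 4096).cuda().eval()   # stand-in for one fixed-shape streaming step
static_in = torch.zeros(1, 4096, device="cuda")     # static buffers: shapes never change,
static_out = torch.zeros(1, 4096, device="cuda")    # which is what makes graph capture possible

with torch.no_grad():
    # Warm up on a side stream before capture (as recommended in the PyTorch CUDA Graphs docs)
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_out.copy_(step(static_in))
    torch.cuda.current_stream().wait_stream(s)

    # Record the whole step once
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out.copy_(step(static_in))

    # Steady-state streaming loop: refill the static input, replay the recorded kernels
    for _ in range(10):
        static_in.copy_(torch.randn(1, 4096, device="cuda"))
        graph.replay()                               # no per-kernel launch overhead
        # static_out now holds this step's result
```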
 
@@ -266,4 +413,4 @@ InfiniteVL is built upon the giants of the open-source community. We would like

  * **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
  * **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
- * **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.
+ * **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.
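
Stage 1 of the Training Strategy added above pairs a layer-wise MSE on hidden states with an end-to-end KL divergence on output logits. A generic PyTorch sketch of that combined objective, with made-up shapes and weights rather than the project's training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens, teacher_hiddens, student_logits, teacher_logits,
                      mse_weight=1.0, kl_weight=1.0, temperature=1.0):
    # Layer-wise MSE: align each student layer's hidden states with the frozen teacher's
    mse = sum(F.mse_loss(s, t) for s, t in zip(student_hiddens, teacher_hiddens))
    # End-to-end KL: align the student's output distribution with the teacher's
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return mse_weight * mse + kl_weight * kl

# Toy usage with random tensors (batch=2, seq=8, hidden=64, vocab=100)
hs_student = [torch.randn(2, 8, 64) for _ in range(4)]
hs_teacher = [torch.randn(2, 8, 64) for _ in range(4)]
logits_student, logits_teacher = torch.randn(2, 8, 100), torch.randn(2, 8, 100)
print(distillation_loss(hs_student, hs_teacher, logits_student, logits_teacher))
```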