Improve model card: Update arXiv link and add comprehensive details from GitHub
#1
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,14 +1,13 @@
 ---
-license: apache-2.0
 library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - vision-language-model
-- image-text-to-text
 - linear-attention
 - gated-deltanet
 - infinitevl
 - multimodal
-pipeline_tag: image-text-to-text
 ---
 
 <div align="center">

@@ -33,8 +32,9 @@ Haoran Yin<sup>2</sup>,
 (✉️) corresponding author: <a href="mailto:xgwang@hust.edu.cn">xgwang@hust.edu.cn</a>
 
 <br>
-<a href="https://arxiv.org/abs/
+<a href="https://arxiv.org/abs/2512.08829"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg" alt="arXiv"></a>
 <a href="https://github.com/hustvl/InfiniteVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a>
+<a href="https://huggingface.co/hustvl/InfiniteVL/"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
 
 </div>

@@ -55,6 +55,102 @@ By synergizing **Sliding Window Attention (SWA)** for fine-grained local perception
* 🧠 **Unlimited Context:** Effectively retains context over extremely long sequences (tested >500K tokens) without OOM errors.
* 📈 **Strong Performance:** Matches leading Transformer-based VLMs (e.g., Qwen2.5-VL-3B) and significantly outperforms previous linear VLMs (e.g., VL-Mamba, Cobra) across a comprehensive range of benchmarks.

## News

* `Dec. 10th, 2025`: We release the **InfiniteVL** model weights and inference code! Please check [Model Zoo](#model-zoo).
* `Dec. 10th, 2025`: We release our paper on [arXiv](https://arxiv.org/abs/2512.08829).
## Table of Contents

* [Introduction](#introduction)
* [Key Highlights](#key-highlights)
* [News](#news)
* [Architecture](#architecture)
* [Training Strategy](#training-strategy)
* [Performance](#performance)
* [Model Zoo](#model-zoo)
* [Getting Started](#getting-started)
* [Advanced Usage: CUDA Graph Acceleration](#advanced-usage-cuda-graph-acceleration)
* [Qualitative Analysis & Visualization](#qualitative-analysis--visualization)
* [Contact](#contact)
* [Citation](#citation)
* [Acknowledgement](#acknowledgement)
## Architecture

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/architecture.png" alt="InfiniteVL Architecture" width="50%">
</div>
<br>

**InfiniteVL** adopts a hybrid architecture that synergizes the efficiency of linear attention with the precision of window-based attention. The model comprises a **Vision Encoder** (adapted from Qwen2.5-VL), a **Projection MLP**, and a **Decoder-only LLM Backbone**.

### Key Design Highlights

* **Hybrid Block Design**: The LLM backbone consists of **9 Hybrid Blocks**. Within each block, we strategically interleave:
  * **1 Sliding Window Attention (SWA) Layer**: Responsible for capturing high-resolution local context and fine-grained visual details.
  * **3 Gated DeltaNet Layers**: Responsible for modeling long-range global dependencies with linear complexity.
* **Constant Memory Footprint**: Unlike traditional Transformers, where the Key-Value (KV) cache grows linearly with sequence length ($O(N)$), the **Gated DeltaNet** layers compress history into a fixed-size memory state (e.g., $16 \times 128 \times 256$). This enables **constant memory usage** and constant inference latency, even when processing unlimited input streams.
* **Seamless Integration**: By combining SWA and Gated DeltaNet, InfiniteVL achieves the "best of both worlds" (see the sketch after this list):
  * Local attention ensures high performance on information-intensive tasks (e.g., OCR, Document Understanding).
  * Linear attention ensures efficiency and stability for long-context scenarios (e.g., Streaming Video Understanding).
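
To make the hybrid layout concrete, here is a minimal PyTorch-style sketch of one such block. The class and layer interfaces are illustrative assumptions, not the released implementation; only the 1-SWA-plus-3-DeltaNet interleaving and the fixed-size recurrent state come from the description above.

```python
import torch.nn as nn

class HybridBlock(nn.Module):
    """Sketch of one hybrid block: 1 SWA layer followed by 3 Gated DeltaNet layers.

    `swa_layer` and the modules in `deltanet_layers` are hypothetical
    stand-ins for the real layers (e.g., from flash-linear-attention);
    only the 1:3 interleaving ratio is taken from the model card.
    """

    def __init__(self, swa_layer: nn.Module, deltanet_layers: nn.ModuleList):
        super().__init__()
        self.swa = swa_layer              # local, fine-grained perception
        self.deltanet = deltanet_layers   # global, linear-complexity memory

    def forward(self, hidden, states):
        # SWA attends only within a sliding window, so its cache is bounded.
        hidden = self.swa(hidden)
        # Each DeltaNet layer folds new tokens into a fixed-size state
        # (e.g., 16 x 128 x 256): memory never grows with stream length.
        new_states = []
        for layer, state in zip(self.deltanet, states):
            hidden, state = layer(hidden, state)
            new_states.append(state)
        return hidden, new_states
```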
## Training Strategy

To achieve strong multimodal performance with minimal training resources, InfiniteVL employs a **three-stage progressive training strategy**. This approach allows our linear-complexity model to inherit the vast knowledge of a Transformer teacher before adapting to long-context scenarios.

<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/training_strategy.png" alt="Training Pipeline" width="90%">
</div>
### Stage 1: Distillation Pretraining (Efficient Initialization)

* **Goal:** Rapidly transfer knowledge from the **Qwen2.5-VL** teacher to the InfiniteVL student.
* **Method:** We replace the teacher's attention layers with **Gated DeltaNet** while keeping other parameters frozen. We use a **layer-wise MSE loss** (to align internal states) and an **end-to-end KL divergence** (to align output logits); a sketch of this combined objective follows the list below.
* **Significance:** This bypasses the difficulty of training linear attention from scratch, ensuring a robust initialization.
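
As a rough illustration, a combined objective of this shape could look as follows in PyTorch; the loss weights and temperature `tau` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_hiddens, teacher_hiddens,
                 student_logits, teacher_logits,
                 mse_weight=1.0, kl_weight=1.0, tau=1.0):
    # Layer-wise MSE: align internal states of matched layers.
    mse = torch.stack([
        F.mse_loss(s, t.detach())
        for s, t in zip(student_hiddens, teacher_hiddens)
    ]).mean()
    # End-to-end KL divergence: align output distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return mse_weight * mse + kl_weight * kl
```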
### Stage 2: Instruction SFT (General Capabilities)

* **Goal:** Unlock strong instruction-following and reasoning capabilities.
* **Data:** **~8M** diverse multimodal instruction pairs, covering General VQA, OCR, Mathematics, and Code.
* **Settings:** Image resolution increased to **1344×1344**; max context length set to 8,192.
* **Outcome:** Produces the **Stage 2 Model**, which offers the best performance on standard benchmarks.
### Stage 3: Long-Sequence SFT (Context Extension)

* **Goal:** Activate the architecture's potential for **unlimited-length processing** and streaming.
* **Data:** A mixture of Stage 2 data (800K) and **~200K long-sequence samples** (e.g., long videos, multi-page documents).
* **Method:** **LoRA** fine-tuning with context length extended to **32,768** (see the sketch after this list).
* **Outcome:** Produces the **Stage 3 Model**, enabling length extrapolation and stable streaming inference.
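
For orientation, here is a minimal sketch of LoRA fine-tuning with the Hugging Face `peft` library; the rank, alpha, dropout, target modules, and checkpoint path below are illustrative assumptions, as the card only states that LoRA is used with a 32,768-token context.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hyperparameters here are illustrative, not the released configuration.
base = AutoModelForCausalLM.from_pretrained("path/to/stage2-checkpoint")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters train
# ...then run SFT with sequences packed up to 32,768 tokens.
```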
## Performance

### 🚀 Efficiency & Streaming

**InfiniteVL** is engineered for unlimited-input scenarios. Unlike Transformer-based models, whose cost grows linearly with history, InfiniteVL maintains **constant** computational cost and memory usage (see the back-of-the-envelope sketch below).

> **Hardware Setup:** All efficiency results are measured on a single NVIDIA RTX 4090 GPU.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/plot_line.png" width="80%" alt="Efficiency Comparison">
<br>
<em>Figure 1: Comparison of streaming FPS and latency. InfiniteVL sustains real-time performance while Transformer baselines degrade rapidly.</em>
</div>
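
To see why the gap widens, here is a back-of-the-envelope comparison of a growing KV cache against a fixed recurrent state. All layer, head, and dimension counts are illustrative assumptions for a model of this scale; only the 16 × 128 × 256 state shape echoes the architecture notes above.

```python
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=4, head_dim=128, fp_bytes=2):
    # Transformer KV cache: keys + values for every past token -> O(N).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * fp_bytes

def deltanet_state_bytes(n_layers=27, n_heads=16, key_dim=128, value_dim=256, fp_bytes=2):
    # Gated DeltaNet: one fixed-size state per layer (16 x 128 x 256) -> O(1).
    return n_layers * n_heads * key_dim * value_dim * fp_bytes

for n in (8_192, 131_072, 524_288):  # up to the >500K-token regime
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / 2**30:6.2f} GiB"
          f" vs. fixed state {deltanet_state_bytes() / 2**30:.3f} GiB")
```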
### 📊 Multimodal Benchmarks

InfiniteVL achieves state-of-the-art performance among linear-complexity VLMs. Crucially, thanks to our **hybrid architecture** and **high-quality training strategies**, it overcomes the traditional weakness of linear models in information-intensive tasks (e.g., OCR, Document Understanding), achieving results comparable to top-tier Transformer VLMs.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance1.png" width="100%" alt="Performance Comparison">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/performance2.png" width="100%" alt="Performance Comparison">
<br>
<em>Figure 2: Comparison of InfiniteVL with existing VLMs on public multimodal understanding, real-world comprehension, text-rich, and reasoning-centric multimodal benchmarks.</em>
</div>
<br>

**Key Takeaways:**

* **Best-in-Class Linear Model:** Significantly outperforms previous linear VLMs (Cobra, MaTVLM) by large margins (+40-60 points on DocVQA/OCRBench).
* **Transformer-Level Quality:** Matches the performance of Qwen2.5-VL-3B on complex reasoning and text-rich tasks while being significantly faster in long contexts.
## Model Zoo

We release two versions of InfiniteVL-4B to cater to different application scenarios.

@@ -243,9 +339,60 @@ print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]
```
</details>
## 🚀 Advanced Usage: CUDA Graph Acceleration

Unlike Transformer-based VLMs, where the KV cache grows dynamically, **InfiniteVL maintains a constant-size memory state**. This unique property allows us to use **CUDA Graphs** to capture the entire computation graph for both streaming prefill and decoding, eliminating kernel launch overheads and maximizing GPU utilization (a generic capture-and-replay sketch follows below).

This is the key technology behind our **24 FPS** real-time streaming performance.
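
As a generic illustration of the capture-and-replay pattern that fixed shapes and a constant-size state make possible, here is a minimal PyTorch CUDA Graphs sketch; the stand-in module and buffer shapes are placeholders, not InfiniteVL's actual API.

```python
import torch
import torch.nn as nn

# Stand-in module with fixed-shape input; not InfiniteVL's actual API.
model = nn.Linear(4096, 4096, device="cuda")
static_in = torch.zeros(256, 4096, device="cuda")   # reused input buffer

# Warm up once before capture so lazy initialization happens eagerly.
static_out = model(static_in)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)   # recorded into the graph, not executed

def step(new_chunk: torch.Tensor) -> torch.Tensor:
    # Steady state: refill the static buffer, then replay the whole graph
    # with a single launch -- no per-kernel Python overhead.
    static_in.copy_(new_chunk)
    graph.replay()
    return static_out.clone()
```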
### ⚡ Accelerated Streaming Inference

Because the memory state is constant-size, the entire streaming-prefill computation graph can be captured once and replayed for every incoming chunk, eliminating kernel launch overheads.

We provide a complete script in [`examples/demo_streaming_inference.py`](examples/demo_streaming_inference.py) to demonstrate this capability.

> **🎥 Simulation Note:** This script **simulates a real-time streaming scenario** by reading a local video file frame-by-frame. It treats the video as a continuous data stream, updating the global linear memory state on-the-fly without retraining.
>
> **⚠️ Requirement:** This demo relies on the specialized model implementation (supporting `StaticCachePrealloc` and CUDA Graphs) located in the **[`infinitevl/infinitevl_streaming`](infinitevl/infinitevl_streaming)** directory. Please ensure your environment is set up correctly to import these modules.
#### 1. Run the Simulation Demo

```bash
# Make sure you are in the project root
python examples/demo_streaming_inference.py \
    --model_path /path/to/InfiniteVL-4B \
    --video_path assets/demo.mp4 \
    --fps 30
```
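
Under the hood, a simulation of this kind boils down to a loop like the following; `load_streaming_model`, `stream_prefill`, and `generate_answer` are hypothetical placeholders sketching the flow, not the demo script's real entry points.

```python
import cv2  # pip install opencv-python

# Hypothetical sketch of the frame-by-frame streaming loop; function
# names are placeholders, not the real API of the demo script.
model = load_streaming_model("/path/to/InfiniteVL-4B")  # hypothetical loader
cap = cv2.VideoCapture("assets/demo.mp4")
state = None  # constant-size linear memory state, updated in place

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each frame is encoded to vision tokens and folded into the state;
    # memory does not grow, so the stream can run indefinitely.
    state = model.stream_prefill(frame, state)

print(model.generate_answer("What did the storefront sign say?", state))
cap.release()
```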
### ⚡ Accelerated Decode

In addition to streaming prefill, InfiniteVL natively supports **CUDA Graph-accelerated decoding**. By capturing the decoding step into a static graph, we can achieve extremely low-latency token generation, further enhancing the responsiveness of real-time interactions.

> 🚧 **Coming Soon:** The code for accelerated decoding is currently being refactored and cleaned up. We are working hard to release it as soon as possible. Please stay tuned!
## Qualitative Analysis & Visualization

We provide visualization cases to demonstrate InfiniteVL's robust performance across diverse scenarios, ranging from information-intensive static tasks to ultra-long streaming video understanding.

### 1. Fundamental Visual-Language Capabilities (OCR & Reasoning)

InfiniteVL effectively overcomes the traditional limitations of linear attention in detailed visual perception. By combining Sliding Window Attention with Gated DeltaNet, it excels at **Dense Text Recognition (OCR), Chart Interpretation, and Complex Scene Description**, delivering performance comparable to full-attention Transformers.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/image_case1_01.png" width="80%" alt="Fundamental Capabilities">
</div>
### 2. Long-Term Streaming Understanding

The core strength of InfiniteVL lies in its ability to maintain coherent memory over **unlimited input streams**.

The examples below demonstrate a continuous street-view video stream. InfiniteVL maintains a constant memory state and accurately answers questions at various timestamps (e.g., Frame 3100, ~1M tokens processed), recalling specific details such as the "NBC Studios" text or the color of a pedestrian's bag without forgetting.
<div align="center">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case1_01.png" width="80%" alt="Streaming Capabilities">
<img src="https://github.com/hustvl/InfiniteVL/raw/main/assets/streaming_case2_01.png" width="80%" alt="Streaming Capabilities">
</div>
## Contact

If you have any questions, please contact Hongyuan Tao via email (hongyuantao@hust.edu.cn).

## Citation

@@ -266,4 +413,4 @@ InfiniteVL is built upon the giants of the open-source community. We would like

* **[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)**: For providing a powerful vision-language codebase and vision encoder.
* **[Gated DeltaNet](https://github.com/sustcsonglin/flash-linear-attention)**: For the efficient linear attention mechanism and CUDA kernel implementations (FLA).
* **Open-Source Datasets**: We sincerely thank the creators of the high-quality datasets used in our training, including **FineVision, LLaVA-OneVision, PixMo, The Cauldron, Docmatix, LLaVA-Video**, and others. Their contributions are essential to the development of efficient multimodal models.