---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
---

# Video-XL-2
[\[📰 Blog\]](https://unabletousegit.github.io/video-xl2.github.io/) [\[📂 GitHub\]](https://github.com/VectorSpaceLab/Video-XL) [\[📜 Tech Report (coming soon)\]]()


## How to use the model
Video-XL-2 provides two efficiency optimization strategies: chunk-based prefill and bi-level KVs decoding. You can flexibly enable them based on your needs.

TODO
- [X] Release model weights.
- [X] Release the inference code w/o. efficiency optimization.
- [X] Release the inference code w. chunk-based prefill.
- [ ] Release the inference code w. chunk-based prefill & bi-level kvs decoding.

*Tip: Our inference code is still being updated. You can pass `--include '*.py'` to `huggingface-cli download` to refresh only the inference code instead of re-downloading the whole model.*
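
For example, assuming the weights were already downloaded with `huggingface-cli`, a command along these lines pulls only the Python files (the repo id below is a placeholder; replace it and the local directory with your own):

```bash
# Refresh only the *.py inference code of an already-downloaded checkpoint.
huggingface-cli download <repo-id> --include "*.py" --local-dir /root/Models/Video-XL-2
```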

---
### 0. Installing Required Packages
```bash
pip install transformers==4.43.0
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install decord
pip install einops
pip install opencv-python
pip install accelerate==0.30.0
pip install numpy==1.26.4
# optional
pip install flash-attn --no-build-isolation
```
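
As an optional sanity check after installation (a minimal sketch, nothing model-specific), you can confirm that the pinned packages import and that a CUDA device is visible:

```python
# Optional environment check: verify the pinned packages import and CUDA is visible.
import torch, transformers, numpy, cv2, decord, einops

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("numpy:", numpy.__version__, "| opencv:", cv2.__version__)
print("CUDA available:", torch.cuda.is_available())
```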

---
### 1. Inference w/o. Efficiency Optimization
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# load the tokenizer and model (fp16, SDPA attention)
model_path = '/root/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device, quantization_config=None,
    attn_implementation="sdpa", torch_dtype=torch.float16, low_cpu_mem_usage=True,
)

gen_kwargs = {
    "do_sample": False,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 256
}

model.config.enable_sparse = False

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people are in the video? (A) 3 people (B) 6 people. Please respond with only the letter."

# sampling params
max_num_frames = 150
sample_fps = 1       # extract frames at 1 fps
max_sample_fps = 4

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1, chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )
    
print(response)
```

---
### 2. Inference w. Chunk-based Pre-filling
Chunk-based prefill significantly reduces memory demands and response latency by encoding video input in a streaming manner. This advantage becomes particularly noticeable with longer videos.

To enable this mode, you need to set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:
* **`chunk_prefill_mode`**: This defines the mode of chunk-based prefill. We currently support two modes:
    * **`streaming`**: This mode encodes video chunks one after another in a streaming fashion.
    * **`mask`**: This mode achieves an equivalent effect through an attention mask. However, because it lacks optimized underlying operators, the `mask` mode offers no efficiency gains at this time. We recommend using the `streaming` mode.
* **`chunk_size`**: This parameter specifies the size of each chunk processed in a single forward pass. The unit for `chunk_size` is **4 frames** (e.g., `chunk_size = 4` means processing visual tokens from **4×4 = 16 frames** at once); the short sketch after this list makes the frame bookkeeping concrete. A larger `chunk_size` moves the computation closer to full attention and increases peak memory usage.
* **`step_size`**: This controls the step size between chunks. A smaller `step_size` carries more continuous information across chunks but may slightly decrease inference speed.
* **`offload`**: This boolean parameter determines whether to offload the key-value states (KVs) of each chunk to the CPU during forwarding. This reduces GPU memory usage but also lowers inference speed.
* **`chunk_size_for_vision_tower`**: For longer video inputs, the vision tower can become a memory bottleneck during forwarding. To mitigate this, we also support a streaming mode for the vision tower, controlled by this parameter. The unit for `chunk_size_for_vision_tower` is **1 frame**, and its value must be **a multiple of 4**.
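
To make the frame bookkeeping above concrete, here is a small illustrative sketch (not part of the shipped code; the values simply mirror the example configuration below):

```python
# Illustrative helper: convert the prefill settings into frames covered per forward pass.
prefill_config = {
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,                   # unit = 4 frames -> 4 * 4 = 16 frames per prefill chunk
    'step_size': 1,
    'offload': True,
    'chunk_size_for_vision_tower': 24, # unit = 1 frame, must be a multiple of 4
}

frames_per_prefill_chunk = prefill_config['chunk_size'] * 4
frames_per_vision_tower_chunk = prefill_config['chunk_size_for_vision_tower']
assert frames_per_vision_tower_chunk % 4 == 0, "chunk_size_for_vision_tower must be a multiple of 4"

print(f"Prefill chunk covers {frames_per_prefill_chunk} frames")
print(f"Vision tower chunk covers {frames_per_vision_tower_chunk} frames")
```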

*Tip: Currently, chunk-based prefill only supports the `sdpa` attention implementation.*

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# track peak GPU memory so we can report it after inference
torch.cuda.reset_peak_memory_stats()

# load the tokenizer and model (fp16, SDPA attention)
model_path = '/root/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device, quantization_config=None,
    attn_implementation="sdpa", torch_dtype=torch.float16, low_cpu_mem_usage=True,
)

gen_kwargs = {"do_sample": False, "temperature": 0.01, "top_p": 0.001, "num_beams": 1, "use_cache": True, "max_new_tokens": 128}

model.config.enable_chunk_prefill = True
prefill_config = {
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,
    'step_size': 1,
    'offload': True,
    'chunk_size_for_vision_tower': 24,
}
model.config.prefill_config = prefill_config

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people are in the video? (A) 3 people (B) 6 people. Please respond with only the letter."

# params
max_num_frames = 1300
sample_fps = None  # uniform sampling
max_sample_fps = None

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1, chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

peak_memory_allocated = torch.cuda.max_memory_allocated()
print(f"Memory Peak: {peak_memory_allocated / (1024**3):.2f} GB")
print(response)
```

---
### 3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding
coming soon
```python

```



## ✏️ Citation

```bibtex
@article{shu2024video,
  title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
  author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2409.14485},
  year={2024}
}

@article{liu2025video,
  title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
  author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
  journal={arXiv preprint arXiv:2503.18478},
  year={2025}
}
```