---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
---
# Video-XL-2
[\[📰 Blog\]](https://unabletousegit.github.io/video-xl2.github.io/) [\[📂 GitHub\]](https://github.com/VectorSpaceLab/Video-XL) [\[📜 Tech Report (coming soon)\]]()
## How to use the model
Video-XL-2 provides two efficiency optimization strategies: chunk-based prefill and bi-level KVs decoding. You can enable them flexibly based on your needs.
TODO
- [X] Release model weights.
- [X] Release the inference code w/o. efficiency optimization.
- [X] Release the inference code w. chunk-based prefill.
- [ ] Release the inference code w. chunk-based prefill & bi-level kvs decoding.
*Tip: Our inference code is still being updated. You can pass `--include '*.py'` to `huggingface-cli download` to refresh only the inference code instead of re-downloading the whole model.*
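If you prefer the Python API, `huggingface_hub.snapshot_download` supports the same filtering through `allow_patterns`. A minimal sketch; the repo id and local directory below are placeholders, so substitute your own:
```python
from huggingface_hub import snapshot_download

# Refresh only the *.py inference files inside an existing local copy.
snapshot_download(
    repo_id="BAAI/Video-XL-2",            # assumption: replace with the actual repo id
    allow_patterns=["*.py"],              # download only the inference code
    local_dir="/root/Models/Video-XL-2",  # path used in the examples below
)
```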
---
### 0. Installing Required Packages
```bash
pip install transformers==4.43.0
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install decord
pip install einops
pip install opencv-python
pip install accelerate==0.30.0
pip install numpy==1.26.4
# optional
pip install flash-attn --no-build-isolation
```
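Before moving on, an optional sanity check that the pinned versions and the CUDA build are in place (a minimal sketch; the printed versions should match the installs above):
```python
import torch

print(torch.__version__)          # expect 2.1.2+cu121
print(torch.cuda.is_available())  # True if the CUDA build is usable

try:
    import flash_attn  # optional; only needed if you installed flash-attn
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed; the examples below use 'sdpa' anyway")
```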
---
### 1. Inference w/o. Efficiency Optimization
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# load model
model_path = '/root/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

gen_kwargs = {
    "do_sample": False,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 256,
}
model.config.enable_sparse = False

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people are in the video? (A) 3 people (B) 6 people. Please respond with only the letter."

# sampling params
max_num_frames = 150
sample_fps = 1  # extract frames at 1 fps
max_sample_fps = 4

with torch.inference_mode():
    response = model.chat(
        video_path,
        tokenizer,
        question1,
        chat_history=None,
        return_history=False,
        max_num_frames=max_num_frames,
        sample_fps=sample_fps,
        max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )
print(response)
```
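The `chat` helper also exposes `chat_history` and `return_history` for multi-turn use. The exact return shape with `return_history=True` is not documented here, so the following is only a hedged sketch: it assumes the call returns `(response, history)` and that the history object can be passed back unchanged.
```python
# Hypothetical multi-turn usage -- assumes model.chat returns (response, history)
# when return_history=True; verify against the shipped inference code.
with torch.inference_mode():
    response, history = model.chat(
        video_path, tokenizer, "Describe the video briefly.",
        chat_history=None, return_history=True,
        max_num_frames=max_num_frames, sample_fps=sample_fps,
        max_sample_fps=max_sample_fps, generation_config=gen_kwargs,
    )
    follow_up = model.chat(
        video_path, tokenizer, "What happens at the end?",
        chat_history=history, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps,
        max_sample_fps=max_sample_fps, generation_config=gen_kwargs,
    )
print(follow_up)
```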
---
### 2. Inference w. Chunk-based Pre-filling
Chunk-based prefill significantly reduces memory demands and response latency by encoding video input in a streaming manner. This advantage becomes particularly noticeable with longer videos.
To enable this mode, you need to set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:
* **`chunk_prefill_mode`**: Defines the chunk-based prefill mode. Two modes are currently supported:
    * **`streaming`**: Encodes video chunks in a streaming fashion.
    * **`mask`**: Achieves an equivalent effect with an attention mask. However, because the underlying optimized operators are missing, the `mask` mode offers no efficiency gains at this time; we recommend the `streaming` mode.
* **`chunk_size`**: The size of each chunk processed in a single forward pass, in units of **4 frames** (e.g., `chunk_size = 4` means processing visual tokens from **4×4 = 16 frames** at once). A larger `chunk_size` approaches full attention, at the cost of higher peak memory usage.
* **`step_size`**: The step size between chunks. A smaller `step_size` passes information between chunks more continuously but may slightly slow inference.
* **`offload`**: Whether to offload each chunk's key-value states (KVs) to the CPU during forwarding. This reduces memory usage but lowers inference speed.
* **`chunk_size_for_vision_tower`**: For long video inputs, the vision tower itself can become a memory bottleneck during forwarding. To mitigate this, we also support a streaming mode for the vision tower, controlled by this parameter. Its unit is **1 frame**, and its value must be **a multiple of 4** (the unit rules are made concrete in the sketch below).
*Tip: Currently, chunk-based prefill only supports the `sdpa` attention implementation.*
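To make the unit rules above concrete, here is a small hypothetical helper (not part of the shipped code) that checks a `prefill_config` and reports how many frames each forward pass covers:
```python
def check_prefill_config(config: dict) -> None:
    """Hypothetical validator illustrating the unit conventions above."""
    # chunk_size is measured in units of 4 frames
    frames_per_chunk = config["chunk_size"] * 4
    # chunk_size_for_vision_tower is measured in single frames
    vt_frames = config["chunk_size_for_vision_tower"]
    if vt_frames % 4 != 0:
        raise ValueError("chunk_size_for_vision_tower must be a multiple of 4")
    print(f"LLM prefill: {frames_per_chunk} frames per chunk")
    print(f"vision tower: {vt_frames} frames per forward pass")

# chunk_size=4 -> 4x4 = 16 frames per chunk; the vision tower sees 24 frames at a time
check_prefill_config({
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,
    'step_size': 1,
    'offload': True,
    'chunk_size_for_vision_tower': 24,
})
```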
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.cuda.reset_peak_memory_stats()

# load model
model_path = '/root/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",  # chunk-based prefill requires sdpa
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

gen_kwargs = {
    "do_sample": False,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 128,
}

# enable chunk-based prefill
model.config.enable_chunk_prefill = True
prefill_config = {
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,                    # in units of 4 frames: 16 frames per chunk
    'step_size': 1,
    'offload': True,                    # offload per-chunk KVs to the CPU
    'chunk_size_for_vision_tower': 24,  # in frames; must be a multiple of 4
}
model.config.prefill_config = prefill_config

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people are in the video? (A) 3 people (B) 6 people. Please respond with only the letter."

# sampling params
max_num_frames = 1300
sample_fps = None  # uniform sampling
max_sample_fps = None

with torch.inference_mode():
    response = model.chat(
        video_path,
        tokenizer,
        question1,
        chat_history=None,
        return_history=False,
        max_num_frames=max_num_frames,
        sample_fps=sample_fps,
        max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

peak_memory_allocated = torch.cuda.max_memory_allocated()
print(f"Memory Peak: {peak_memory_allocated / (1024**3):.2f} GB")
print(response)
```
---
### 3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding
Coming soon.
## ✏️ Citation
```bibtex
@article{shu2024video,
title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
journal={arXiv preprint arXiv:2409.14485},
year={2024}
}
@article{liu2025video,
title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
journal={arXiv preprint arXiv:2503.18478},
year={2025}
}
``` |