---
license: apache-2.0
tags:
- vision-language model
- llama
- video understanding
pipeline_tag: video-text-to-text
library_name: transformers
---

# Flash-VStream Model Card

This repository contains the Flash-VStream model presented in the paper [Flash-VStream: Efficient Real-Time Understanding for Long Video Streams](https://huggingface.co/papers/2506.23825).

<a href='https://zhang9302002.github.io/vstream-iccv-page/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href='https://huggingface.co/papers/2506.23825'><img src='https://img.shields.io/badge/Paper-HuggingFace-red'></a>
<a href='https://github.com/IVGSZ/Flash-VStream'><img src='https://img.shields.io/badge/Code-GitHub-blue.svg?logo=github'></a>

## Model details

We propose Flash-VStream, a video-language model that simulates the human memory mechanism. The model can process extremely long video streams in real time while simultaneously responding to user queries.

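To make the idea of answering queries while a stream is still arriving more concrete, here is a minimal toy sketch. It is not the Flash-VStream architecture: `BoundedFrameMemory`, `frame_handler`, the capacity of 8, and the drop-oldest eviction rule are all hypothetical stand-ins, and plain integers stand in for frame features. It only illustrates that a fixed-size memory lets a writer consume an arbitrarily long stream while a reader takes consistent snapshots at any time.

```python
import threading
from collections import deque


class BoundedFrameMemory:
    """Toy fixed-capacity memory (hypothetical): keeps only the newest items."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen evicts the oldest entry on overflow.
        self._items = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def write(self, frame_feature) -> None:
        """Called by the frame-processing side for every incoming frame."""
        with self._lock:
            self._items.append(frame_feature)

    def snapshot(self) -> list:
        """Called by the query side; returns a consistent copy of the memory."""
        with self._lock:
            return list(self._items)


def frame_handler(memory: BoundedFrameMemory, frames) -> None:
    """Consume the stream frame by frame (integers stand in for features)."""
    for frame in frames:
        memory.write(frame)


memory = BoundedFrameMemory(capacity=8)
writer = threading.Thread(target=frame_handler, args=(memory, range(1000)))
writer.start()
mid_stream = memory.snapshot()  # a query may arrive while frames are still coming in
writer.join()
final = memory.snapshot()       # after the stream ends, only the newest 8 items remain
```

Because the memory has a fixed size, the cost of a snapshot does not grow with the length of the stream. See the paper for how the actual model decides what to retain.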
## Training data

This model is trained on image data from the LLaVA-1.5 dataset and, following LLaMA-VID, on video data from the WebVid and ActivityNet datasets, including:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
- 232K video-caption pairs sampled from the WebVid 2.5M dataset.
- 98K videos from ActivityNet with QA pairs from Video-ChatGPT.

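As a rough sanity check on the size of the mixture, the listed figures can be tallied as below. This is only arithmetic over the nominal numbers in the list above; the grouping into a LLaVA-1.5 side and a video side, and the variable names, are assumptions made for illustration, and the number of samples actually seen in training may differ.

```python
# Nominal subset sizes in thousands, copied from the list above.
llava_k = {
    "LAION/CC/SBU image-text pairs": 558,
    "GPT-generated instruction data": 158,
    "academic VQA mixture": 450,
    "ShareGPT": 40,
}
video_k = {
    "WebVid video-caption pairs": 232,
    "ActivityNet videos with QA": 98,
}

llava_total_k = sum(llava_k.values())
video_total_k = sum(video_k.values())
total_k = llava_total_k + video_total_k
video_share = video_total_k / total_k  # fraction of listed entries that are video

print(f"LLaVA-1.5 side: {llava_total_k}K, video side: {video_total_k}K, total: {total_k}K")
print(f"video share: {video_share:.1%}")
```

With the listed figures this gives 1,206K entries on the LLaVA-1.5 side and 330K on the video side, so video accounts for roughly a fifth of the listed entries.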
## Sample Usage

You can load and use Flash-VStream with the `transformers` library.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The model can be loaded using multiple GPUs or offloaded to CPU if needed.
# This example assumes a GPU is available.
model_path = 'IVGSZ/Flash-VStream-7b'  # Replace with the actual model ID if different

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for efficient memory usage
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# For detailed instructions on image/video preprocessing and chat interactions,
# please refer to the official GitHub repository:
# https://github.com/IVGSZ/Flash-VStream
```

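Video preprocessing is documented in the GitHub repository and is not reproduced here. As a generic illustration of one step that video pipelines commonly need, the sketch below picks which frame indices to decode when downsampling a clip to a lower frame rate. The helper `sample_frame_indices` and the 1 FPS default are hypothetical and are not taken from the Flash-VStream code; check the repository for the sampling rate the model actually expects.

```python
import math
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float, target_fps: float = 1.0) -> List[int]:
    """Return evenly spaced frame indices that approximate `target_fps`.

    Hypothetical helper for illustration only. If the target rate is at
    least the native rate, every frame is kept.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, native_fps and target_fps must be positive")
    # Distance between kept frames, measured in native frames (never below 1).
    step = max(native_fps / target_fps, 1.0)
    count = math.ceil(num_frames / step)
    return [int(k * step) for k in range(count)]


# A 10-second clip recorded at 30 FPS, downsampled to 1 FPS.
indices = sample_frame_indices(num_frames=300, native_fps=30.0, target_fps=1.0)
print(indices)
```

The returned indices can then be passed to whichever video decoder you use, so that only the selected frames are loaded into memory.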
## License

This project is licensed under the [Apache-2.0 License](https://github.com/IVGSZ/Flash-VStream/blob/main/LICENSE).