Improve model card: Add pipeline tag, library, update links and license

#3
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +59 -27
README.md CHANGED
@@ -1,27 +1,59 @@
- ---
- license: llama2
- tags:
- - vision-language model
- - llama
- - video understanding
- ---
-
- # Flash-VStream Model Card
- <a href='https://invinciblewyq.github.io/vstream-page/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
- <a href='https://arxiv.org/abs/2406.08085v1'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
-
- ## Model details
- We proposed Flash-VStream, a video-language model that simulates the memory mechanism of human. Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously.
-
- ## Training data
- This model is trained based on image data from LLaVA-1.5 dataset, and video data from WebVid and ActivityNet datasets following LLaMA-VID, including
- - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- - 158K GPT-generated multimodal instruction-following data.
- - 450K academic-task-oriented VQA data mixture.
- - 40K ShareGPT data.
- - 232K video-caption pairs sampled from the WebVid 2.5M dataset.
- - 98K videos from ActivityNet with QA pairs from Video-ChatGPT.
-
- ## License
-
- This project is licensed under the [LLAMA 2 License](LICENSE).
+ ---
+ license: apache-2.0
+ tags:
+ - vision-language model
+ - llama
+ - video understanding
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ ---
+
+ # Flash-VStream Model Card
+
+ This repository contains the Flash-VStream model presented in the paper [Flash-VStream: Efficient Real-Time Understanding for Long Video Streams](https://huggingface.co/papers/2506.23825).
+
+ <a href='https://zhang9302002.github.io/vstream-iccv-page/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
+ <a href='https://huggingface.co/papers/2506.23825'><img src='https://img.shields.io/badge/Paper-HuggingFace-red'></a>
+ <a href='https://github.com/IVGSZ/Flash-VStream'><img src='https://img.shields.io/badge/Code-GitHub-blue.svg?logo=github'></a>
+
+ ## Model details
+ We proposed Flash-VStream, a video-language model that simulates the memory mechanism of human. Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously.
+
+ ## Training data
+ This model is trained based on image data from LLaVA-1.5 dataset, and video data from WebVid and ActivityNet datasets following LLaMA-VID, including
+ - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
+ - 158K GPT-generated multimodal instruction-following data.
+ - 450K academic-task-oriented VQA data mixture.
+ - 40K ShareGPT data.
+ - 232K video-caption pairs sampled from the WebVid 2.5M dataset.
+ - 98K videos from ActivityNet with QA pairs from Video-ChatGPT.
+
+ ## Sample Usage
+
+ You can load and use Flash-VStream with the `transformers` library.
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # The model can be loaded using multiple GPUs or offloaded to CPU if needed.
+ # This example assumes GPU is available.
+ model_path = 'IVGSZ/Flash-VStream-7b' # Replace with the actual model ID if different
+
+ model = AutoModel.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16, # Use bfloat16 for efficient memory usage
+     low_cpu_mem_usage=True,
+     trust_remote_code=True
+ ).eval().cuda()
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
+
+ # For detailed instructions on image/video preprocessing and chat interactions,
+ # please refer to the official GitHub repository:
+ # https://github.com/IVGSZ/Flash-VStream
+ ```
+
+ ## License
+
+ This project is licensed under the [Apache-2.0 License](https://github.com/IVGSZ/Flash-VStream/blob/main/LICENSE).
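
One way to sanity-check the metadata this PR adds is to parse the README front matter and inspect the new fields. The following is a minimal sketch and not part of the PR: `parse_front_matter` is a hypothetical helper that handles only the flat `key: value` pairs and `- item` lists used in this README, not general YAML.

```python
def parse_front_matter(text):
    """Parse the simple front matter at the top of a model card.

    Hypothetical helper: supports only flat `key: value` pairs and
    `- item` lists, which is all this README uses.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block
    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the block
            break
        if not stripped:
            continue
        if stripped.startswith("- ") and current_key is not None:
            # List item belonging to the most recently seen key.
            if not isinstance(meta[current_key], list):
                meta[current_key] = []
            meta[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            # A key with no inline value starts a list.
            meta[current_key] = value if value else []
    return meta


# Front matter as it appears on the "+" side of the diff.
README_HEAD = """---
license: apache-2.0
tags:
- vision-language model
- llama
- video understanding
pipeline_tag: video-text-to-text
library_name: transformers
---

# Flash-VStream Model Card
"""

meta = parse_front_matter(README_HEAD)
print(meta["license"], meta["pipeline_tag"], meta["library_name"])
```

For anything beyond this flat layout, a real YAML parser would be the safer choice.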