Add model card for PPLLaVA

#1
by nielsr (HF Staff), opened
Files changed (1)
README.md +37 -3
README.md CHANGED
@@ -1,3 +1,37 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ ---
+
+ # PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
+
+ [PPLLaVA](https://huggingface.co/papers/2411.02327) (Prompt-guided Pooling LLaVA) is an efficient Video Large Language Model designed to process long and varied video sequences by adaptively compressing visual tokens based on user instructions.
+
+ - **Paper:** [PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance](https://huggingface.co/papers/2411.02327)
+ - **Repository:** [GitHub - farewellthree/ppllava](https://github.com/farewellthree/ppllava)
+
+ ## Introduction
+
+ PPLLaVA addresses the computational overhead of Video LLMs caused by high redundancy in video content. It introduces three key components:
+ 1. **CLIP-based visual-prompt alignment module**: identifies regions of interest based on user instructions.
+ 2. **Prompt-guided pooling mechanism**: adaptively compresses the visual sequence using convolution-style pooling, achieving up to 18x token reduction.
+ 3. **Clip context extension module**: tailored for processing long and complex prompts in visual dialogues.
+
+ The model achieves state-of-the-art results on diverse video understanding benchmarks, including VideoMME, MVBench, and ActivityNetQA, while significantly improving inference throughput (up to 8x faster).
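The prompt-guided pooling idea above can be sketched in a few lines: score each visual token against an embedding of the user's instruction, then pool within fixed windows using those scores as weights. This is a simplified NumPy illustration only, not the repository's actual implementation or API; the function name, shapes, and window size are all hypothetical.

```python
import numpy as np

def prompt_guided_pool(visual_tokens: np.ndarray,
                       prompt_embed: np.ndarray,
                       window: int = 18) -> np.ndarray:
    """Toy sketch of prompt-guided pooling (illustrative, not PPLLaVA's code).

    visual_tokens: (num_tokens, dim) visual features
    prompt_embed:  (dim,) embedding of the user instruction
    window:        tokens merged per output token (18 -> 18x reduction)
    """
    n, d = visual_tokens.shape
    assert n % window == 0, "pad so the window divides the token count"
    # Relevance of each visual token to the prompt (scaled dot product).
    scores = visual_tokens @ prompt_embed / np.sqrt(d)        # (n,)
    groups = visual_tokens.reshape(n // window, window, d)    # (n/w, w, d)
    s = scores.reshape(n // window, window)                   # (n/w, w)
    # Softmax within each window, then a weighted average of its tokens,
    # so prompt-relevant tokens dominate each pooled output.
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * groups).sum(axis=1)                # (n/w, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64)).astype(np.float32)
prompt = rng.standard_normal(64).astype(np.float32)
pooled = prompt_guided_pool(tokens, prompt, window=18)
print(pooled.shape)  # (32, 64)
```

With a window of 18 this compresses 576 visual tokens to 32, matching the up-to-18x reduction figure quoted above; the actual model applies this idea with convolution-style 3D pooling over CLIP features.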
+
+ ## Usage
+
+ Please refer to the [official GitHub repository](https://github.com/farewellthree/ppllava) for detailed instructions on installation, environment setup, and running the Gradio demo.
+
+ ## Citation
+
+ If you find the code and paper useful for your research, please consider citing:
+
+ ```bibtex
+ @article{liu2024ppllava,
+   title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
+   author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
+   journal={arXiv preprint arXiv:2411.02327},
+   year={2024}
+ }
+ ```