Add model card for PPLLaVA
#1 by nielsr HF Staff - opened

README.md CHANGED
---
license: apache-2.0
pipeline_tag: video-text-to-text
---

# PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

[PPLLaVA](https://huggingface.co/papers/2411.02327) (Prompt-guided Pooling LLaVA) is an efficient Video Large Language Model designed to process long and varied video sequences by adaptively compressing visual tokens based on user instructions.

- **Paper:** [PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance](https://huggingface.co/papers/2411.02327)
- **Repository:** [GitHub - farewellthree/ppllava](https://github.com/farewellthree/ppllava)

## Introduction

PPLLaVA addresses the computational overhead of Video LLMs caused by high redundancy in video content. It introduces three key components:

1. **CLIP-based visual-prompt alignment module**: identifies regions of interest based on user instructions.
2. **Prompt-guided pooling mechanism**: adaptively compresses the visual sequence using convolution-style pooling, achieving up to 18x token reduction.
3. **Clip context extension module**: tailored for processing long and complex prompts in visual dialogues.

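To make the pooling idea concrete, here is a minimal, illustrative sketch (not the official implementation) of prompt-guided pooling: visual tokens are weighted by their similarity to the prompt's text embedding, then averaged over conv-style windows. The function name, the `(T, H, W, D)` token layout, and the stride values are assumptions for the example, not details from the paper.

```python
import numpy as np

def prompt_guided_pool(visual, text, stride=(2, 3, 3)):
    """Weight visual tokens by prompt relevance, then average-pool the grid.

    visual: (T, H, W, D) grid of visual token features.
    text:   (D,) text embedding of the user prompt.
    """
    T, H, W, D = visual.shape
    # Relevance of every visual token to the prompt (scaled dot product),
    # normalized with a softmax.
    scores = visual.reshape(-1, D) @ text / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    weights = weights.reshape(T, H, W, 1)

    st, sh, sw = stride
    out = []
    # Convolution-style pooling: each stride-sized window collapses to one
    # token, as a prompt-relevance-weighted average of its members.
    for t in range(0, T - st + 1, st):
        for h in range(0, H - sh + 1, sh):
            for w in range(0, W - sw + 1, sw):
                v = visual[t:t + st, h:h + sh, w:w + sw]
                wgt = weights[t:t + st, h:h + sh, w:w + sw]
                out.append((v * wgt).sum(axis=(0, 1, 2)) / (wgt.sum() + 1e-8))
    return np.stack(out)

# 8 frames of 24x24 tokens -> 4x8x8 pooled tokens: an 18x reduction.
tokens = prompt_guided_pool(np.random.rand(8, 24, 24, 64), np.random.rand(64))
print(tokens.shape)  # (256, 64)
```

With a `(2, 3, 3)` stride the 8x24x24 = 4608 input tokens collapse to 4x8x8 = 256, matching the up-to-18x compression figure above; the actual model chooses the pooling adaptively rather than with a fixed stride.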
The model achieves state-of-the-art results on diverse video understanding benchmarks, including VideoMME, MVBench, and ActivityNetQA, while significantly improving inference throughput (up to 8x faster).

## Usage

Please refer to the [official GitHub repository](https://github.com/farewellthree/ppllava) for detailed instructions on installation, environment setup, and running the Gradio demo.

## Citation

If you find the code and paper useful for your research, please consider citing:

```bibtex
@article{liu2024ppllava,
  title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
  author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
  journal={arXiv preprint arXiv:2411.02327},
  year={2024}
}
```