---
license: apache-2.0
pipeline_tag: video-text-to-text
---
# PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
PPLLaVA (Prompt-guided Pooling LLaVA) is an efficient Video Large Language Model designed to process long and varied video sequences by adaptively compressing visual tokens based on user instructions.
- Paper: [PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance](https://arxiv.org/abs/2411.02327)
- Repository: [farewellthree/PPLLaVA](https://github.com/farewellthree/PPLLaVA)
## Introduction
PPLLaVA addresses the computational overhead that Video LLMs incur from highly redundant video content. It introduces three key components:
- CLIP-based visual-prompt alignment module: Identifies regions of interest based on user instructions.
- Prompt-guided pooling mechanism: Adaptively compresses the visual sequence using convolution-style pooling, achieving up to 18x token reduction.
- CLIP context extension module: Tailored for processing long and complex prompts in visual dialogues.
The model achieves state-of-the-art results on diverse video understanding benchmarks, including VideoMME, MVBench, and ActivityNetQA, while significantly improving inference throughput (up to 8x faster).
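To make the prompt-guided pooling idea concrete, here is a minimal sketch (not the repository's actual implementation; every function name and shape below is hypothetical): each visual token is scored by its similarity to an instruction embedding, and tokens are then averaged within fixed windows using softmax weights derived from those scores, so instruction-relevant tokens dominate each pooled token.

```python
import numpy as np

def prompt_guided_pool(visual_tokens, prompt_emb, window=18):
    """Compress visual tokens via prompt-relevance-weighted window pooling.

    visual_tokens: (N, d) array of patch/frame embeddings
    prompt_emb:    (d,) embedding of the user instruction
    window:        pooling window size (18 -> roughly 18x token reduction)
    """
    n, _ = visual_tokens.shape
    # Relevance of each visual token to the prompt (dot-product score).
    scores = visual_tokens @ prompt_emb
    pooled = []
    for start in range(0, n, window):
        v = visual_tokens[start:start + window]
        s = scores[start:start + window]
        w = np.exp(s - s.max())
        w /= w.sum()          # softmax weights within the window
        pooled.append(w @ v)  # relevance-weighted average of the window
    return np.stack(pooled)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # e.g. 576 patch tokens from one frame
prompt = rng.normal(size=64)
out = prompt_guided_pool(tokens, prompt)
print(out.shape)  # (32, 64): an 18x reduction in token count
```

When all scores in a window are equal, the softmax weights are uniform and the pooling degenerates to a plain average; unequal scores shift each pooled token toward the instruction-relevant content, which is the intuition behind conditioning the compression on the prompt.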
## Usage
Please refer to the official GitHub repository for detailed instructions on installation, environment setup, and running the Gradio demo.
## Citation
If you find the code and paper useful for your research, please consider citing:
```bibtex
@article{liu2024ppllava,
  title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
  author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
  journal={arXiv preprint arXiv:2411.02327},
  year={2024}
}
```