---
license: apache-2.0
pipeline_tag: video-text-to-text
---
# PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
PPLLaVA (Prompt-guided Pooling LLaVA) is an efficient Video Large Language Model designed to process long and varied video sequences by adaptively compressing visual tokens based on user instructions.
- Paper: [PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance](https://arxiv.org/abs/2411.02327)
- Repository: [farewellthree/PPLLaVA](https://github.com/farewellthree/PPLLaVA)
## Introduction
PPLLaVA addresses the computational overhead that Video LLMs incur from highly redundant video content. It introduces three key components:
- CLIP-based visual-prompt alignment module: Identifies regions of interest based on user instructions.
- Prompt-guided pooling mechanism: Adaptively compresses the visual sequence using convolution-style pooling, achieving up to 18x token reduction.
- CLIP context extension module: Tailored for processing long and complex prompts in visual dialogues.
The model achieves state-of-the-art results on diverse video understanding benchmarks, including VideoMME, MVBench, and ActivityNetQA, while significantly improving inference throughput (up to 8x faster).
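To make the prompt-guided pooling idea concrete, here is a minimal sketch (not the repository's actual implementation; every function name and shape below is hypothetical): each visual token is scored by its similarity to an instruction embedding, and tokens are then averaged within fixed windows using softmax weights derived from those scores, so instruction-relevant tokens dominate each pooled token.

```python
import numpy as np

def prompt_guided_pool(visual_tokens, prompt_emb, window=18):
    """Compress visual tokens via prompt-relevance-weighted window pooling.

    visual_tokens: (N, d) array of patch/frame embeddings
    prompt_emb:    (d,) embedding of the user instruction
    window:        pooling window size (18 -> roughly 18x token reduction)
    """
    n, _ = visual_tokens.shape
    # Relevance of each visual token to the prompt (dot-product score).
    scores = visual_tokens @ prompt_emb
    pooled = []
    for start in range(0, n, window):
        v = visual_tokens[start:start + window]
        s = scores[start:start + window]
        w = np.exp(s - s.max())
        w /= w.sum()          # softmax weights within the window
        pooled.append(w @ v)  # relevance-weighted average of the window
    return np.stack(pooled)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # e.g. 576 patch tokens from one frame
prompt = rng.normal(size=64)
out = prompt_guided_pool(tokens, prompt)
print(out.shape)  # (32, 64): an 18x reduction in token count
```

When all scores in a window are equal, the softmax weights are uniform and the pooling degenerates to a plain average; unequal scores shift each pooled token toward the instruction-relevant content, which is the intuition behind conditioning the compression on the prompt.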
## Usage
Please refer to the official GitHub repository for detailed instructions on installation, environment setup, and running the Gradio demo.
## Citation
If you find the code and paper useful for your research, please consider citing:
```bibtex
@article{liu2024ppllava,
  title={PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance},
  author={Liu, Ruyang and Tang, Haoran and Liu, Haibo and Ge, Yixiao and Shan, Ying and Li, Chen and Yang, Jiankun},
  journal={arXiv preprint arXiv:2411.02327},
  year={2024}
}
```