	Parameter-free LLaVA for video captioning works like magic! 🤩 Let's take a look!

	![image_1](image_1.jpg)

	Most video captioning models downsample video frames to reduce computational complexity and memory requirements without losing too much information in the process.
	PLLaVA, on the other hand, uses pooling! 🤩

	How? 🧐 It takes in video frames, passes them through a ViT and then a projection layer, and the output goes through average pooling, where the input shape is (# frames, width, height, text decoder input dim) 👇

	![image_2](image_2.jpg)
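
To make the pooling step concrete, here is a minimal NumPy sketch, not the authors' implementation. The function name `average_pool_video_features`, the feature sizes, and the requirement that each input size be an exact multiple of its target size are all assumptions made for illustration.

```python
import numpy as np


def average_pool_video_features(features, out_shape):
    """Average-pool (frames, height, width, dim) features to (frames', height', width', dim).

    Hypothetical helper: assumes every input size is an exact multiple
    of the corresponding target size, so pooling windows do not overlap.
    """
    t, h, w, d = features.shape
    t_out, h_out, w_out = out_shape
    if t % t_out or h % h_out or w % w_out:
        raise ValueError("each input size must be divisible by its target size")
    # Split every pooled axis into (output index, offset inside the window).
    blocks = features.reshape(
        t_out, t // t_out, h_out, h // h_out, w_out, w // w_out, d
    )
    # Average over the three window-offset axes; the feature dim is untouched.
    return blocks.mean(axis=(1, 3, 5))


# Illustrative sizes only (assumed, not taken from the post):
# 16 frames, a 24x24 patch grid, and a 64-dim projected feature per patch.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 24, 24, 64)).astype(np.float32)

# Keep all 16 frames and halve the spatial grid.
pooled = average_pool_video_features(features, (16, 12, 12))

# Flatten to a token sequence that a text decoder could consume.
visual_tokens = pooled.reshape(-1, pooled.shape[-1])
print(features.shape, "->", pooled.shape, "->", visual_tokens.shape)
```

With these assumed sizes, the number of visual tokens drops from 16 × 24 × 24 = 9216 to 16 × 12 × 12 = 2304, and the pooling itself adds no learnable weights, which is what "parameter-free" refers to.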

	The pooling operation surprisingly reduces the loss of spatial and temporal information. See some examples below of how it captures details 🤗

	![image_3](image_3.jpg)

	According to the authors' findings, it performs way better than many existing models (including proprietary VLMs) and scales very well with the size of the text decoder.

	![image_4](image_4.jpg)

	Model repositories 🤗 [7B](https://t.co/AeSdYsz1U7), [13B](https://t.co/GnI1niTxO7), [34B](https://t.co/HWAM0ZzvDc)
	Spaces🤗 [7B](https://t.co/Oms2OLkf7O), [13B](https://t.co/C2RNVNA4uR)

	> [!TIP]
	Resources:
	[PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994)
	by Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, Jiashi Feng (2024)
	[GitHub](https://github.com/magic-research/PLLaVA)

	> [!NOTE]
	[Original tweet](https://twitter.com/mervenoyann/status/1786336055425138939) (May 3, 2024)