---
license: gpl-3.0
pipeline_tag: image-text-to-text
---

# LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

[Paper (arXiv:2501.03895)](https://arxiv.org/abs/2501.03895)
[Model (ICTNLP/llava-mini-llama-3.1-8b)](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b)

> **[Shaolei Zhang](https://zhangshaolei1998.github.io/), [Qingkai Fang](https://fangqingkai.github.io/), [Zhe Yang](https://nlp.ict.ac.cn/yjdw/xs/ssyjs/202210/t20221020_52708.html), [Yang Feng*](https://people.ucas.edu.cn/~yangfeng?language=en)**

LLaVA-Mini is a unified large multimodal model that supports the understanding of images, high-resolution images, and videos in an efficient manner. Guided by an interpretability analysis of how LMMs process visual tokens, LLaVA-Mini significantly improves efficiency while preserving vision capabilities. The [code](https://github.com/ictnlp/LLaVA-Mini), [model](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b), and [demo](https://github.com/ictnlp/LLaVA-Mini#-demo) of LLaVA-Mini are available now!

Refer to our [GitHub repo](https://github.com/ictnlp/LLaVA-Mini) for details of LLaVA-Mini!

> [!NOTE]
> LLaVA-Mini only requires **1 token** to represent each image, which improves the efficiency of image and video understanding:
> - **Computation**: 77% reduction in FLOPs
> - **Response latency**: reduced from 100 ms to 40 ms
> - **GPU memory**: reduced from 360 MB/image to 0.6 MB/image, enabling 3-hour video processing (see the back-of-envelope sketch below)
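
The memory figures above are consistent with simple KV-cache arithmetic. Below is a back-of-envelope sketch, assuming a 7B/8B-scale LLM with 32 decoder layers, a hidden size of 4096, an fp16 cache, and no grouped-query attention; these are illustrative assumptions, not the paper's exact accounting.

```python
# Back-of-envelope KV-cache cost per vision token (illustrative assumptions).
LAYERS = 32    # decoder layers of a 7B/8B-scale LLM (assumed)
HIDDEN = 4096  # hidden size (assumed)
BYTES = 2      # fp16

# Each cached token stores one key and one value vector per layer.
kv_per_token = 2 * LAYERS * HIDDEN * BYTES
print(f"per token: {kv_per_token / 2**20:.2f} MiB")  # 0.50 MiB

for tokens in (576, 1):  # LLaVA-v1.5 vs. LLaVA-Mini vision tokens per image
    print(f"{tokens:>3} tokens ≈ {tokens * kv_per_token / 2**20:.1f} MiB per image")
```

Under these assumptions, 576 vision tokens cost roughly 288 MiB of KV cache per image while a single token costs about 0.5 MiB, matching the order of magnitude of the 360 MB and 0.6 MB figures above.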

<p align="center" width="100%">
<img src="./assets/performance.png" alt="performance" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

💡 **Highlights**:
1. **Good performance**: LLaVA-Mini achieves performance comparable to LLaVA-v1.5 while using only 1 vision token instead of 576 (a compression rate of 0.17%).
2. **High efficiency**: LLaVA-Mini reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and can process over 10,000 frames of video on GPU hardware with 24 GB of memory (see the sketch after this list).
3. **Insights**: To develop LLaVA-Mini, which reduces vision tokens while maintaining visual understanding, we conducted a preliminary analysis of how large multimodal models (LMMs) process visual tokens. Please refer to our [paper](https://arxiv.org/pdf/2501.03895) for the detailed analysis and conclusions.
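
The frame count in the second highlight also follows from the per-frame cost. The sketch below is a rough feasibility check, assuming about 16 GB of the 24 GB is occupied by fp16 weights of an 8B-scale model and the rest is spent on vision tokens at 0.6 MB per frame; real headroom depends on activations and framework overhead.

```python
# Rough frame budget for video understanding on a 24 GB GPU (illustrative assumptions).
GPU_MEM_MB = 24 * 1024
WEIGHTS_MB = 16 * 1024  # assumed fp16 weights of an 8B-scale LLM
PER_FRAME_MB = 0.6      # per-frame vision-token cost reported above

frames = (GPU_MEM_MB - WEIGHTS_MB) / PER_FRAME_MB
print(f"~{frames:,.0f} frames fit in the remaining memory")   # ~13,653
print(f"≈ {frames / 3600:.1f} hours of video at 1 frame/sec") # ≈ 3.8 hours
```

This leaves room for well over 10,000 frames, consistent with processing a 3-hour video at a 1 frame-per-second sampling rate.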

## 🖥 Demo

<p align="center" width="100%">
<img src="./assets/llava_mini.gif" alt="llava_mini" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

- Download the LLaVA-Mini model from [here](https://huggingface.co/ICTNLP/llava-mini-llama-3.1-8b).
- Run these scripts and interact with LLaVA-Mini in your browser:

```bash
# Launch a controller
python -m llavamini.serve.controller --host 0.0.0.0 --port 10000 &

# Launch a model worker serving LLaVA-Mini
CUDA_VISIBLE_DEVICES=0 python -m llavamini.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ICTNLP/llava-mini-llama-3.1-8b --model-name llava-mini &

# Start the interactive interface
python -m llavamini.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --port 7860
```

## 🔥 Quick Start

### Requirements

- Install the package (run from a clone of the [GitHub repo](https://github.com/ictnlp/LLaVA-Mini), since the install is editable):

```bash
conda create -n llavamini python=3.10 -y
conda activate llavamini
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

### Command Interaction

- Image understanding, using `--image-file`:

```bash
# Image Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path ICTNLP/llava-mini-llama-3.1-8b \
    --image-file llavamini/serve/examples/baby_cake.png \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What's the text on the cake?"
```

- Video understanding, using `--video-file`:

```bash
# Video Understanding
CUDA_VISIBLE_DEVICES=0 python llavamini/eval/run_llava_mini.py \
    --model-path ICTNLP/llava-mini-llama-3.1-8b \
    --video-file llavamini/serve/examples/fifa.mp4 \
    --conv-mode llava_llama_3_1 --model-name "llava-mini" \
    --query "What happened in this video?"
```

### Reproduction and Evaluation

- Refer to [Evaluation.md](docs/Evaluation.md) for the evaluation of LLaVA-Mini on image and video benchmarks.

### Cases

- LLaVA-Mini achieves high-quality image and video understanding.
| | <p align="center" width="100%"> |
| | <img src="./assets/case1.png" alt="case1" style="width: 100%; min-width: 300px; display: block; margin: auto;"> |
| | </p> |
| |
|
| | <details> |
| | <summary>More cases</summary> |
| | <p align="center" width="100%"> |
| | <img src="./assets/case2.png" alt="case2" style="width: 100%; min-width: 300px; display: block; margin: auto;"> |
| | </p> |
| |
|
| | <p align="center" width="100%"> |
| | <img src="./assets/case3.png" alt="case3" style="width: 100%; min-width: 300px; display: block; margin: auto;"> |
| | </p> |
| |
|
| | <p align="center" width="100%"> |
| | <img src="./assets/case4.png" alt="case4" style="width: 100%; min-width: 300px; display: block; margin: auto;"> |
| | </p> |
| |
|
| | </details> |
| |
|

- LLaVA-Mini dynamically compresses each image to capture the important visual information (brighter areas are weighted more heavily during compression).

<p align="center" width="100%">
<img src="./assets/compression.png" alt="compression" style="width: 100%; min-width: 300px; display: block; margin: auto;">
</p>

## 🖋 Citation

If this repository is useful to you, please cite our paper as:
| | ``` |
| | @misc{llavamini, |
| | title={LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token}, |
| | author={Shaolei Zhang and Qingkai Fang and Zhe Yang and Yang Feng}, |
| | year={2025}, |
| | eprint={2501.03895}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CV}, |
| | url={https://arxiv.org/abs/2501.03895}, |
| | } |
| | ``` |

If you have any questions, please feel free to open an issue or contact `zhangshaolei20z@ict.ac.cn`.