---
license: apache-2.0
pipeline_tag: image-text-to-text
---

<br>
<br>

# Spatial-LLaVA-7B Model Card

**[Github Repo](https://github.com/xi-jiajun/Spatial-LLaVA)**

**[🤗 Huggingface Space Demo](https://huggingface.co/spaces/rogerxi/Spatial-LLaVA)**

## 🤖 Model details

**Model type:**

Spatial-LLaVA-7B is a LLaVA model fine-tuned from [liuhaotian/llava-pretrain-vicuna-7b-v1.3](https://huggingface.co/liuhaotian/llava-pretrain-vicuna-7b-v1.3) to improve the spatial-relation reasoning abilities of large multimodal models.

LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
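
For reference, the snippet below is a minimal inference sketch following the quick-start pattern of the original LLaVA codebase (https://github.com/haotian-liu/LLaVA); the model repo ID, query, and image path are placeholder assumptions rather than values taken from this card.

```python
# Minimal inference sketch, assuming the original LLaVA codebase is installed
# (https://github.com/haotian-liu/LLaVA). The model path, prompt, and image
# file below are placeholders -- adjust them to your setup.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "rogerxi/Spatial-LLaVA-7B"  # assumed repo ID; replace if different

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Is the chair to the left of the table?",  # example spatial query
    "conv_mode": None,
    "image_file": "example.jpg",  # path or URL of the input image
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)  # prints the model's answer
```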

## 🎯 Intended use

**Primary intended uses:**

The primary use of LLaVA is research on large multimodal models and chatbots.

**Primary intended users:**

The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## 📚 Training dataset

Instruction-following training data: [rogerxi/LLaVA-Spatial-Instruct-850K](https://huggingface.co/datasets/rogerxi/LLaVA-Spatial-Instruct-850K)
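
To inspect the training data locally, a minimal sketch is shown below; it uses a generic snapshot download because LLaVA-style instruction sets typically ship as a JSON annotation file plus image folders rather than a ready-made `datasets` loading script (an assumption about this repo's layout).

```python
# Minimal sketch for fetching the instruction-tuning data with huggingface_hub.
# snapshot_download works for any dataset repo layout; whether the data is a
# single JSON file plus images is an assumption -- inspect the downloaded files.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="rogerxi/LLaVA-Spatial-Instruct-850K",
    repo_type="dataset",
)
print(local_dir)  # browse the downloaded annotation and image files from here
```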

## 📊 Evaluation

Results on a collection of 10 benchmarks:

| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-cn | MM-Vet |
|:-----------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------:|:--------:|:-----------:|:--------:|
| LLaVA-1.5-7b | 78.5 | 62.0 | **50.0** | 66.8 | 58.2 | 85.9 | **1510.7** | 64.3 | 58.3 | 31.1 |
| Spatial-LLaVA-7b | **79.7** | **62.7** | 48.7 | **68.7** | **58.5** | **87.2** | 1472.7 | **67.8** | **60.7** | **31.6** |

[Spatial-Relation-Eval](https://huggingface.co/datasets/rogerxi/Spatial-Relation-Eval) (built on [SpatialRGPT-Bench](https://huggingface.co/datasets/a8cheng/SpatialRGPT-Bench)):

### Qualitative Spatial Relations

| Model | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg |
|:-----------------------:|:------------:|:-----------:|:----------:|:-----------:|:----------:|:-------------:|:-------------:|
| LLaVA-1.5-7b | 53.91 | 53.49 | 45.36 | 40.00 | **50.00** | 51.04 | 48.97 |
| LLaVA-1.5-13b | 54.28 | 52.32 | 45.36 | 48.57 | 49.02 | 47.92 | 49.67 |
| Spatial-LLaVA-7b | **56.32** | **66.28** | **60.82** | **48.57** | 49.02 | **52.08** | **55.12** |

### Quantitative Spatial Relations

| Model | Direct Dist (m / ratio) | Horizontal Dist (m / ratio) | Vertical Dist (m / ratio) | Width (m / ratio) | Height (m / ratio) | Direction (° / ratio) |
|:-----------------------:|:------------------------:|:----------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|
| LLaVA-1.5-7b | 12.90 / 1.06 | 10.68 / 2.03 | 20.79 / 0.94 | **24.19 / 0.50** | 14.29 / 5.27 | 10.23 / 58.33 |
| LLaVA-1.5-13b | 13.71 / 0.93 | 10.68 / 3.56 | 16.83 / 0.85 | 15.32 / 0.57 | 17.67 / 5.8 | 14.77 / 54.29 |
| Spatial-LLaVA-7b | **24.19 / 0.57** | **14.56 / 0.62** | **41.58 / 0.42** | 22.58 / 1.12 | **18.25 / 2.92** | **20.45 / 56.47** |

## 🙏 Acknowledgements

We thank Haotian Liu et al. for the LLaVA pretraining scripts, weights, and the LLaVA-v1.5 mixture dataset; the teams behind CLEVR, TextCaps, VisualMRC, and VQAv2 (via “HuggingFaceM4/the_cauldron”); remyxai for OpenSpaces; Anjie Cheng et al. for SpatialRGPT-Bench and its data pipeline; Google for OpenImages; and Hugging Face for their datasets infrastructure.