---
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Spatial-LLaVA-7B Model Card

**[Github Repo](https://github.com/xi-jiajun/Spatial-LLaVA)** **[🤗 Huggingface Space Demo](https://huggingface.co/spaces/rogerxi/Spatial-LLaVA)**

## 🤖 Model details

**Model type:**
Spatial-LLaVA-7B is a LLaVA model fine-tuned from [liuhaotian/llava-pretrain-vicuna-7b-v1.3](https://huggingface.co/liuhaotian/llava-pretrain-vicuna-7b-v1.3) to improve the spatial-relation reasoning of large multimodal models. LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

## 🎯 Intended use

**Primary intended uses:**
The primary use of Spatial-LLaVA is research on large multimodal models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## 📚 Training dataset

Instruction-following training: [rogerxi/LLaVA-Spatial-Instruct-850K](https://huggingface.co/datasets/rogerxi/LLaVA-Spatial-Instruct-850K)

## 📊 Evaluation

A collection of 10 benchmarks:

| Model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-cn | MM-Vet |
|:-----------------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------:|:--------:|:-----------:|:--------:|
| LLaVA-1.5-7b | 78.5 | 62.0 | **50.0** | 66.8 | 58.2 | 85.9 | **1510.7** | 64.3 | 58.3 | 31.1 |
| Spatial-LLaVA-7b | **79.7** | **62.7** | 48.7 | **68.7** | **58.5** | **87.2** | 1472.7 | **67.8** | **60.7** | **31.6** |

[Spatial-Relation-Eval](https://huggingface.co/datasets/rogerxi/Spatial-Relation-Eval) (built on [SpatialRGPT-Bench](https://huggingface.co/datasets/a8cheng/SpatialRGPT-Bench)):

### Qualitative Spatial Relations

| Model | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg |
|:-----------------------:|:------------:|:-----------:|:----------:|:-----------:|:----------:|:-------------:|:-------------:|
| LLaVA-1.5-7b | 53.91 | 53.49 | 45.36 | 40.00 | **50.00** | 51.04 | 48.97 |
| LLaVA-1.5-13b | 54.28 | 52.32 | 45.36 | 48.57 | 49.02 | 47.92 | 49.67 |
| Spatial-LLaVA-7b | **56.32** | **66.28** | **60.82** | **48.57** | 49.02 | **52.08** | **55.12** |

### Quantitative Spatial Relations

| Model | Direct Dist (m / ratio) | Horizontal Dist (m / ratio) | Vertical Dist (m / ratio) | Width (m / ratio) | Height (m / ratio) | Direction (° / ratio) |
|:-----------------------:|:------------------------:|:----------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------------------------:|
| LLaVA-1.5-7b | 12.90 / 1.06 | 10.68 / 2.03 | 20.79 / 0.94 | **24.19 / 0.50** | 14.29 / 5.27 | 10.23 / 58.33 |
| LLaVA-1.5-13b | 13.71 / 0.93 | 10.68 / 3.56 | 16.83 / 0.85 | 15.32 / 0.57 | 17.67 / 5.80 | 14.77 / 54.29 |
| Spatial-LLaVA-7b | **24.19 / 0.57** | **14.56 / 0.62** | **41.58 / 0.42** | 22.58 / 1.12 | **18.25 / 2.92** | **20.45 / 56.47** |

## 🙏 Acknowledgements

We thank Liu Haotian et al. for the LLaVA pretraining scripts, weights, and LLaVA-v1.5 mixture dataset; the teams behind CLEVR, TextCaps, VisualMRC, and VQAv2 (via "HuggingFaceM4/the_cauldron"); remyxai for OpenSpaces; Anjie Cheng et al. for Spatial-Bench and the data pipeline; Google for OpenImages; and Hugging Face for their datasets infrastructure.