rogerxi/Spatial-LLaVA-7B


rogerxi committed on May 2, 2025

Commit 37fd3cc (verified) · 1 Parent(s): 015c8dd

Update README.md

Files changed (1)

README.md +50 -3


@@ -1,3 +1,50 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+pipeline_tag: image-text-to-text
+---
+<br>
+<br>
+# Spatial-LLaVA-7B Model Card
+## 🤖 Model details
+**Model type:**
+This fine-tuned LLaVA model is trained from [liuhaotian/llava-pretrain-vicuna-7b-v1.3](https://huggingface.co/liuhaotian/llava-pretrain-vicuna-7b-v1.3) to improve the spatial relation reasoning of large multimodal models.
+LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
+It is an auto-regressive language model, based on the transformer architecture.
+## 🎯 Intended use
+**Primary intended uses:**
+The primary use of LLaVA is research on large multimodal models and chatbots.
+**Primary intended users:**
+The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
+## 📚 Training dataset
+Instruction-following training: [rogerxi/LLaVA-Spatial-Instruct-850K](https://huggingface.co/datasets/rogerxi/LLaVA-Spatial-Instruct-850K)
+## 📊 Evaluation
+A collection of 10 benchmarks:
+| Model                  |   VQAv2  |    GQA   |  VizWiz  |    SQA   |  TextVQA |   POPE   |     MME    | MM-Bench | MM-Bench-cn |  MM-Vet  |
+|:----------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:----------:|:--------:|:-----------:|:--------:|
+|   LLaVA-1.5-7b   |   78.5   |   62.0   | **50.0** |   66.8   |   58.2   |   85.9   | **1510.7** |   64.3   |     58.3    |   31.1   |
+| Spatial-LLaVA-7b | **79.7** | **62.7** |   48.7   | **68.7** | **58.5** | **87.2** |   1472.7   | **67.8** |   **60.7**  | **31.6** |
+[SpatialRGPT-Bench](https://huggingface.co/datasets/a8cheng/SpatialRGPT-Bench) (with placeholders replaced by object names):
+### Qualitative Spatial Relations
+| Model                 | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg |
+|:-----------------------:|:------------:|:-----------:|:----------:|:-----------:|:----------:|:-------------:|:-------------: |
+|   LLaVA-1.5-7b        |      53.91 |     53.49 |    45.36 |     40.00 |    **50.00** |     51.04 |  48.97  |
+|   Spatial-LLaVA-7b    |      **56.32** |     **66.28** |    **60.82** |     **48.57** |    49.02 |      **52.08** | **55.12** |
+### Quantitative Spatial Relations
+| Model                 | Direct Dist (m / ratio) | Horizontal Dist (m / ratio) | Vertical Dist (m / ratio) | Width (m / ratio) | Height (m / ratio) | Direction (° / ratio) |
+|:-----------------------:|:------------------------:|:----------------------------:|:--------------------------:|:------------------:|:-------------------:|:----------------------:|
+| LLaVA-1.5-7b          |     12.90 / 1.06         |     10.68 / 2.03             |     20.79 / 0.94          |     **24.19 / 0.50**  |     14.29 / 5.27   |      10.23 / 58.33    |
+| Spatial-LLaVA-7b      |     **24.19 / 0.57**         |     **14.56 / 0.62**             |     **41.58 / 0.42**          |     22.58 / 1.12  |     **18.25 / 2.92**   |      **20.45 / 56.47**    |
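
Reviewer note (not part of the committed README): the ten-benchmark table added in this diff can be summarized as per-benchmark score differences. The sketch below is illustrative only. The scores are copied from the table; the helper name `score_deltas` is hypothetical, and the code assumes that a higher score is better on all ten benchmarks, which matches the bolding in the table.

```python
# Scores copied from the ten-benchmark table in the Evaluation section.
BENCHMARKS = [
    "VQAv2", "GQA", "VizWiz", "SQA", "TextVQA",
    "POPE", "MME", "MM-Bench", "MM-Bench-cn", "MM-Vet",
]
LLAVA_15_7B = [78.5, 62.0, 50.0, 66.8, 58.2, 85.9, 1510.7, 64.3, 58.3, 31.1]
SPATIAL_LLAVA_7B = [79.7, 62.7, 48.7, 68.7, 58.5, 87.2, 1472.7, 67.8, 60.7, 31.6]


def score_deltas(baseline, candidate):
    """Return candidate minus baseline per benchmark, rounded to one decimal.

    Hypothetical helper for this note; rounding removes float noise
    because the table reports one decimal place.
    """
    return [round(c - b, 1) for b, c in zip(baseline, candidate)]


deltas = dict(zip(BENCHMARKS, score_deltas(LLAVA_15_7B, SPATIAL_LLAVA_7B)))

# Assumption: higher is better on every benchmark listed here.
improved = [name for name, d in deltas.items() if d > 0]
regressed = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:12s} {d:+.1f}")
print("improved:", improved)
print("regressed:", regressed)
```

Under that assumption, Spatial-LLaVA-7b scores higher on eight of the ten benchmarks and lower on VizWiz and MME, which agrees with the bold entries in the table.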