Improve model card: Add pipeline tag, library, project page, and usage example (#1)
Opened by nielsr (HF Staff)

README.md (changed)
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# VPP-LLaVA Model Card

## Abstract

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets. The code and dataset are available at [this GitHub repository](https://github.com/WayneTomas/VPP-LLaVA).

## Links

* **Paper**: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426)
* **Code (GitHub)**: [https://github.com/WayneTomas/VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
* **Project Page**: [https://osatlas.github.io/](https://osatlas.github.io/)
## Model Details

**Model Type**: VPP-LLaVA is an enhanced multimodal model built upon the LLaVA architecture. It is designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP) into the original LLaVA model. LLaVA itself is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

**Paper or Resources for More Information**:
- Original LLaVA: [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
- VPP-LLaVA Enhancements: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426)

## License

**Primary Intended Users**: The primary intended users of VPP-LLaVA are researchers and hobbyists in the fields of computer vision, natural language processing, machine learning, and artificial intelligence, who are interested in exploring advanced multimodal models and improving visual grounding performance.

## Usage

You can use `VPP-LLaVA-13b` with the `transformers` library. The model is designed for visual grounding tasks. Since the model uses a custom architecture loaded via `trust_remote_code`, the exact processor API may differ slightly; the snippet below follows the standard LLaVA-style pattern.

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

# Load model and processor.
# Note: trust_remote_code=True is required for custom model architectures.
model_id = "wayneicloud/VPP-LLaVA-13b"  # Replace with the actual model ID on Hugging Face
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Adjust dtype based on your hardware and the model's config
    device_map="auto",
    trust_remote_code=True,
)

# Prepare an example image
image_url = "https://llava-vl.github.io/static/images/a-man-and-a-woman.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the instruction for visual grounding. The message content includes an
# image placeholder so the chat template can insert the image tokens.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the bounding box of the woman? Use <box> and </box> tokens."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs (image + text)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode and print the generated text
generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Generated response: {generated_text}")
# Expected output might look like:
# Generated response: The bounding box of the woman is <box> [x1, y1, x2, y2] </box>
```
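The grounding response embeds coordinates between `<box>` and `</box>` tokens. A minimal, hypothetical helper for pulling them out of the generated text (the exact output format may vary between checkpoints, so adjust the pattern to match what your model actually emits):

```python
import re


def parse_box(text: str):
    """Extract [x1, y1, x2, y2] from a '<box> [x1, y1, x2, y2] </box>' span.

    Returns None if no box-like span is found. The pattern is an assumption
    about the output format, not guaranteed by the model card.
    """
    match = re.search(r"<box>\s*\[([^\]]+)\]\s*</box>", text)
    if match is None:
        return None
    return [float(v) for v in match.group(1).split(",")]


print(parse_box("The bounding box of the woman is <box> [0.42, 0.10, 0.88, 0.97] </box>"))
# [0.42, 0.1, 0.88, 0.97]
```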

## Training Dataset

The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT/tree/main). This dataset contains about 0.6M high-quality visual grounding samples, designed to train the model efficiently for improved visual grounding tasks. Please refer to the [VPP-LLaVA repository](https://github.com/WayneTomas/VPP-LLaVA) for more details.

## Model Enhancements

VPP-LLaVA introduces Visual Position Prompts (VPP) to the original LLaVA architecture to enhance visual grounding capabilities. The enhancements are based on the research presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426). The VPP mechanism includes:
- **Global VPP**: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
- **Local VPP**: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.
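As a rough illustration of the global VPP idea, one can picture normalized x/y coordinate planes mixed into the image through learnable weights. This sketch is an assumption for intuition only; the shapes, names, and zero-initialized weights are not the actual VPP-LLaVA implementation:

```python
import numpy as np

# Illustrative sketch only: overlay an axis-like position prompt onto an image.
H, W, C = 336, 336, 3  # assumed CLIP-style input resolution

rng = np.random.default_rng(0)
image = rng.random((H, W, C)).astype(np.float32)

# Axis-like spatial cues: normalized row/column coordinate planes.
ys = np.broadcast_to(np.linspace(0.0, 1.0, H, dtype=np.float32)[:, None], (H, W))
xs = np.broadcast_to(np.linspace(0.0, 1.0, W, dtype=np.float32)[None, :], (H, W))
pos = np.stack([ys, xs], axis=-1)  # (H, W, 2)

# Learnable per-channel mixing weights (zero-initialized here, so the
# overlay starts as a no-op and would be shaped by training).
weights = np.zeros((2, C), dtype=np.float32)
prompted = image + pos @ weights  # (H, W, 2) @ (2, C) -> (H, W, C)

print(prompted.shape)  # (336, 336, 3)
```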

VPP-LLaVA demonstrates remarkable zero-shot performance on unseen datasets, particularly in challenging scenarios involving part-object and multi-object situations. This capability is crucial for real-world applications where the model may encounter previously unseen objects or complex scenes. The model's ability to generalize and accurately ground visual references in these scenarios highlights its robustness and adaptability.

VPP-LLaVA paper link: https://arxiv.org/abs/2503.15426

## Acknowledgements

This repo is adapted from [LLaVA v1.5](https://github.com/haotian-liu/LLaVA). It also benefits from [ChatterBox (AAAI 2025)](https://github.com/sunsmarterjie/ChatterBox) and [Genixer (ECCV 2024)](https://github.com/zhaohengyuan1/Genixer).

Thanks for their wonderful work.

## Citation

```bibtex
@misc{tang2025visualpositionpromptmllm,
      title={Visual Position Prompt for MLLM based Visual Grounding},
      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
      year={2025},
      eprint={2503.15426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15426},
}
```