Improve model card: Add pipeline tag, library, project page, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
1. README.md (+82 -3)
README.md CHANGED
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# VPP-LLaVA Model Card

## Abstract

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets. The code and dataset are available at [this GitHub repository](https://github.com/WayneTomas/VPP-LLaVA).

## Links

* **Paper**: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426)
* **Code (GitHub)**: [https://github.com/WayneTomas/VPP-LLaVA](https://github.com/WayneTomas/VPP-LLaVA)
* **Project Page**: [https://osatlas.github.io/](https://osatlas.github.io/)

## Model Details

**Model Type**: VPP-LLaVA is an enhanced multimodal model built upon the LLaVA architecture. It is designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP) into the original LLaVA model. LLaVA itself is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

**Paper or Resources for More Information**:
- Original LLaVA: [LLaVA: Large Language and Vision Assistant](https://llava-vl.github.io/)
- VPP-LLaVA Enhancements: [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426)

## License

**Primary Intended Users**: The primary intended users of VPP-LLaVA are researchers and hobbyists in the fields of computer vision, natural language processing, machine learning, and artificial intelligence, who are interested in exploring advanced multimodal models and improving visual grounding performance.

## Usage

You can use `VPP-LLaVA-13b` with the `transformers` library. The model is designed for visual grounding tasks. Since the architecture is custom, the exact processor behavior is defined by the model's remote code; the snippet below follows the usual LLaVA-style pattern.

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import torch

# Load model and processor.
# Note: trust_remote_code=True is required for custom model architectures.
model_id = "wayneicloud/VPP-LLaVA-13b"  # Replace with the actual model ID on Hugging Face
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Adjust dtype based on your hardware and the model's config
    device_map="auto",
    trust_remote_code=True,
)

# Prepare an example image
image_url = "https://llava-vl.github.io/static/images/a-man-and-a-woman.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Define the text instruction for visual grounding.
# LLaVA-style chat templates expect the image placeholder in the content list.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the bounding box of the woman? Use <box> and </box> tokens."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs (image + text)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode and print the generated text
generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Generated response: {generated_text}")
# Expected output might be in a format like:
# Generated response: The bounding box of the woman is <box> [x1, y1, x2, y2] </box>
```

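The `<box> [x1, y1, x2, y2] </box>` response format shown above is illustrative; the exact output depends on the model checkpoint. Assuming that format, a small helper can extract the coordinates for downstream use:

```python
import re

# Pattern for a box tag containing four comma-separated numbers,
# optionally wrapped in brackets (the tag format is an assumption
# based on the prompt in the usage example above).
_BOX_RE = re.compile(
    r"<box>\s*\[?\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]?\s*</box>"
)

def parse_box(text):
    """Return (x1, y1, x2, y2) as floats from the first <box>...</box>
    span in `text`, or None if no box is present."""
    m = _BOX_RE.search(text)
    if m is None:
        return None
    return tuple(float(g) for g in m.groups())

print(parse_box("The bounding box of the woman is <box> [120, 45, 310, 400] </box>"))
# (120.0, 45.0, 310.0, 400.0)
```

If the checkpoint emits normalized coordinates instead of pixel values, scale the parsed tuple by the image width and height.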
## Training Dataset

The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: [VPP-SFT](https://huggingface.co/datasets/wayneicloud/VPP-SFT/tree/main). This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks. Please refer to the [VPP-LLaVA GitHub repository](https://github.com/WayneTomas/VPP-LLaVA) for more details.

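Grounding SFT datasets of this kind typically store box targets normalized to [0, 1] rather than in pixel coordinates; the actual VPP-SFT format is defined by the files in the dataset repo, so treat the following as an illustrative sketch of that common convention:

```python
def normalize_box(box, width, height):
    """Convert a pixel-space box (x1, y1, x2, y2) into coordinates
    normalized to [0, 1], rounded to three decimals — a common
    convention in grounding SFT data (an assumption here, not a
    documented property of VPP-SFT)."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width, 3),
        round(y1 / height, 3),
        round(x2 / width, 3),
        round(y2 / height, 3),
    )

print(normalize_box((120, 45, 310, 400), 640, 480))
# (0.188, 0.094, 0.484, 0.833)
```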
## Model Enhancements

VPP-LLaVA introduces Visual Position Prompts (VPP) to the original LLaVA architecture to enhance visual grounding capabilities. The enhancements are based on the research presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426). The VPP mechanism includes:
- **Global VPP**: Provides a global position reference by overlaying learnable, axis-like embeddings onto the input image.
- **Local VPP**: Focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.
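The two mechanisms above can be sketched in PyTorch. The shapes, module names, and parameterizations below are illustrative assumptions, not the paper's exact design; see the GitHub repository for the real implementation:

```python
import torch
import torch.nn as nn

class GlobalVPP(nn.Module):
    """Sketch of the global VPP: a learnable, axis-like tensor
    overlaid onto the input image (shapes are assumptions)."""
    def __init__(self, channels=3, size=336):
        super().__init__()
        # A learnable row profile plus a learnable column profile;
        # broadcasting their sum yields an axis-like 2-D prompt.
        self.row = nn.Parameter(torch.zeros(channels, size, 1))
        self.col = nn.Parameter(torch.zeros(channels, 1, size))

    def forward(self, images):               # images: (B, C, H, W)
        prompt = self.row + self.col         # (C, H, W) by broadcasting
        return images + prompt.unsqueeze(0)  # overlay onto every image

class LocalVPP(nn.Module):
    """Sketch of the local VPP: learnable position-aware queries
    attend over patch features to suggest probable object locations."""
    def __init__(self, num_queries=16, dim=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats):                # feats: (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)  # (B, num_queries, dim)
        return out

x = torch.randn(2, 3, 336, 336)
out = GlobalVPP()(x)             # prompt is zero-initialized, so out == x at init
feats = torch.randn(2, 576, 64)  # e.g. a 24x24 patch grid
local_out = LocalVPP()(feats)
print(out.shape, local_out.shape)
```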
 
 
VPP-LLaVA demonstrates remarkable zero-shot performance on unseen datasets, particularly in challenging scenarios involving part-object and multi-object situations. This capability is crucial for real-world applications where the model may encounter previously unseen objects or complex scenes. The model's ability to generalize and accurately ground visual references in these scenarios highlights its robustness and adaptability.

VPP-LLaVA paper link: https://arxiv.org/abs/2503.15426

## Acknowledgements

This repo is adapted from [LLaVA v1.5](https://github.com/haotian-liu/LLaVA). It also benefits from [ChatterBox (AAAI 2025)](https://github.com/sunsmarterjie/ChatterBox) and [Genixer (ECCV 2024)](https://github.com/zhaohengyuan1/Genixer).

Thanks for their wonderful work.

## Citation

```bibtex
@misc{tang2025visualpositionpromptmllm,
      title={Visual Position Prompt for MLLM based Visual Grounding},
      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
      year={2025},
      eprint={2503.15426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15426},
}
```