---
base_model:
- liuhaotian/llava-v1.5-7b
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- visual-grounding
- spatial-reasoning
---
# VPP-LLaVA: Visual Position Prompt for MLLM-based Visual Grounding
This repository contains **VPP-LLaVA**, an enhanced multimodal large language model built on the LLaVA architecture that improves visual grounding by incorporating Visual Position Prompts (VPP).

The model was presented in the paper [Visual Position Prompt for MLLM based Visual Grounding](https://arxiv.org/abs/2503.15426).

Code: https://github.com/WayneTomas/VPP-LLaVA
## Model Details

**Model Type:** VPP-LLaVA is an enhanced multimodal model built upon the LLaVA architecture. It is designed to improve visual grounding capabilities by incorporating Visual Position Prompts (VPP) into the original LLaVA model. LLaVA itself is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data; it is an auto-regressive language model based on the transformer architecture.
**Model Date:** The VPP-LLaVA-7B enhancements were developed and tested in Feb. 2025, based on the LLaVA-v1.5-7B model.
## About VPP-LLaVA
Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability.
To address these issues, VPP-LLaVA introduces an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets.
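The global VPP described above can be pictured as a learnable, image-shaped tensor blended onto the input so the visual encoder sees explicit spatial reference cues. The following is a minimal illustrative sketch, not the authors' actual implementation: the class name, tensor shape, and the `alpha` blending weight are all assumptions for demonstration.

```python
import torch
import torch.nn as nn


class GlobalVPP(nn.Module):
    """Hypothetical sketch of a global Visual Position Prompt:
    a learnable, axis-like tensor overlaid on the input image to
    provide structured spatial cues (names/values are illustrative)."""

    def __init__(self, channels: int = 3, size: int = 336, alpha: float = 0.1):
        super().__init__()
        # Learnable spatial prompt with the same shape as one input image.
        self.prompt = nn.Parameter(torch.zeros(1, channels, size, size))
        self.alpha = alpha  # blending strength (assumed hyperparameter)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Overlay the learned positional cues onto every image in the batch;
        # broadcasting expands the (1, C, H, W) prompt across the batch dim.
        return images + self.alpha * self.prompt


# Toy usage: a batch of two 336x336 RGB images.
x = torch.randn(2, 3, 336, 336)
vpp = GlobalVPP()
y = vpp(x)
assert y.shape == x.shape  # the prompt does not change the image shape
```

Because the prompt is a `nn.Parameter`, it is updated by gradient descent during fine-tuning alongside the rest of the model; the local VPP's position-aware queries would be a separate component on the feature side.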
## Examples of VPP-LLaVA
Our method shows strong zero-shot capability on the more challenging GSEval-BBox dataset, especially in part-object and multi-object scenarios. In the visualizations, green denotes the ground truth (GT), red denotes our VPP-LLaVA-7B, and purple denotes Qwen2.5-VL-7B.
## Quick Start with Hugging Face
```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import (get_model_name_from_path, process_images,
                            tokenizer_image_token)
from llava.model.builder import load_pretrained_model

model_path = "wayneicloud/VPP-LLaVA-7b"  # or "wayneicloud/VPP-LLaVA-13b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# Example usage for visual grounding.
# Note: the exact prompt template and preprocessing may differ from this
# sketch; refer to the project's GitHub repository for the full pipeline.
prompt = DEFAULT_IMAGE_TOKEN + "\nDescribe the image and locate the object 'tree' (with bbox)."
image_file = "path/to/your/image.jpg"  # replace with your image path
image = Image.open(image_file).convert("RGB")

# Preprocess the image into the tensor batch the model expects.
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Tokenize the prompt, splicing the image placeholder in at IMAGE_TOKEN_INDEX.
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        max_new_tokens=100,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## Training Dataset
The training dataset for VPP-LLaVA is the VPP-SFT dataset, which is available on Hugging Face: VPP-SFT. This dataset contains about 0.6M high-quality visual grounding samples, designed to efficiently train the model for improved visual grounding tasks.
## Evaluation Dataset
The evaluation dataset for VPP-LLaVA includes the following benchmarks:
- RefCOCO
- RefCOCO+
- RefCOCOg
- ReferIt
- GSEval-BBox
## License
The original LLaVA model is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved. The enhancements and modifications for VPP-LLaVA are intended for research use only and follow the same licensing principles.
## Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{tang2025visualpositionpromptmllm,
      title={Visual Position Prompt for MLLM based Visual Grounding},
      author={Wei Tang and Yanpeng Sun and Qinying Gu and Zechao Li},
      year={2025},
      eprint={2503.15426},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15426},
}
```