---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

The GeoVista-RL-6k-7B model is an agentic model presented in the paper *GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization*. It is designed for geolocalization tasks, which require nuanced visual grounding and web search to confirm or refine hypotheses during reasoning. GeoVista seamlessly integrates tool invocation within its reasoning loop, such as an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information.
GeoVista achieves strong performance, surpassing other open-source agentic models on the geolocalization task and achieving performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
- Paper: [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://arxiv.org/abs/2511.15705)
- Project Page: https://ekonwang.github.io/geo-vista/
- GitHub Repository: https://github.com/ekonwang/GeoVista
- GeoBench Dataset: https://huggingface.co/datasets/LibraTree/GeoBench
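Conceptually, the image-zoom-in tool mentioned above amounts to cropping a region of interest and upscaling it so the model can re-examine fine details (street signs, text, vegetation). The following is a minimal, hypothetical sketch of such a tool; the function name `zoom_in`, the `(left, top, right, bottom)` bounding-box convention, and the scale factor are illustrative assumptions, not the repository's actual implementation:

```python
from PIL import Image

def zoom_in(image: Image.Image, bbox: tuple[int, int, int, int], scale: int = 2) -> Image.Image:
    """Crop a region of interest and upscale it for closer inspection.

    bbox is (left, top, right, bottom) in pixel coordinates.
    """
    crop = image.crop(bbox)
    # Bicubic interpolation keeps text edges reasonably sharp when magnifying
    return crop.resize((crop.width * scale, crop.height * scale), Image.BICUBIC)

# Example: magnify the top-left quadrant of a 640x480 image
img = Image.new("RGB", (640, 480), color="gray")
zoomed = zoom_in(img, (0, 0, 320, 240), scale=2)
print(zoomed.size)  # (640, 480)
```

In an agentic loop, the model would emit a tool call with the bounding box, receive the magnified crop as a new image input, and continue reasoning over it.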
## Usage with Transformers
You can use GeoVista-RL-6k-7B directly with the Hugging Face Transformers library.
First, ensure you have the `transformers` and `accelerate` libraries installed:

```bash
pip install transformers accelerate
```
Then, you can perform basic inference as follows. Note that full agentic behavior involving web search requires additional setup (such as a Tavily API key and deployment with vLLM), as described in the GitHub repository. This snippet demonstrates direct VLM capabilities only.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import torch

# Load the model and processor
model_id = "LibraTree/GeoVista-RL-6k-7B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # torch.float16 also works for efficiency
    device_map="auto",
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load your image (replace with your image path)
# Example image from the GitHub repo: https://github.com/ekonwang/GeoVista/blob/main/examples/geobench-example.png
image = Image.open("examples/geobench-example.png").convert("RGB")

# Define the conversational prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Please analyze where this place is."},
        ],
    }
]

# Apply chat template and process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate a response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
# The output typically includes reasoning steps and a final location prediction
```
## Benchmark
GeoVista was evaluated on the newly curated GeoBench dataset, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models.
GeoBench is the first high-resolution, multi-source, globally annotated dataset to evaluate agentic models' general geolocalization ability. The benchmark assesses models along five axes: Global Coverage (GC), Reasonable Localizability (RC), High Resolution (HR), Data Variety (DV), and Nuanced Evaluation (NE).
| Benchmark | Year | GC | RC | HR | DV | NE |
|---|---|---|---|---|---|---|
| Im2GPS | 2008 | ✓ | | | | |
| YFCC4k | 2017 | ✓ | | | | |
| Google Landmarks v2 | 2020 | ✓ | | | | |
| VIGOR | 2022 | ✓ | | | | |
| OSV-5M | 2024 | ✓ | ✓ | ✓ | | |
| GeoComp | 2025 | ✓ | ✓ | ✓ | | |
| GeoBench (ours) | 2025 | ✓ | ✓ | ✓ | ✓ | ✓ |
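Geolocalization benchmarks of this kind are commonly scored by the great-circle distance between the predicted and ground-truth coordinates (for GeoBench's exact metric definitions, see the paper and dataset card). The standard haversine computation is shown below as an illustrative sketch, not as GeoBench's official scorer:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Paris (48.8566, 2.3522) vs. London (51.5074, -0.1278): roughly 344 km
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```

A prediction can then be bucketed into accuracy thresholds (e.g. city-, region-, or country-level) by comparing this distance against fixed radii.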
## Citation
If you find this work helpful or inspiring, please consider citing the paper:
```bibtex
@misc{wang2025geovistawebaugmentedagenticvisual,
  title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization},
  author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
  year={2025},
  eprint={2511.15705},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.15705},
}
```
## Acknowledgements
We thank Tavily and Google Cloud for providing a reliable web-search API and geocoding services for research use. We also thank Mapillary for providing high-quality street-level images from around the world, and the contributors to the VeRL, TRL, gpt-researcher, and DeepEyes repositories for their open-source frameworks and research.