---
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
language:
  - en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

GeoVista-RL-6k-7B is an agentic model presented in the paper GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization (arXiv:2511.15705). It is designed for geolocalization, a task that requires nuanced visual grounding and web search to confirm or refine hypotheses during reasoning. GeoVista seamlessly integrates tool invocation within its reasoning loop, such as an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information.

GeoVista achieves strong performance, surpassing other open-source agentic models on the geolocalization task and achieving performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

GeoVista Agentic Pipeline
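
The reasoning loop illustrated in the pipeline above can be outlined roughly as follows. This is an illustrative sketch only: the tool tags, tool names, and helper functions (chat, parse_box, parse_query, web_search) are hypothetical placeholders and do not reflect the repository's actual interface.

# Illustrative sketch of a tool-augmented reasoning loop (hypothetical interface).
# The real tool schema and parsing logic live in the GeoVista GitHub repository.
def agentic_geolocalization(model, image, question, max_turns=6):
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]}]
    for _ in range(max_turns):
        reply = chat(model, messages)  # hypothetical helper wrapping model.generate()
        messages.append({"role": "assistant", "content": reply})
        if "<answer>" in reply:        # final location prediction reached
            return reply
        if "<zoom>" in reply:          # image-zoom-in tool: magnify a region of interest
            observation = image.crop(parse_box(reply))    # hypothetical box parser
        elif "<search>" in reply:      # web-search tool: retrieve related web information
            observation = web_search(parse_query(reply))  # hypothetical query parser
        else:
            break
        messages.append({"role": "tool", "content": observation})
    return reply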

Usage with Transformers

You can use GeoVista-RL-6k-7B directly with the Hugging Face Transformers library.

First, ensure you have the transformers and accelerate libraries installed:

pip install transformers accelerate

Then, you can perform basic inference as follows. Note that full agentic behavior involving web search requires additional setup (such as a Tavily API key and deployment with vLLM), as described in the GitHub repository; a rough sketch of a web-search backend is shown after the snippet. The snippet below demonstrates direct VLM capabilities.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load the model and processor
model_id = "LibraTree/GeoVista-RL-6k-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use torch.float16 or torch.bfloat16 for efficiency
    device_map="auto",
    trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load your image (replace with your image path)
# Example image path from the GitHub repo: https://github.com/ekonwang/GeoVista/blob/main/examples/geobench-example.png
image = Image.open("examples/geobench-example.png").convert("RGB") 

# Define the conversational prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Please analyze where is the place."},\
        ],
    }
]

# Apply chat template and process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate a response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)
# The output typically contains reasoning steps and a final location prediction
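
For the full agentic pipeline, the web-search tool needs a backend. Below is a minimal sketch assuming the publicly available tavily-python client (pip install tavily-python) and a TAVILY_API_KEY environment variable; the actual tool wiring used by GeoVista is described in the GitHub repository and may differ.

# Minimal web-search backend sketch using the tavily-python client
# (an assumption for illustration, not necessarily the repository's implementation).
import os
from tavily import TavilyClient

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def web_search(query: str, max_results: int = 5) -> str:
    """Return a short textual digest of web results for the model to read."""
    results = tavily.search(query=query, max_results=max_results)
    return "\n".join(f"{r['title']}: {r['content']}" for r in results["results"])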

Benchmark

GeoVista was evaluated on the newly curated GeoBench dataset, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models.

GeoBench is the first high-resolution, multi-source, globally annotated dataset to evaluate agentic models’ general geolocalization ability. The benchmark assesses models along five axes: Global Coverage (GC), Reasonable Localizability (RC), High Resolution (HR), Data Variety (DV), and Nuanced Evaluation (NE).

Benchmark Year GC RC HR DV NE
Im2GPS 2008 βœ“
YFCC4k 2017 βœ“
Google Landmarks v2 2020 βœ“
VIGOR 2022 βœ“
OSV-5M 2024 βœ“ βœ“ βœ“
GeoComp 2025 βœ“ βœ“ βœ“
GeoBench (ours) 2025 βœ“ βœ“ βœ“ βœ“ βœ“
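
For context, geolocalization benchmarks are commonly scored by the great-circle (haversine) distance between predicted and ground-truth coordinates, with accuracy reported at a set of distance thresholds. The snippet below illustrates that general convention with placeholder thresholds; it is not GeoBench's official evaluation protocol.

# Generic haversine-distance scoring used by many geolocalization benchmarks.
# Thresholds are illustrative placeholders, not GeoBench's official settings.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_thresholds(preds, targets, thresholds_km=(25, 200, 750)):
    """Fraction of predictions falling within each distance threshold."""
    dists = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, targets)]
    return {k: sum(d <= k for d in dists) / len(dists) for k in thresholds_km}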

Citation

If you find this work helpful or inspiring, please consider citing the paper:

@misc{wang2025geovistawebaugmentedagenticvisual,
      title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization}, 
      author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
      year={2025},
      eprint={2511.15705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15705}, 
}

Acknowledgements

We thank Tavily and Google Cloud for providing reliable web-search and geocoding services for research use, and Mapillary for providing high-quality street-level imagery from around the world. We also thank the contributors to the VeRL, TRL, gpt-researcher, and DeepEyes repositories for their open-source frameworks and research.