Enhance model card: Add metadata, links, description, and usage example
This PR significantly enhances the model card for `LibraTree/GeoVista-RL-6k-7B` by adding crucial metadata and detailed content:
- **Metadata**: Added `pipeline_tag: image-text-to-text` and `library_name: transformers`; `transformers` compatibility is evidenced by the repository's `config.json` and its `architectures` field.
- **Description**: Provided an introductory description of the GeoVista model and its application to geolocalization tasks.
- **Links**: Included direct links to the paper ([GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)), the official project page, and the GitHub repository.
- **Sample Usage**: Added a Python code snippet demonstrating basic inference with the `transformers` library, which is consistent with the model's `transformers` compatibility and enables the automated "how to use" widget on the Hub.
- **Visuals & Benchmarking**: Incorporated an illustrative image of the agentic pipeline and a section detailing the GeoBench benchmark, along with its comparison table.
- **Citation**: Included the BibTeX entry for easy citation.
These additions make the model more discoverable, provide comprehensive information, and offer a clear starting point for users.
The updated `README.md`:

---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

The `GeoVista-RL-6k-7B` model is an agentic model presented in the paper [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705). This model is designed for geolocalization tasks, which require nuanced visual grounding and web search to confirm or refine hypotheses during reasoning. GeoVista seamlessly integrates tool invocation, such as an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information, within its reasoning loop.

GeoVista achieves strong performance, surpassing other open-source agentic models on the geolocalization task and achieving performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

- **Paper**: [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)
- **Project Page**: https://ekonwang.github.io/geo-vista/
- **GitHub Repository**: https://github.com/ekonwang/GeoVista
- **GeoBench Dataset**: https://huggingface.co/datasets/LibraTree/GeoBench

<div align="center">
<img src="https://github.com/ekonwang/GeoVista/raw/main/assets/agentic_pipeline.webp" alt="GeoVista Agentic Pipeline" width="70%"/>
</div>

## Usage with Transformers

You can use `GeoVista-RL-6k-7B` directly with the Hugging Face Transformers library.

First, ensure you have the `transformers` and `accelerate` libraries installed:

```bash
pip install transformers accelerate
```

Then, you can perform basic inference as follows. Note that full agentic behavior involving web search requires additional setup (such as a Tavily API key and deployment with vLLM), as described in the [GitHub repository](https://github.com/ekonwang/GeoVista). This snippet demonstrates the model's direct VLM capabilities.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load the model and processor; GeoVista is fine-tuned from Qwen2.5-VL,
# so the Qwen2.5-VL model class applies
model_id = "LibraTree/GeoVista-RL-6k-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float16 if bfloat16 is unsupported
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Load your image (replace with your image path)
# Example image from the GitHub repo: https://github.com/ekonwang/GeoVista/blob/main/examples/geobench-example.png
image = Image.open("examples/geobench-example.png").convert("RGB")

# Define the conversational prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Please analyze where this place is."},
        ],
    }
]

# Apply the chat template and process the inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate a response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
new_tokens = generated_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

print(generated_text)
# The output typically includes reasoning steps and a final location prediction
```
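
The card describes an image-zoom-in tool that magnifies regions of interest during reasoning. As a rough, hypothetical sketch of what such a tool does (the function name, bounding-box format, and upscale factor here are assumptions, not the repository's actual implementation):

```python
from PIL import Image

def image_zoom_in(image: Image.Image, bbox: tuple, upscale: int = 2) -> Image.Image:
    """Crop the (left, top, right, bottom) region and magnify it.

    Hypothetical sketch of the zoom tool described in the paper; the real
    implementation in the GeoVista repository may differ.
    """
    region = image.crop(bbox)
    w, h = region.size
    return region.resize((w * upscale, h * upscale), Image.LANCZOS)

# Demo on a synthetic image
img = Image.new("RGB", (640, 480), "gray")
zoomed = image_zoom_in(img, (100, 100, 200, 180), upscale=2)
print(zoomed.size)  # (200, 160)
```

Within the agent loop, the magnified crop would be fed back to the model as an additional image turn so it can re-examine fine-grained cues such as signage or license plates.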

## Benchmark

GeoVista was evaluated on the newly curated [GeoBench dataset](https://huggingface.co/datasets/LibraTree/GeoBench), a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models.

<p align="center">
<img src="https://github.com/ekonwang/GeoVista/raw/main/assets/figure-3-benchmark.webp" width="50%">
</p>

GeoBench is the first high-resolution, multi-source, globally annotated dataset to evaluate agentic models' general geolocalization ability. The benchmark assesses models along five axes: **Global Coverage (GC)**, **Reasonable Localizability (RC)**, **High Resolution (HR)**, **Data Variety (DV)**, and **Nuanced Evaluation (NE)**.

| **Benchmark** | **Year** | **GC** | **RC** | **HR** | **DV** | **NE** |
| :------------ | -------: | :----: | :----: | :----: | :----: | :----: |
| **[Im2GPS](https://doi.org/10.1109/CVPR.2008.4587784)** | 2008 | ✓ | | | | |
| **[YFCC4k](https://arxiv.org/abs/1705.04838)** | 2017 | ✓ | | | | |
| **[Google Landmarks v2](https://arxiv.org/abs/2004.01804)** | 2020 | ✓ | | | | |
| **[VIGOR](https://arxiv.org/abs/2011.12172)** | 2022 | | | | ✓ | |
| **[OSV-5M](https://arxiv.org/abs/2404.18873)** | 2024 | ✓ | ✓ | | | ✓ |
| **[GeoComp](https://doi.org/10.48550/arXiv.2502.13759)** | 2025 | ✓ | ✓ | | | ✓ |
| **GeoBench (ours)** | 2025 | ✓ | ✓ | ✓ | ✓ | ✓ |
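
For readers building their own evaluation: geolocalization predictions are commonly scored by great-circle distance to the ground-truth coordinate, with accuracy reported at distance thresholds. The sketch below uses the standard haversine formula; the threshold value is illustrative, not GeoBench's exact protocol:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at(preds, gts, threshold_km):
    """Fraction of predictions within threshold_km of the ground truth."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(preds, gts)
    )
    return hits / len(preds)

# Predicted Paris city center vs. a ground-truth point near Versailles
preds = [(48.8566, 2.3522)]
gts = [(48.8049, 2.1204)]
dist = haversine_km(preds[0][0], preds[0][1], gts[0][0], gts[0][1])
print(round(dist, 1))                 # roughly 18 km
print(accuracy_at(preds, gts, 25.0))  # 1.0 (within an illustrative 25 km band)
```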

## Citation

If you find this work helpful or inspiring, please consider citing the paper:

```bibtex
@misc{wang2025geovistawebaugmentedagenticvisual,
      title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization},
      author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
      year={2025},
      eprint={2511.15705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15705},
}
```

## Acknowledgements

We thank [Tavily](https://www.tavily.com/) and [Google Cloud](https://cloud.google.com/) for providing a reliable web-search API and geocoding services for research use. We also thank [Mapillary](https://www.mapillary.com/?locale=zh_CN) for providing high-quality street-level images from around the world.

We would also like to thank the contributors to the [VeRL](https://github.com/volcengine/verl), [TRL](https://github.com/huggingface/trl), [gpt-researcher](https://github.com/assafelovic/gpt-researcher) and [DeepEyes](https://github.com/Visual-Agent/DeepEyes) repositories for their open-source frameworks and research.