nielsr (HF Staff) committed
Commit 25dd9ff · verified · 1 Parent(s): 065ff55

Enhance model card: Add metadata, links, description, and usage example


This PR significantly enhances the model card for `LibraTree/GeoVista-RL-6k-7B` by adding crucial metadata and detailed content:

- **Metadata**: Added `pipeline_tag: image-text-to-text` and `library_name: transformers`. Compatibility with `transformers` is evidenced by the model's `config.json` and its `architectures` field.
- **Description**: Provided an introductory description of the GeoVista model and its application to geolocalization tasks.
- **Links**: Included direct links to the paper ([GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)), the official project page, and the GitHub repository.
- **Sample Usage**: Added a Python code snippet demonstrating basic inference with the `transformers` library, which is consistent with the model's `transformers` compatibility and enables the automated "how to use" widget on the Hub.
- **Visuals & Benchmarking**: Incorporated an illustrative image of the agentic pipeline and a section detailing the GeoBench benchmark, along with its comparison table.
- **Citation**: Included the BibTeX entry for easy citation.

These additions make the model more discoverable, provide comprehensive information, and offer a clear starting point for users.
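As a quick sanity check that the added front matter is well-formed, the metadata block from the diff parses cleanly with PyYAML (values copied verbatim from the README diff; the variable name `front_matter` is illustrative):

```python
import yaml  # PyYAML

# Front-matter keys added by this PR, copied from the README diff
front_matter = """\
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
"""

meta = yaml.safe_load(front_matter)
print(meta["pipeline_tag"])   # image-text-to-text
print(meta["base_model"][0])  # Qwen/Qwen2.5-VL-7B-Instruct
```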

Files changed (1): README.md (+115, -4)
README.md CHANGED
@@ -1,7 +1,118 @@
  ---
- license: apache-2.0
- language:
- - en
  base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
- ---
  ---
  base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
+ language:
+ - en
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ # GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
+
+ The `GeoVista-RL-6k-7B` model is an agentic model presented in the paper [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705). It is designed for geolocalization tasks, which require nuanced visual grounding and web search to confirm or refine hypotheses during reasoning. GeoVista seamlessly integrates tool invocation, such as an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information, within its reasoning loop.
+
+ GeoVista achieves strong performance, surpassing other open-source agentic models on the geolocalization task and performing comparably to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
+
+ - **Paper**: [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)
+ - **Project Page**: https://ekonwang.github.io/geo-vista/
+ - **GitHub Repository**: https://github.com/ekonwang/GeoVista
+ - **GeoBench Dataset**: https://huggingface.co/datasets/LibraTree/GeoBench
+
+ <div align="center">
+ <img src="https://github.com/ekonwang/GeoVista/raw/main/assets/agentic_pipeline.webp" alt="GeoVista Agentic Pipeline" width="70%"/>
+ </div>
+
+ ## Usage with Transformers
+
+ You can use `GeoVista-RL-6k-7B` directly with the Hugging Face Transformers library.
+
+ First, ensure you have the `transformers` and `accelerate` libraries installed:
+ ```bash
+ pip install transformers accelerate
+ ```
+
+ Then, you can perform basic inference as follows. Note that full agentic behavior involving web search requires additional setup (such as a Tavily API key and deployment with vLLM), as described in the [GitHub repository](https://github.com/ekonwang/GeoVista). This snippet demonstrates direct VLM capabilities only.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ # Load the model and processor
+ model_id = "LibraTree/GeoVista-RL-6k-7B"
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # use torch.float16 if bfloat16 is unsupported
+     device_map="auto",
+     trust_remote_code=True,
+ ).eval()
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # Load your image (replace with your image path)
+ # Example image from the GitHub repo: https://github.com/ekonwang/GeoVista/blob/main/examples/geobench-example.png
+ image = Image.open("examples/geobench-example.png").convert("RGB")
+
+ # Define the conversational prompt
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Please analyze where is the place."},
+         ],
+     }
+ ]
+
+ # Apply chat template and process inputs
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+
+ # Generate a response, stripping the prompt tokens before decoding
+ generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+ output_ids = generated_ids[:, inputs.input_ids.shape[1]:]
+ generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
+
+ print(generated_text)
+ # The output typically includes reasoning steps and a final location prediction
+ ```
+
+ ## Benchmark
+
+ GeoVista was evaluated on the newly curated [GeoBench dataset](https://huggingface.co/datasets/LibraTree/GeoBench), a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models.
+
+ <p align="center">
+ <img src="https://github.com/ekonwang/GeoVista/raw/main/assets/figure-3-benchmark.webp" width="50%">
+ </p>
+
+ GeoBench is the first high-resolution, multi-source, globally annotated dataset for evaluating agentic models' general geolocalization ability. The benchmark assesses models along five axes: **Global Coverage (GC)**, **Reasonable Localizability (RC)**, **High Resolution (HR)**, **Data Variety (DV)**, and **Nuanced Evaluation (NE)**.
+
+ | **Benchmark** | **Year** | **GC** | **RC** | **HR** | **DV** | **NE** |
+ | :------------ | -------: | :----: | :----: | :----: | :----: | :----: |
+ | **[Im2GPS](https://doi.org/10.1109/CVPR.2008.4587784)** | 2008 | ✓ | | | | |
+ | **[YFCC4k](https://arxiv.org/abs/1705.04838)** | 2017 | ✓ | | | | |
+ | **[Google Landmarks v2](https://arxiv.org/abs/2004.01804)** | 2020 | ✓ | | | | |
+ | **[VIGOR](https://arxiv.org/abs/2011.12172)** | 2022 | | | | ✓ | |
+ | **[OSV-5M](https://arxiv.org/abs/2404.18873)** | 2024 | ✓ | ✓ | | | ✓ |
+ | **[GeoComp](https://doi.org/10.48550/arXiv.2502.13759)** | 2025 | ✓ | ✓ | | | ✓ |
+ | **GeoBench (ours)** | 2025 | ✓ | ✓ | ✓ | ✓ | ✓ |
+
+ ## Citation
+
+ If you find this work helpful or inspiring, please consider citing the paper:
+
+ ```bibtex
+ @misc{wang2025geovistawebaugmentedagenticvisual,
+   title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization},
+   author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
+   year={2025},
+   eprint={2511.15705},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2511.15705},
+ }
+ ```
+
+ ## Acknowledgements
+
+ We thank [Tavily](https://www.tavily.com/) and [Google Cloud](https://cloud.google.com/) for providing reliable web-search and geocoding services for research use. We also thank [Mapillary](https://www.mapillary.com/?locale=zh_CN) for providing high-quality street-level images from around the world.
+ We would like to thank the contributors to the [VeRL](https://github.com/volcengine/verl), [TRL](https://github.com/huggingface/trl), [gpt-researcher](https://github.com/assafelovic/gpt-researcher) and [DeepEyes](https://github.com/Visual-Agent/DeepEyes) repositories for their open-source frameworks and research.