Enhance model card: Add metadata, links, description, and usage example
This PR significantly enhances the model card for `LibraTree/GeoVista-RL-6k-7B` by adding crucial metadata and detailed content:
- **Metadata**: Added `pipeline_tag: image-text-to-text` and `library_name: transformers`; `transformers` compatibility is evidenced by the repository's `config.json` and its `architectures` field.
- **Description**: Provided an introductory description of the GeoVista model and its application to geolocalization tasks.
- **Links**: Included direct links to the paper ([GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)), the official project page, and the GitHub repository.
- **Sample Usage**: Added a Python code snippet demonstrating basic inference with the `transformers` library, which is consistent with the model's `transformers` compatibility and enables the automated "how to use" widget on the Hub.
- **Visuals & Benchmarking**: Incorporated an illustrative image of the agentic pipeline and a section detailing the GeoBench benchmark, along with its comparison table.
- **Citation**: Included the BibTeX entry for easy citation.
These additions make the model more discoverable, provide comprehensive information, and offer a clear starting point for users.
The updated `README.md`:

---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

The `GeoVista-RL-6k-7B` model is an agentic model presented in the paper [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705). This model is designed for geolocalization tasks, which require nuanced visual grounding and web search to confirm or refine hypotheses during reasoning. GeoVista seamlessly integrates tool invocation, such as an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information, within its reasoning loop.

GeoVista achieves strong performance, surpassing other open-source agentic models on the geolocalization task and achieving performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

- **Paper**: [GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization](https://huggingface.co/papers/2511.15705)
- **Project Page**: https://ekonwang.github.io/geo-vista/
- **GitHub Repository**: https://github.com/ekonwang/GeoVista
- **GeoBench Dataset**: https://huggingface.co/datasets/LibraTree/GeoBench

<div align="center">
<img src="https://github.com/ekonwang/GeoVista/raw/main/assets/agentic_pipeline.webp" alt="GeoVista Agentic Pipeline" width="70%"/>
</div>

## Usage with Transformers

You can use `GeoVista-RL-6k-7B` directly with the Hugging Face Transformers library.

First, ensure you have the `transformers` and `accelerate` libraries installed:

```bash
pip install transformers accelerate
```

Then, you can perform basic inference as follows. Note that full agentic behavior involving web search requires additional setup (such as a Tavily API key and deployment with vLLM), as described in the [GitHub repository](https://github.com/ekonwang/GeoVista). This snippet demonstrates the model's direct VLM capabilities.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load the model and processor; GeoVista is fine-tuned from Qwen2.5-VL,
# so the Qwen2.5-VL model class applies
model_id = "LibraTree/GeoVista-RL-6k-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float16 if bfloat16 is unsupported
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# Load your image (replace with your image path)
# Example image from the GitHub repo: https://github.com/ekonwang/GeoVista/blob/main/examples/geobench-example.png
image = Image.open("examples/geobench-example.png").convert("RGB")

# Define the conversational prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Please analyze where this place is."},
        ],
    }
]

# Apply the chat template and process the inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate a response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
new_tokens = generated_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

print(generated_text)
# The output typically includes reasoning steps and a final location prediction
```
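
The card describes an image-zoom-in tool that magnifies regions of interest during reasoning. As a rough, hypothetical sketch of what such a tool does (the function name, bounding-box format, and upscale factor here are assumptions, not the repository's actual implementation):

```python
from PIL import Image

def image_zoom_in(image: Image.Image, bbox: tuple, upscale: int = 2) -> Image.Image:
    """Crop the (left, top, right, bottom) region and magnify it.

    Hypothetical sketch of the zoom tool described in the paper; the real
    implementation in the GeoVista repository may differ.
    """
    region = image.crop(bbox)
    w, h = region.size
    return region.resize((w * upscale, h * upscale), Image.LANCZOS)

# Demo on a synthetic image
img = Image.new("RGB", (640, 480), "gray")
zoomed = image_zoom_in(img, (100, 100, 200, 180), upscale=2)
print(zoomed.size)  # (200, 160)
```

Within the agent loop, the magnified crop would be fed back to the model as an additional image turn so it can re-examine fine-grained cues such as signage or license plates.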

## Benchmark

GeoVista was evaluated on the newly curated [GeoBench dataset](https://huggingface.co/datasets/LibraTree/GeoBench), a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models.

<p align="center">
<img src="https://github.com/ekonwang/GeoVista/raw/main/assets/figure-3-benchmark.webp" width="50%">
</p>

GeoBench is the first high-resolution, multi-source, globally annotated dataset to evaluate agentic models' general geolocalization ability. The benchmark assesses models along five axes: **Global Coverage (GC)**, **Reasonable Localizability (RC)**, **High Resolution (HR)**, **Data Variety (DV)**, and **Nuanced Evaluation (NE)**.

| **Benchmark** | **Year** | **GC** | **RC** | **HR** | **DV** | **NE** |
| :------------ | -------: | :----: | :----: | :----: | :----: | :----: |
| **[Im2GPS](https://doi.org/10.1109/CVPR.2008.4587784)** | 2008 | ✓ | | | | |
| **[YFCC4k](https://arxiv.org/abs/1705.04838)** | 2017 | ✓ | | | | |
| **[Google Landmarks v2](https://arxiv.org/abs/2004.01804)** | 2020 | ✓ | | | | |
| **[VIGOR](https://arxiv.org/abs/2011.12172)** | 2022 | | | | ✓ | |
| **[OSV-5M](https://arxiv.org/abs/2404.18873)** | 2024 | ✓ | ✓ | | | ✓ |
| **[GeoComp](https://doi.org/10.48550/arXiv.2502.13759)** | 2025 | ✓ | ✓ | | | ✓ |
| **GeoBench (ours)** | 2025 | ✓ | ✓ | ✓ | ✓ | ✓ |
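
For readers building their own evaluation: geolocalization predictions are commonly scored by great-circle distance to the ground-truth coordinate, with accuracy reported at distance thresholds. The sketch below uses the standard haversine formula; the threshold value is illustrative, not GeoBench's exact protocol:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at(preds, gts, threshold_km):
    """Fraction of predictions within threshold_km of the ground truth."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= threshold_km
        for p, g in zip(preds, gts)
    )
    return hits / len(preds)

# Predicted Paris city center vs. a ground-truth point near Versailles
preds = [(48.8566, 2.3522)]
gts = [(48.8049, 2.1204)]
dist = haversine_km(preds[0][0], preds[0][1], gts[0][0], gts[0][1])
print(round(dist, 1))                 # roughly 18 km
print(accuracy_at(preds, gts, 25.0))  # 1.0 (within an illustrative 25 km band)
```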

## Citation

If you find this work helpful or inspiring, please consider citing the paper:

```bibtex
@misc{wang2025geovistawebaugmentedagenticvisual,
      title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization},
      author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
      year={2025},
      eprint={2511.15705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15705},
}
```

## Acknowledgements

We thank [Tavily](https://www.tavily.com/) and [Google Cloud](https://cloud.google.com/) for providing a reliable web-search API and geocoding services for research use. We also thank [Mapillary](https://www.mapillary.com/?locale=zh_CN) for providing high-quality street-level images from around the world.

We would also like to thank the contributors to the [VeRL](https://github.com/volcengine/verl), [TRL](https://github.com/huggingface/trl), [gpt-researcher](https://github.com/assafelovic/gpt-researcher) and [DeepEyes](https://github.com/Visual-Agent/DeepEyes) repositories for their open-source frameworks and research.