| | --- |
| | license: cc-by-nc-4.0 |
| | language: |
| | - en |
| | pipeline_tag: zero-shot-image-classification |
| | widget: |
| | - src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/nagasaki.jpg |
| | candidate_labels: China, South Korea, Japan, Philippines, Taiwan, Vietnam, Cambodia |
| | example_title: Countries |
| | - src: https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg |
| | candidate_labels: San Jose, San Diego, Los Angeles, Las Vegas, San Francisco, Seattle |
| | example_title: Cities |
| | library_name: transformers |
| | tags: |
| | - geolocalization |
| | - geolocation |
| | - geographic |
| | - street |
| | - climate |
| | - clip |
| | - urban |
| | - rural |
| | - multi-modal |
| | --- |
| | # Model Card for StreetCLIP |
| |
|
| | StreetCLIP is a robust foundation model for open-domain image geolocalization and other |
| | geographic and climate-related tasks. |
| |
|
| | Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves |
| | state-of-the-art performance on multiple open-domain image geolocalization benchmarks in a zero-shot setting, |
| | outperforming supervised models trained on millions of images. |
| |
|
| | # Model Description |
| |
|
| | StreetCLIP is a model pretrained by deriving image captions synthetically from image class labels using |
| | a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning |
| | capabilities to a specific domain (i.e. the domain of image geolocalization). |
| | StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel |
| | patches and images with a 336 pixel side length. |
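
As a rough illustration of deriving captions from class labels, the sketch below fills a caption template with a location label. The template wording, the `GeoLabel` class, and the `make_caption` helper are hypothetical and chosen only for illustration; the template actually used for StreetCLIP is described in the paper.

```python
from dataclasses import dataclass

# Hypothetical template: the exact wording used to train StreetCLIP is given in the paper.
CAPTION_TEMPLATE = "A street-level photo taken in {city}, {region}, {country}."


@dataclass(frozen=True)
class GeoLabel:
    """Class label attached to a geo-tagged training image (illustrative structure)."""
    city: str
    region: str
    country: str


def make_caption(label: GeoLabel, template: str = CAPTION_TEMPLATE) -> str:
    """Turn a class label into a synthetic caption by filling the template."""
    return template.format(city=label.city, region=label.region, country=label.country)


labels = [
    GeoLabel("Nagasaki", "Nagasaki Prefecture", "Japan"),
    GeoLabel("San Francisco", "California", "United States"),
]
captions = [make_caption(label) for label in labels]
for caption in captions:
    print(caption)
```

Each image is then paired with its generated caption, so the usual CLIP image-text objective can be reused on labeled data without hand-written captions.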
| |
|
| | ## Model Details |
| |
|
| | - **Model type:** [CLIP](https://openai.com/blog/clip/) |
| | - **Language:** English |
| | - **License:** Creative Commons Attribution-NonCommercial 4.0 |
| | - **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) |
| |
|
| | ## Model Sources |
| |
|
| | - **Paper:** [Preprint](https://arxiv.org/abs/2302.00275) |
| | - **Cite preprint as:** |
| | ```bibtex |
| | @misc{haas2023learning, |
| | title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization}, |
| | author={Lukas Haas and Silas Alberti and Michal Skreta}, |
| | year={2023}, |
| | eprint={2302.00275}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CV} |
| | } |
| | ``` |
| |
|
| | # Uses |
| |
|
| | StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes |
| | and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup, |
| | the following use cases are recommended for StreetCLIP. |
| |
|
| | ## Direct Use |
| |
|
| | StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region, |
| | or city level. Given that StreetCLIP was pretrained on a dataset of street-level urban and rural images, |
| | the best performance can be expected on images from a similar distribution. |
| |
|
| | Broader direct use cases are any zero-shot image classification tasks that rely on urban and rural street-level |
| | understanding or geographical information relating visual clues to their region of origin. |
| |
|
| | ## Downstream Use |
| |
|
| | StreetCLIP can be finetuned for any downstream applications that require geographic or street-level urban or rural |
| | scene understanding. Examples of use cases are the following: |
| |
|
| | **Understanding the Built Environment** |
| |
|
| | - Analyzing building quality |
| | - Building type classification |
| | - Building energy efficiency classification |
| |
|
| | **Analyzing Infrastructure** |
| |
|
| | - Analyzing road quality |
| | - Utility pole maintenance |
| | - Identifying damage from natural disasters or armed conflicts |
| |
|
| | **Understanding the Natural Environment** |
| |
|
| | - Mapping vegetation |
| | - Vegetation classification |
| | - Soil type classification |
| | - Tracking deforestation |
| |
|
| | **General Use Cases** |
| |
|
| | - Street-level image segmentation |
| | - Urban and rural scene classification |
| | - Object detection in urban or rural environments |
| | - Improving navigation and self-driving car technology |
| |
|
| | ## Out-of-Scope Use |
| |
|
| | Any use cases attempting to geolocate users' private images are out-of-scope and discouraged. |
| |
|
| | # Bias, Risks, and Limitations |
| |
|
| | StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case |
| | attempting to geolocalize users' private images is out-of-scope and discouraged. |
| |
|
| | ## Recommendations |
| | We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many. |
| | The first three categories listed under Downstream Use (the built environment, infrastructure, and the natural |
| | environment) are good starting points to explore. |
| |
|
| | ## How to Get Started with the Model |
| |
|
| | Use the code below to get started with the model. |
| |
|
| | ```python |
| | from PIL import Image |
| | import requests |
| | |
| | from transformers import CLIPProcessor, CLIPModel |
| | |
| | model = CLIPModel.from_pretrained("geolocal/StreetCLIP") |
| | processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP") |
| | |
| | url = "https://huggingface.co/geolocal/StreetCLIP/resolve/main/sanfrancisco.jpeg" |
| | image = Image.open(requests.get(url, stream=True).raw) |
| | |
| | choices = ["San Jose", "San Diego", "Los Angeles", "Las Vegas", "San Francisco"] |
| | inputs = processor(text=choices, images=image, return_tensors="pt", padding=True) |
| | |
| | outputs = model(**inputs) |
| | logits_per_image = outputs.logits_per_image # this is the image-text similarity score |
| | probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities |
| | ``` |
| |
|
| | # Training Details |
| |
|
| | ## Training Data |
| |
|
| | StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world, |
| | urban and rural images. The data used to train the model comes from 101 countries, biased towards |
| | western countries and not including India and China. |
| |
|
| | ## Preprocessing |
| |
|
| | Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336). |
| |
|
| | ## Training Procedure |
| |
|
| | StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic |
| | caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained |
| | for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, a batch size of 32, |
| | and gradient accumulation of 12 steps. |
| |
|
| | StreetCLIP was trained with the goal of matching each image in the batch |
| | with the caption corresponding to the correct city, region, and country of the image's origin. |
| |
|
| | # Evaluation |
| |
|
| | StreetCLIP was evaluated in a zero-shot setting on two open-domain image geolocalization benchmarks using a |
| | technique called hierarchical linear probing. Hierarchical linear probing sequentially attempts to |
| | identify the correct country and then city of geographical image origin. |
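
The country-then-city narrowing described above can be sketched as follows. This is a minimal illustration and not the evaluation code from the paper: `hierarchical_predict`, `pick_best`, and the `score` callable are hypothetical names, and `toy_score` stands in for a real scorer that would return StreetCLIP's image-text similarities for a list of candidate labels.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer maps candidate labels to one score per label (higher is better).
# With StreetCLIP this could wrap the softmax over `logits_per_image`; that wiring is assumed here.
Scorer = Callable[[Sequence[str]], List[float]]


def pick_best(labels: Sequence[str], score: Scorer) -> str:
    """Return the label with the highest score."""
    scores = score(labels)
    best_index = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best_index]


def hierarchical_predict(cities_by_country: Dict[str, List[str]], score: Scorer) -> Tuple[str, str]:
    """First choose a country, then choose among that country's cities only."""
    country = pick_best(list(cities_by_country), score)
    city = pick_best(cities_by_country[country], score)
    return country, city


def toy_score(labels: Sequence[str]) -> List[float]:
    """Stand-in scorer with fixed preferences, used only to make the sketch runnable."""
    preferred = {"Japan": 0.9, "Nagasaki": 0.8}
    return [preferred.get(label, 0.1) for label in labels]


candidates = {"Japan": ["Tokyo", "Nagasaki"], "South Korea": ["Seoul", "Busan"]}
prediction = hierarchical_predict(candidates, toy_score)
print(prediction)
```

Restricting the second step to cities in the predicted country keeps the candidate list short, at the cost of making a wrong country prediction unrecoverable.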
| |
|
| | ## Testing Data and Metrics |
| |
|
| | ### Testing Data |
| |
|
| | StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks. |
| |
|
| | * [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/) |
| | * [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps) |
| |
|
| | ### Metrics |
| |
|
| | The objective of the listed benchmark datasets is to predict the images' coordinates of origin with as |
| | little deviation as possible. A common metric set forth in prior literature is called Percentage at Kilometer (% @ KM). |
| | The Percentage at Kilometer metric first calculates the distance in kilometers between the predicted coordinates |
| | and the ground-truth coordinates and then reports the percentage of error distances that fall below a given kilometer threshold. |
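
A minimal sketch of the metric follows. It assumes great-circle (haversine) distance on a spherical Earth with a 6371 km radius, and the coordinates in the example are made up for illustration; the function names are not taken from any benchmark toolkit.

```python
import math
from typing import Dict, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical model)
Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = math.sin(delta_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def percentage_at_km(
    predicted: Sequence[Coordinate],
    truth: Sequence[Coordinate],
    thresholds: Sequence[float] = (25, 200, 750, 2500),
) -> Dict[float, float]:
    """Percentage of predictions whose error distance is below each threshold."""
    if not predicted or len(predicted) != len(truth):
        raise ValueError("predicted and truth must be non-empty and of equal length")
    errors = [haversine_km(*p, *t) for p, t in zip(predicted, truth)]
    return {km: 100.0 * sum(e < km for e in errors) / len(errors) for km in thresholds}


# Illustrative coordinates: one exact hit, one prediction several hundred kilometers off.
predicted = [(37.7749, -122.4194), (32.75, 129.87)]
truth = [(37.7749, -122.4194), (35.6762, 139.6503)]
result = percentage_at_km(predicted, truth)
print(result)
```

The thresholds match the columns reported in the results tables (25 km, 200 km, 750 km, and 2,500 km).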
| |
|
| | ## Results |
| |
|
| | **IM2GPS** |
| | | Model | 25km | 200km | 750km | 2,500km | |
| | |----------|:-------------:|:------:|:------:|:------:| |
| | | PlaNet (2016) | 24.5 | 37.6 | 53.6 | 71.3 | |
| | | ISNs (2018) | 43.0 | 51.9 | 66.7 | 80.2 | |
| | | TransLocator (2022) | **48.1** | **64.6** | **75.6** | 86.7 | |
| | | **Zero-Shot CLIP (ours)** | 27.0 | 42.2 | 71.7 | 86.9 | |
| | | **Zero-Shot StreetCLIP (ours)** | 28.3 | 45.1 | 74.7 | **88.2** | |
| | Metric: Percentage at Kilometer (% @ KM) |
| |
|
| | **IM2GPS3K** |
| | | Model | 25km | 200km | 750km | 2,500km | |
| | |----------|:-------------:|:------:|:------:|:------:| |
| | | PlaNet (2016) | 24.8 | 34.3 | 48.4 | 64.6 | |
| | | ISNs (2018) | 28.0 | 36.6 | 49.7 | 66.0 | |
| | | TransLocator (2022) | **31.1** | **46.7** | 58.9 | 80.1 | |
| | | **Zero-Shot CLIP (ours)** | 19.5 | 34.0 | 60.0 | 78.1 | |
| | | **Zero-Shot StreetCLIP (ours)** | 22.4 | 37.4 | **61.3** | **80.4** | |
| | Metric: Percentage at Kilometer (% @ KM) |
| |
|
| |
|
| | ### Summary |
| |
|
| | Our experiments demonstrate that our synthetic caption pretraining method is capable of significantly |
| | improving CLIP's generalized zero-shot capabilities applied to open-domain image geolocalization while |
| | achieving state-of-the-art performance on a selection of benchmark metrics. |
| |
|
| | # Environmental Impact |
| |
|
| | - **Hardware Type:** 4 NVIDIA A100 GPUs |
| | - **Hours used:** 12 |
| |
|
| | # Citation |
| |
|
| | Cite preprint as: |
| |
|
| | ```bibtex |
| | @misc{haas2023learning, |
| | title={Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization}, |
| | author={Lukas Haas and Silas Alberti and Michal Skreta}, |
| | year={2023}, |
| | eprint={2302.00275}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CV} |
| | } |
| | ``` |
| |
|