How to use MVRL/Sat2Cap with Transformers:

    # Load model directly
    from transformers import AutoTokenizer, CLIPVisionModelWithProjection

    tokenizer = AutoTokenizer.from_pretrained("MVRL/Sat2Cap")
    model = CLIPVisionModelWithProjection.from_pretrained("MVRL/Sat2Cap")

license: other
tags:
- remote-sensing
- satellite-imagery
- aerial-imagery
- vision-language
- zero-shot-mapping
- contrastive-learning
- clip
- cvprw-2024
pipeline_tag: image-feature-extraction
library_name: transformers
arxiv: 2307.15904
Sat2Cap
Sat2Cap is a vision-language model for zero-shot mapping from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries.
This model is associated with the CVPRW 2024 paper:
Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs
Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.
Model Details
- Developed by: Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis
- Model type: Vision-language / cross-view representation model
- Primary modality: Overhead satellite or aerial imagery
- Output: Embeddings aligned with CLIP-style ground-level visual/textual representations
- Task: Zero-shot mapping with free-form text queries
- Paper: Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
- arXiv: 2307.15904
Intended Use
Sat2Cap can be used for research on:
- text-based mapping from overhead imagery
- weakly supervised remote-sensing representation learning
- cross-view learning between overhead and ground-level imagery
- geographic vision-language models
- retrieval or scoring of locations using free-form textual concepts
Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.
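
As a rough illustration of location retrieval, the sketch below ranks a handful of locations against a single text query by cosine similarity. The vectors are toy values standing in for real embeddings, and `rank_locations`, the tile ids, and the example query are hypothetical names chosen for this example, not part of the Sat2Cap codebase.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_locations(query_embedding, location_embeddings, top_k=3):
    """Return the top_k (location_id, score) pairs, highest similarity first.

    location_embeddings maps a hypothetical location id to the embedding
    predicted for the overhead image at that location.
    """
    scored = [
        (loc_id, cosine_similarity(query_embedding, emb))
        for loc_id, emb in location_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
query = [1.0, 0.0, 0.0]  # e.g. the text embedding of a query like "people skiing"
locations = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.0, 1.0, 0.0],
    "tile_c": [0.5, 0.5, 0.0],
}
ranking = rank_locations(query, locations, top_k=2)
```

In practice the query vector would come from a CLIP-style text encoder and the location vectors from the overhead model, so that both live in the same embedding space.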
Out-of-Scope Use
Sat2Cap should not be used as a standalone system for:
- safety-critical or emergency-response decisions
- legal, financial, insurance, or eligibility decisions
- surveillance or individual-level tracking
- definitive factual claims about a specific property, person, or event
- applications where errors in geographic inference could cause harm
The model predicts likely ground-level concepts from overhead imagery and learned correlations. It does not directly observe ground-level conditions at inference time.
How It Works
Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.
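
The comparison step can be sketched as follows, assuming the predicted embeddings for a grid of overhead tiles are already available as plain vectors. `similarity_map` is a hypothetical helper, and the min-max rescaling is one simple choice for visualization, not something prescribed by the paper.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_map(tile_grid, text_embedding):
    """Score every tile in a 2D grid against one text embedding.

    Returns a grid of the same shape, min-max rescaled to [0, 1] so it can be
    rendered as a heat map.
    """
    text = normalize(text_embedding)
    raw = [
        [sum(a * b for a, b in zip(normalize(tile), text)) for tile in row]
        for row in tile_grid
    ]
    flat = [score for row in raw for score in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid dividing by zero on a uniform map
    return [[(score - low) / span for score in row] for row in raw]


# Toy 2x2 grid of 2-dimensional "predicted embeddings" (assumption).
grid = [
    [[1.0, 0.0], [0.7, 0.7]],
    [[0.0, 1.0], [0.2, 0.9]],
]
heat = similarity_map(grid, text_embedding=[2.0, 0.0])
```

Because the map is rescaled per query, values are comparable across tiles for one query but not across different queries.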
The paper reports training on a large-scale weakly supervised dataset of 6.1M paired overhead and ground-level images. Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time.
Training Data
Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of 6.1M overhead/ground-level image pairs.
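
The model card tags list contrastive learning. One common way to align paired embeddings is an InfoNCE-style objective, sketched below in one direction (overhead to ground-level). This is a generic illustration under that assumption, not the paper's exact loss; `info_nce_loss` and the temperature of 0.07 are illustrative choices.

```python
import math


def info_nce_loss(overhead_embs, ground_embs, temperature=0.07):
    """Mean cross-entropy of matching each overhead embedding to its own pair.

    Both inputs are lists of unit-length vectors; row i of overhead_embs is
    paired with row i of ground_embs, and every other row acts as a negative.
    """
    losses = []
    for i, overhead in enumerate(overhead_embs):
        logits = [
            sum(a * b for a, b in zip(overhead, ground)) / temperature
            for ground in ground_embs
        ]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(logit - peak) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two unit vectors (assumption): perfectly aligned vs. swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce_loss(aligned, aligned)
loss_swapped = info_nce_loss(aligned, aligned[::-1])
```

The loss is near zero when each overhead embedding is closest to its own ground-level pair and grows large when pairs are mismatched, which is the behavior a contrastive objective rewards.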
Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.
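
One simple way to look for such imbalances, assuming sample coordinates are available, is to count samples per latitude/longitude cell. `coverage_by_cell`, the 10-degree cell size, and the coordinates below are illustrative assumptions.

```python
from collections import Counter
import math


def coverage_by_cell(coordinates, cell_degrees=10.0):
    """Count samples per latitude/longitude grid cell.

    coordinates is an iterable of (latitude, longitude) pairs in degrees.
    Each cell is identified by the coordinates of its south-west corner.
    """
    counts = Counter()
    for lat, lon in coordinates:
        cell = (
            math.floor(lat / cell_degrees) * cell_degrees,
            math.floor(lon / cell_degrees) * cell_degrees,
        )
        counts[cell] += 1
    return counts


# Made-up sample coordinates for illustration only.
samples = [(38.6, -90.2), (38.9, -90.5), (40.7, -74.0), (-1.3, 36.8)]
counts = coverage_by_cell(samples)
```

Cells defined in degrees do not have equal area, so counts near the poles and the equator are not directly comparable; an equal-area grid would be a better choice for a careful audit.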
Evaluation
The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.
Limitations and Biases
Sat2Cap has several important limitations:
- It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
- Predictions may be unreliable in regions underrepresented in the training data.
- Seasonal, cultural, economic, and geographic correlations may introduce bias.
- Fine-grained text queries may produce plausible but incorrect geographic patterns.
- Temporal behavior depends on the temporal coverage and metadata available during training and inference.
- The model should be validated on the target region and use case before deployment.
Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.
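
A minimal sketch of such a validation, assuming binary ground-truth labels exist for a set of locations in the target region: compute the average precision of the model's similarity scores for one query. `average_precision` is a hypothetical helper written for this example, and the scores and labels are made up.

```python
def average_precision(scores, labels):
    """Average precision of similarity scores against binary ground truth.

    scores: model similarity per location (higher means more relevant).
    labels: 1 if the concept is truly present at that location, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Made-up scores and labels for four locations.
scores = [0.9, 0.8, 0.4, 0.1]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
```

A low value on held-out locations from the target region is a signal that the query, the region, or both fall outside what the model handles reliably.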
License
This model is currently marked as research-use only, with a final license still pending.
Before assigning a standard open license to the model weights, please verify the licensing status of:
- the Bing Maps overhead imagery used during training
- the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
- any upstream CLIP or model initialization weights
- the intended redistribution rights for the trained checkpoint
Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.
Citation
If you use this model, please cite:
@inproceedings{dhakal2024sat2cap,
  title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
  author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  pages={533--542},
  year={2024},
  doi={10.1109/CVPRW63382.2024.00058}
}
Contact
For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:
- Website: https://mvrl.cse.wustl.edu/
- Hugging Face: https://huggingface.co/MVRL
- GitHub: https://github.com/mvrl