How to use MVRL/Sat2Cap with Transformers:

```python
# Load model directly
from transformers import AutoTokenizer, CLIPVisionModelWithProjection

tokenizer = AutoTokenizer.from_pretrained("MVRL/Sat2Cap")
model = CLIPVisionModelWithProjection.from_pretrained("MVRL/Sat2Cap")
```
| license: other | |
| tags: | |
| - remote-sensing | |
| - satellite-imagery | |
| - aerial-imagery | |
| - vision-language | |
| - zero-shot-mapping | |
| - contrastive-learning | |
| - clip | |
| - cvprw-2024 | |
| pipeline_tag: image-feature-extraction | |
| library_name: transformers | |
| arxiv: 2307.15904 | |
| # Sat2Cap | |
| Sat2Cap is a vision-language model for **zero-shot mapping** from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries. | |
| This model is associated with the CVPRW 2024 paper: | |
| **Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images** | |
| Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs | |
| Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text. | |
| ## Model Details | |
| - **Developed by:** Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis | |
| - **Model type:** Vision-language / cross-view representation model | |
| - **Primary modality:** Overhead satellite or aerial imagery | |
| - **Output:** Embeddings aligned with CLIP-style ground-level visual/textual representations | |
| - **Task:** Zero-shot mapping with free-form text queries | |
| - **Paper:** [Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images](https://doi.org/10.1109/CVPRW63382.2024.00058) | |
| - **arXiv:** [2307.15904](https://arxiv.org/abs/2307.15904) | |
| ## Intended Use | |
| Sat2Cap can be used for research on: | |
| - text-based mapping from overhead imagery | |
| - weakly supervised remote-sensing representation learning | |
| - cross-view learning between overhead and ground-level imagery | |
| - geographic vision-language models | |
| - retrieval or scoring of locations using free-form textual concepts | |
| Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance. | |
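To illustrate the mapping use case, here is a minimal sketch of turning per-tile query scores into a normalized grid that could be rendered as a heatmap. It assumes the similarity scores for one text query have already been computed elsewhere; the function name `build_score_map`, the `(row, col)` tile indexing, and the sample values are hypothetical and not part of the Sat2Cap release.

```python
from typing import Dict, List, Optional, Tuple


def build_score_map(
    tile_scores: Dict[Tuple[int, int], float],
    n_rows: int,
    n_cols: int,
) -> List[List[Optional[float]]]:
    """Place per-tile scores on a grid and rescale them to [0, 1].

    Cells without a score stay None so that missing coverage is not
    mistaken for a low score.
    """
    if not tile_scores:
        raise ValueError("tile_scores is empty")
    low, high = min(tile_scores.values()), max(tile_scores.values())
    span = high - low
    grid: List[List[Optional[float]]] = [[None] * n_cols for _ in range(n_rows)]
    for (row, col), score in tile_scores.items():
        # A constant map carries no contrast, so every tile gets the midpoint.
        grid[row][col] = 0.5 if span == 0.0 else (score - low) / span
    return grid


# Hypothetical similarity scores for one text query over a 2 x 3 tile grid.
scores = {(0, 0): 0.10, (0, 1): 0.30, (0, 2): 0.20, (1, 0): 0.25, (1, 2): 0.15}
score_map = build_score_map(scores, n_rows=2, n_cols=3)
```

Min-max scaling makes a single map easy to read, but it also removes the absolute scale, so maps for different queries should not be compared cell by cell after this step.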
| ## Out-of-Scope Use | |
| Sat2Cap should not be used as a standalone system for: | |
| - safety-critical or emergency-response decisions | |
| - legal, financial, insurance, or eligibility decisions | |
| - surveillance or individual-level tracking | |
| - definitive factual claims about a specific property, person, or event | |
| - applications where errors in geographic inference could cause harm | |
| The model predicts likely ground-level concepts from overhead imagery and learned correlations. It does not directly observe ground-level conditions at inference time. | |
| ## How It Works | |
| Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping. | |
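The comparison step can be sketched with plain cosine similarity. The example below uses toy three-dimensional vectors in place of real embeddings; in practice the location vectors would come from the Sat2Cap image encoder and the query vector from a CLIP text encoder. The helper names `l2_normalize` and `rank_locations` and the tile identifiers are hypothetical, and this is an illustration of the idea rather than the authors' inference code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_locations(
    location_embeds: Dict[str, Sequence[float]],
    text_embed: Sequence[float],
) -> List[Tuple[str, float]]:
    """Return (location_id, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embed)
    scores = []
    for loc_id, embed in location_embeds.items():
        unit = l2_normalize(embed)
        scores.append((loc_id, sum(a * b for a, b in zip(unit, query))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for predicted embeddings (hypothetical values).
predicted = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.0, 1.0, 0.0],
    "tile_c": [0.5, 0.5, 0.0],
}
query_embed = [1.0, 0.0, 0.0]  # stands in for the text embedding of a query

ranking = rank_locations(predicted, query_embed)
```

Because both sides are normalized, the score depends only on the angle between the vectors, which is the usual convention for CLIP-style embeddings.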
| The paper reports training on a large-scale weakly supervised dataset of **6.1M paired overhead and ground-level images**. Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time. | |
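This card does not say how temporal information is represented, so the following is only a generic sketch of one common option: encoding month and hour cyclically with sine and cosine so that adjacent times stay close in feature space. The function `encode_time` and the choice of month and hour as inputs are assumptions made for illustration, not a description of Sat2Cap's actual input format.

```python
import math
from datetime import datetime
from typing import List


def encode_time(timestamp: datetime) -> List[float]:
    """Encode month and hour as sin/cos pairs.

    The cyclic form keeps December next to January and 23:00 next to 00:00,
    which a raw integer encoding would not.
    """
    month_angle = 2.0 * math.pi * (timestamp.month - 1) / 12.0
    hour_angle = 2.0 * math.pi * timestamp.hour / 24.0
    return [
        math.sin(month_angle), math.cos(month_angle),
        math.sin(hour_angle), math.cos(hour_angle),
    ]


# Hypothetical capture times used only to exercise the function.
january = encode_time(datetime(2024, 1, 15, 6, 0))
december = encode_time(datetime(2023, 12, 15, 6, 0))
july = encode_time(datetime(2023, 7, 15, 6, 0))
```

With this encoding, the December vector lies closer to the January vector than the July vector does, which is the property that lets a model treat seasons as a cycle.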
| ## Training Data | |
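| Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of **6.1M overhead/ground-level image pairs**; as noted in the License section, the overhead imagery comes from Bing Maps and the ground-level imagery from YFCC100M/Flickr. | |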
| Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data. | |
| ## Evaluation | |
| The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples. | |
| ## Limitations and Biases | |
| Sat2Cap has several important limitations: | |
| - It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions. | |
| - Predictions may be unreliable in regions underrepresented in the training data. | |
| - Seasonal, cultural, economic, and geographic correlations may introduce bias. | |
| - Fine-grained text queries may produce plausible but incorrect geographic patterns. | |
| - Temporal behavior depends on the temporal coverage and metadata available during training and inference. | |
| - The model should be validated on the target region and use case before deployment. | |
| Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations. | |
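One way to act on the validation advice above is to check how well query scores rank a small set of locations with known ground truth in the target region. The sketch below computes ROC AUC directly from its pairwise definition; the function name `roc_auc`, the scores, and the labels are hypothetical, and the choice of AUC as the metric is an assumption rather than the evaluation protocol used in the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive location outscores a random negative.

    Ties count as half. A value of 0.5 is chance level and 1.0 is a perfect
    ranking. The pairwise loop is quadratic, which is fine for small
    validation sets.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical query scores for six held-out locations in the target region.
# A label of 1 means the concept was confirmed on the ground, 0 means it was not.
region_scores = [0.31, 0.27, 0.12, 0.25, 0.08, 0.05]
region_labels = [1, 1, 0, 0, 1, 0]
auc = roc_auc(region_scores, region_labels)
```

An AUC near 0.5 on a held-out regional sample would suggest that the query does not transfer to that region and that the resulting map should not be relied on.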
| ## License | |
| This model is currently marked as **research-use only / license pending**. | |
| Before assigning a standard open license to the model weights, please verify the licensing status of: | |
| - the Bing Maps overhead imagery used during training | |
| - the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses | |
| - any upstream CLIP or model initialization weights | |
| - the intended redistribution rights for the trained checkpoint | |
| Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights. | |
| ## Citation | |
| If you use this model, please cite: | |
| ```bibtex | |
| @inproceedings{dhakal2024sat2cap, | |
| title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images}, | |
| author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan}, | |
| booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops}, | |
| pages={533--542}, | |
| year={2024}, | |
| doi={10.1109/CVPRW63382.2024.00058} | |
| } | |
| ``` | |
| ## Contact | |
| For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory: | |
| - Website: https://mvrl.cse.wustl.edu/ | |
| - Hugging Face: https://huggingface.co/MVRL | |
| - GitHub: https://github.com/mvrl |