---
license: other
tags:
- remote-sensing
- satellite-imagery
- aerial-imagery
- vision-language
- zero-shot-mapping
- contrastive-learning
- clip
- cvprw-2024
pipeline_tag: image-feature-extraction
library_name: transformers
arxiv: 2307.15904
---

# Sat2Cap

Sat2Cap is a vision-language model for **zero-shot mapping** from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries.

This model is associated with the CVPRW 2024 paper:

**Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images**

Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs

Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.

## Model Details

- **Developed by:** Multimodal Vision Research Laboratory (MVRL), Washington University in St.
Louis
- **Model type:** Vision-language / cross-view representation model
- **Primary modality:** Overhead satellite or aerial imagery
- **Output:** Embeddings aligned with CLIP-style ground-level visual/textual representations
- **Task:** Zero-shot mapping with free-form text queries
- **Paper:** [Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images](https://doi.org/10.1109/CVPRW63382.2024.00058)
- **arXiv:** [2307.15904](https://arxiv.org/abs/2307.15904)

## Intended Use

Sat2Cap can be used for research on:

- text-based mapping from overhead imagery
- weakly supervised remote-sensing representation learning
- cross-view learning between overhead and ground-level imagery
- geographic vision-language models
- retrieval or scoring of locations using free-form textual concepts

Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.

## Out-of-Scope Use

Sat2Cap should not be used as a standalone system for:

- safety-critical or emergency-response decisions
- legal, financial, insurance, or eligibility decisions
- surveillance or individual-level tracking
- definitive factual claims about a specific property, person, or event
- applications where errors in geographic inference could cause harm

The model predicts likely ground-level concepts from overhead imagery and learned correlations. It does not directly observe ground-level conditions at inference time.

## How It Works

Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.

The paper reports training on a large-scale weakly supervised dataset of **6.1M paired overhead and ground-level images**.
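
The comparison between predicted embeddings and text embeddings can be illustrated with a minimal sketch. It is not the released Sat2Cap inference code: the function names `cosine_similarity` and `rank_locations` are hypothetical, and the three-dimensional vectors are toy stand-ins for embeddings that would, in practice, come from the Sat2Cap image encoder and a CLIP text encoder. Cosine similarity is assumed here because it is the usual scoring rule for CLIP-style embeddings. A real pipeline would likely use NumPy or PyTorch rather than plain Python lists.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_locations(
    location_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (location index, similarity) pairs, highest similarity first."""
    scores = [
        (index, cosine_similarity(embedding, text_embedding))
        for index, embedding in enumerate(location_embeddings)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy stand-ins (assumption): one predicted embedding per overhead image tile.
location_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.2, 0.97],
]
# Toy stand-in (assumption): the text embedding of a free-form query.
text_embedding = [0.0, 0.9, 0.1]

ranked = rank_locations(location_embeddings, text_embedding, top_k=2)
for index, score in ranked:
    print(f"location {index}: similarity {score:.3f}")
```

Scoring every tile in a region this way and plotting the similarities by tile location yields a map for the query, which is the sense in which the mapping is zero-shot: no classifier is trained for the queried concept.
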
Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time.

## Training Data

Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of **6.1M overhead/ground-level image pairs**.

Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.

## Evaluation

The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.

## Limitations and Biases

Sat2Cap has several important limitations:

- It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
- Predictions may be unreliable in regions underrepresented in the training data.
- Seasonal, cultural, economic, and geographic correlations may introduce bias.
- Fine-grained text queries may produce plausible but incorrect geographic patterns.
- Temporal behavior depends on the temporal coverage and metadata available during training and inference.
- The model should be validated on the target region and use case before deployment.

Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.

## License

This model is currently marked as **research-use only / license pending**.
Before assigning a standard open license to the model weights, please verify the licensing status of:

- the Bing Maps overhead imagery used during training
- the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
- any upstream CLIP or model initialization weights
- the intended redistribution rights for the trained checkpoint

Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{dhakal2024sat2cap,
  title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
  author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  pages={533--542},
  year={2024},
  doi={10.1109/CVPRW63382.2024.00058}
}
```

## Contact

For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:

- Website: https://mvrl.cse.wustl.edu/
- Hugging Face: https://huggingface.co/MVRL
- GitHub: https://github.com/mvrl