---
license: other
tags:
- remote-sensing
- satellite-imagery
- aerial-imagery
- vision-language
- zero-shot-mapping
- contrastive-learning
- clip
- cvprw-2024
pipeline_tag: image-feature-extraction
library_name: transformers
arxiv: 2307.15904
---
# Sat2Cap
Sat2Cap is a vision-language model for **zero-shot mapping** from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries.
This model is associated with the CVPRW 2024 paper:
**Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images**
Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs
Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.
## Model Details
- **Developed by:** Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis
- **Model type:** Vision-language / cross-view representation model
- **Primary modality:** Overhead satellite or aerial imagery
- **Output:** Embeddings aligned with CLIP-style ground-level visual/textual representations
- **Task:** Zero-shot mapping with free-form text queries
- **Paper:** [Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images](https://doi.org/10.1109/CVPRW63382.2024.00058)
- **arXiv:** [2307.15904](https://arxiv.org/abs/2307.15904)
## Intended Use
Sat2Cap can be used for research on:
- text-based mapping from overhead imagery
- weakly supervised remote-sensing representation learning
- cross-view learning between overhead and ground-level imagery
- geographic vision-language models
- retrieval or scoring of locations using free-form textual concepts
Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.
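As a rough illustration of the retrieval use case, the sketch below ranks locations against a single text query. It assumes embeddings have already been computed and L2-normalized elsewhere; the tile names, the two-dimensional vectors, and the `top_k_locations` helper are hypothetical and are not part of any Sat2Cap API.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors (cosine similarity if both are unit-norm)."""
    return sum(x * y for x, y in zip(a, b))


def top_k_locations(location_embeddings, query_embedding, k=3):
    """Return the k (location_id, score) pairs with the highest similarity to the query.

    location_embeddings: dict mapping a location id to its (assumed unit-norm) embedding.
    query_embedding: (assumed unit-norm) text embedding in the same space.
    """
    scored = (
        (dot(emb, query_embedding), loc_id)
        for loc_id, emb in location_embeddings.items()
    )
    return [(loc_id, score) for score, loc_id in heapq.nlargest(k, scored)]


# Toy, hand-written embeddings standing in for real model outputs (assumption).
location_embeddings = {
    "tile_a": [1.0, 0.0],
    "tile_b": [0.6, 0.8],
    "tile_c": [0.0, 1.0],
}
query_embedding = [0.0, 1.0]

ranked = top_k_locations(location_embeddings, query_embedding, k=2)
print(ranked)
```

In practice the embeddings would have the dimensionality of the CLIP space the model targets, and the ranking would typically run over many more tiles.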
## Out-of-Scope Use
Sat2Cap should not be used as a standalone system for:
- safety-critical or emergency-response decisions
- legal, financial, insurance, or eligibility decisions
- surveillance or individual-level tracking
- definitive factual claims about a specific property, person, or event
- applications where errors in geographic inference could cause harm
The model predicts likely ground-level concepts from overhead imagery using correlations learned during training. It does not directly observe ground-level conditions at inference time.
## How It Works
Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.
The paper reports training on a large-scale weakly supervised dataset of **6.1M paired overhead and ground-level images**. Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time.
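The comparison step described above can be sketched as a cosine-similarity map over a grid of overhead tiles. This is a minimal illustration and not the released inference code: the random vectors stand in for predicted overhead embeddings and a CLIP text embedding, and the grid size, embedding dimension, and function names are assumptions made for the example.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_map(tile_embeddings, text_embedding):
    """Score every tile in a rows x cols grid of embeddings against one text embedding."""
    return [
        [cosine_similarity(emb, text_embedding) for emb in row]
        for row in tile_embeddings
    ]


# Placeholder data (assumption): a 3 x 4 grid of tiles with 8-dimensional embeddings.
random.seed(0)
rows, cols, dim = 3, 4, 8
tile_embeddings = [
    [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(cols)]
    for _ in range(rows)
]
text_embedding = [random.gauss(0.0, 1.0) for _ in range(dim)]

heatmap = similarity_map(tile_embeddings, text_embedding)
for row in heatmap:
    print(" ".join(f"{score:+.2f}" for score in row))
```

Each cell of `heatmap` is a score in [-1, 1]; higher values mark tiles whose predicted embedding lies closer to the query text in the shared space.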
## Training Data
Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of **6.1M overhead/ground-level image pairs**.
Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.
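One simple way to look for the geographic imbalance mentioned above is to count samples per latitude/longitude grid cell. The sketch below is illustrative only: the coordinates are made up, the 10-degree cell size is arbitrary, and the helper names are hypothetical rather than taken from the Sat2Cap codebase.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=10.0):
    """Map a coordinate to the integer index of its cell_deg x cell_deg grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def coverage_counts(coords, cell_deg=10.0):
    """Count how many (lat, lon) samples fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in coords)


# Made-up sample locations (assumption), not drawn from the training set.
coords = [(38.6, -90.2), (38.7, -90.3), (40.7, -74.0), (27.7, 85.3)]

counts = coverage_counts(coords)
for cell, n in counts.most_common():
    print(cell, n)
```

Cells with very few samples relative to others are candidates for the extra validation recommended before using the model in that region.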
## Evaluation
The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.
## Limitations and Biases
Sat2Cap has several important limitations:
- It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
- Predictions may be unreliable in regions underrepresented in the training data.
- Seasonal, cultural, economic, and geographic correlations may introduce bias.
- Fine-grained text queries may produce plausible but incorrect geographic patterns.
- Temporal behavior depends on the temporal coverage and metadata available during training and inference.
- The model should be validated on the target region and use case before deployment.
Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.
## License
This model is currently marked as **research-use only / license pending**.
Before assigning a standard open license to the model weights, please verify the licensing status of:
- the Bing Maps overhead imagery used during training
- the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
- any upstream CLIP or model initialization weights
- the intended redistribution rights for the trained checkpoint
Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{dhakal2024sat2cap,
title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
pages={533--542},
year={2024},
doi={10.1109/CVPRW63382.2024.00058}
}
```
## Contact
For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:
- Website: https://mvrl.cse.wustl.edu/
- Hugging Face: https://huggingface.co/MVRL
- GitHub: https://github.com/mvrl