license: other
tags:
  - remote-sensing
  - satellite-imagery
  - aerial-imagery
  - vision-language
  - zero-shot-mapping
  - contrastive-learning
  - clip
  - cvprw-2024
pipeline_tag: image-feature-extraction
library_name: transformers
arxiv: 2307.15904

Sat2Cap

Sat2Cap is a vision-language model for zero-shot mapping from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries.

This model is associated with the CVPRW 2024 paper:

Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs

Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.

Model Details

Developed by: Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis
Model type: Vision-language / cross-view representation model
Primary modality: Overhead satellite or aerial imagery
Output: Embeddings aligned with CLIP-style ground-level visual/textual representations
Task: Zero-shot mapping with free-form text queries
Paper: Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images
arXiv: 2307.15904

Intended Use

Sat2Cap can be used for research on:

text-based mapping from overhead imagery
weakly supervised remote-sensing representation learning
cross-view learning between overhead and ground-level imagery
geographic vision-language models
retrieval or scoring of locations using free-form textual concepts

Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.

Out-of-Scope Use

Sat2Cap should not be used as a standalone system for:

safety-critical or emergency-response decisions
legal, financial, insurance, or eligibility decisions
surveillance or individual-level tracking
definitive factual claims about a specific property, person, or event
applications where errors in geographic inference could cause harm

The model predicts likely ground-level concepts from overhead imagery using learned correlations. It does not directly observe ground-level conditions at inference time.

How It Works

Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.
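The comparison step can be illustrated with a minimal sketch. It assumes the predicted overhead embeddings and the query text embedding have already been computed and live in the same space; the tile ids, the 3-dimensional toy vectors, and the helper names (`l2_normalize`, `cosine_similarity`, `score_tiles`) are hypothetical and are not part of any released Sat2Cap API. Plain Python is used to keep the example dependency-free; a real pipeline would more likely use an array library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def score_tiles(tile_embeddings, text_embedding):
    """Return one similarity score per overhead tile, keyed by tile id."""
    return {
        tile_id: cosine_similarity(emb, text_embedding)
        for tile_id, emb in tile_embeddings.items()
    }


# Toy stand-ins: real embeddings would come from the overhead encoder
# and a CLIP text encoder (both assumed, not shown here).
tile_embeddings = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.1, 0.9, 0.1],
    "tile_c": [0.5, 0.5, 0.0],
}
text_embedding = [1.0, 0.0, 0.0]

scores = score_tiles(tile_embeddings, text_embedding)
# Tiles ordered from most to least similar to the query.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Writing each score back to its tile's geographic position yields a map for the query, and the same predicted embeddings can be reused for any number of text queries without re-running the overhead encoder.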

The paper reports training on a large-scale weakly supervised dataset of 6.1M paired overhead and ground-level images. Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time.

Training Data

Sat2Cap is trained with weak supervision from paired overhead and ground-level imagery: Bing Maps overhead images and YFCC100M (Flickr) ground-level photos. The associated paper reports a dataset of 6.1M such pairs.
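This card tags the model with contrastive learning but does not spell out the training objective. As a rough illustration only, the sketch below shows a generic one-directional InfoNCE loss over paired embeddings, where the i-th overhead embedding should match the i-th ground-level embedding and all other items in the batch act as negatives. The function name `info_nce`, the temperature value, and the toy vectors are assumptions; please see the paper for the actual loss and hyperparameters.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(overhead_embs, ground_embs, temperature=0.07):
    """Mean InfoNCE loss; overhead_embs[i] and ground_embs[i] are the positive pair.

    Inputs are assumed to be L2-normalized. The temperature is an
    illustrative default, not a value taken from the Sat2Cap paper.
    """
    n = len(overhead_embs)
    total = 0.0
    for i in range(n):
        logits = [dot(overhead_embs[i], g) / temperature for g in ground_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / n


# Toy batch of two unit vectors.
batch = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(batch, batch)        # each item matches its pair
loss_swapped = info_nce(batch, batch[::-1])  # pairs deliberately mismatched
```

The loss is close to zero when each overhead embedding is most similar to its own ground-level partner and grows when the pairing is wrong, which is the pressure that pulls the two views of a location together.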

Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.

Evaluation

The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.

Limitations and Biases

Sat2Cap has several important limitations:

It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
Predictions may be unreliable in regions underrepresented in the training data.
Seasonal, cultural, economic, and geographic correlations may introduce bias.
Fine-grained text queries may produce plausible but incorrect geographic patterns.
Temporal behavior depends on the temporal coverage and metadata available during training and inference.
The model should be validated on the target region and use case before deployment.

Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.
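One simple way to validate on a target region is to check whether similarity scores for a query separate locations that are known to exhibit the concept from those that do not. The sketch below computes ROC AUC from scratch for that purpose. The scores, the labels, and the function name `roc_auc` are hypothetical, and the choice of AUC is an assumption rather than the paper's evaluation protocol.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels must be 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores for five locations and field-verified labels
# (1 = concept present on the ground, 0 = absent).
scores = [0.31, 0.12, 0.27, 0.05, 0.22]
labels = [1, 0, 0, 0, 1]

auc = roc_auc(scores, labels)
```

An AUC near 0.5 on locally collected labels would suggest the query is not reliable in that region, whatever the resulting map looks like.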

License

This model is currently released for research use only. No standard license has been assigned yet, which is why the model card metadata lists the license as "other".

Before assigning a standard open license to the model weights, please verify the licensing status of:

the Bing Maps overhead imagery used during training
the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
any upstream CLIP or model initialization weights
the intended redistribution rights for the trained checkpoint

Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.

Citation

If you use this model, please cite:

@inproceedings{dhakal2024sat2cap,
  title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
  author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  pages={533--542},
  year={2024},
  doi={10.1109/CVPRW63382.2024.00058}
}

Contact

For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:

Website: https://mvrl.cse.wustl.edu/
Hugging Face: https://huggingface.co/MVRL
GitHub: https://github.com/mvrl