---
license: other
tags:
  - remote-sensing
  - satellite-imagery
  - aerial-imagery
  - vision-language
  - zero-shot-mapping
  - contrastive-learning
  - clip
  - cvprw-2024
pipeline_tag: image-feature-extraction
library_name: transformers
arxiv: 2307.15904
---

# Sat2Cap

Sat2Cap is a vision-language model for **zero-shot mapping** from overhead imagery. Given a satellite or aerial image, Sat2Cap predicts an embedding aligned with CLIP-style representations of ground-level images and text, so that maps can be built from free-form natural-language queries.

This model is associated with the CVPRW 2024 paper:

**Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images**  
Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs

Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.

## Model Details

- **Developed by:** Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis
- **Model type:** Vision-language / cross-view representation model
- **Primary modality:** Overhead satellite or aerial imagery
- **Output:** Embeddings aligned with CLIP-style ground-level visual/textual representations
- **Task:** Zero-shot mapping with free-form text queries
- **Paper:** [Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images](https://doi.org/10.1109/CVPRW63382.2024.00058)
- **arXiv:** [2307.15904](https://arxiv.org/abs/2307.15904)

## Intended Use

Sat2Cap can be used for research on:

- text-based mapping from overhead imagery
- weakly supervised remote-sensing representation learning
- cross-view learning between overhead and ground-level imagery
- geographic vision-language models
- retrieval or scoring of locations using free-form textual concepts
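
The retrieval use case in the list above can be sketched as a top-k search over precomputed embeddings. This is a minimal illustration and not the released Sat2Cap API: `top_k_locations`, the dictionary layout, the coordinates, and the 2-D vectors are all hypothetical, and both the location embeddings and the query embedding are assumed to be unit-normalized already.

```python
import heapq

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def top_k_locations(location_embeddings, query_embedding, k=3):
    """Return the k (location, score) pairs most similar to the query.

    location_embeddings: dict mapping (lat, lon) -> unit-length predicted embedding
                         (hypothetical layout, not a Sat2Cap data format).
    query_embedding: unit-length text embedding of the query.
    """
    scored = (
        (loc, dot(emb, query_embedding))
        for loc, emb in location_embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-D embeddings and made-up coordinates, for illustration only.
locations = {
    (38.63, -90.20): [1.0, 0.0],
    (40.00, -105.00): [0.6, 0.8],
    (25.00, -80.00): [0.0, 1.0],
}
query = [0.0, 1.0]
top_two = top_k_locations(locations, query, k=2)
```

In practice the embeddings would come from the overhead encoder and a matching CLIP text encoder, and a vectorized library would replace the pure-Python loops.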

Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.

## Out-of-Scope Use

Sat2Cap should not be used as a standalone system for:

- safety-critical or emergency-response decisions
- legal, financial, insurance, or eligibility decisions
- surveillance or individual-level tracking
- definitive factual claims about a specific property, person, or event
- applications where errors in geographic inference could cause harm

The model infers likely ground-level concepts from overhead imagery using correlations learned during training. It does not directly observe ground-level conditions at inference time.

## How It Works

Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.
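
The comparison step can be sketched as a cosine-similarity map over a grid of predicted embeddings. This is a minimal sketch under stated assumptions: `similarity_map` and the rows-by-columns grid layout are hypothetical names chosen for illustration, the 3-D vectors stand in for real CLIP embeddings, and the text embedding is assumed to come from the CLIP text encoder that matches the model's embedding space.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def similarity_map(cell_embeddings, text_embedding):
    """Return one cosine similarity per grid cell.

    cell_embeddings: rows x cols grid of predicted embeddings (hypothetical layout).
    text_embedding: embedding of the query text.
    """
    query = l2_normalize(text_embedding)
    return [
        [sum(a * b for a, b in zip(l2_normalize(cell), query)) for cell in row]
        for row in cell_embeddings
    ]

# Toy 2x2 grid with 3-D embeddings; real CLIP embeddings are much larger.
grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
]
text_query = [2.0, 0.0, 0.0]
scores = similarity_map(grid, text_query)
```

Each entry of `scores` lies in [-1, 1]; higher values mark cells whose predicted ground-level embedding is closer to the query text.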

The paper reports training on a large-scale, weakly supervised dataset of **6.1M paired overhead and ground-level images**. Sat2Cap can also incorporate temporal information, which lets it model concepts that vary over time, such as seasonal activity.
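
One common way to feed temporal information to a model is a cyclic sine/cosine encoding of month and hour. The sketch below illustrates that general technique only; it is an assumption and not a description of Sat2Cap's actual temporal input format, and `cyclic_time_features` is a hypothetical helper.

```python
import math
from datetime import datetime

def cyclic_time_features(timestamp):
    """Encode month and time of day as points on a circle.

    Assumption: this is a generic encoding, not the one documented for Sat2Cap.
    The cyclic form keeps December close to January and 23:00 close to 00:00.
    """
    month_angle = 2 * math.pi * (timestamp.month - 1) / 12
    hour_angle = 2 * math.pi * (timestamp.hour + timestamp.minute / 60) / 24
    return [
        math.sin(month_angle),
        math.cos(month_angle),
        math.sin(hour_angle),
        math.cos(hour_angle),
    ]

winter_midnight = cyclic_time_features(datetime(2024, 1, 15, 0, 0))
summer_noon = cyclic_time_features(datetime(2024, 7, 1, 12, 0))
```

Querying the same location with different timestamps would then yield different predicted embeddings, which is what makes time-varying maps possible.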

## Training Data

Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of **6.1M overhead/ground-level image pairs**.
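
Training on paired images of this kind is often done with a contrastive, InfoNCE-style objective, in line with this card's `contrastive-learning` tag. The following is a simplified, one-directional sketch for intuition only: the exact loss, temperature, and batching used in the paper may differ, and `info_nce` is a hypothetical function that assumes unit-normalized embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(overhead, ground, temperature=0.07):
    """Mean cross-entropy of matching each overhead embedding to its paired
    ground-level embedding, with the other pairs in the batch as negatives.

    overhead, ground: equal-length lists of unit-length embeddings, where
    overhead[i] and ground[i] come from the same location.
    temperature: illustrative value, not taken from the paper.
    """
    n = len(overhead)
    total = 0.0
    for i in range(n):
        logits = [dot(overhead[i], ground[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy batch of two pairs with 2-D embeddings.
overhead_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(overhead_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(overhead_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each overhead embedding matches its own ground-level partner and grows when the pairs are mismatched, which is the signal that pulls the two views together.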

Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.

## Evaluation

The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.

## Limitations and Biases

Sat2Cap has several important limitations:

- It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
- Predictions may be unreliable in regions underrepresented in the training data.
- Seasonal, cultural, economic, and geographic correlations may introduce bias.
- Fine-grained text queries may produce plausible but incorrect geographic patterns.
- Temporal behavior depends on the temporal coverage and metadata available during training and inference.
- The model should be validated on the target region and use case before deployment.

Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.

## License

This model is currently marked as **research-use only / license pending**.

Before assigning a standard open license to the model weights, please verify the licensing status of:

- the Bing Maps overhead imagery used during training
- the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
- any upstream CLIP or model initialization weights
- the intended redistribution rights for the trained checkpoint

Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{dhakal2024sat2cap,
  title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
  author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  pages={533--542},
  year={2024},
  doi={10.1109/CVPRW63382.2024.00058}
}
```

## Contact

For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:

- Website: https://mvrl.cse.wustl.edu/
- Hugging Face: https://huggingface.co/MVRL
- GitHub: https://github.com/mvrl