	---
	license: other
	tags:
	- remote-sensing
	- satellite-imagery
	- aerial-imagery
	- vision-language
	- zero-shot-mapping
	- contrastive-learning
	- clip
	- cvprw-2024
	pipeline_tag: image-feature-extraction
	library_name: transformers
	arxiv: 2307.15904
	---

	# Sat2Cap

	Sat2Cap is a vision-language model for zero-shot mapping from overhead imagery. Given satellite or aerial imagery, Sat2Cap predicts representations aligned with ground-level visual/textual concepts, enabling downstream mapping with free-form natural language queries.

	This model accompanies the CVPRW 2024 paper "Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images" by Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, and Nathan Jacobs.

	Sat2Cap is designed for mapping concepts that are difficult to express as a fixed set of land-cover or object classes. Instead of training a separate classifier for every attribute, the model learns a shared representation that can be queried using text.

	## Model Details

	- Developed by: Multimodal Vision Research Laboratory (MVRL), Washington University in St. Louis
	- Model type: Vision-language / cross-view representation model
	- Primary modality: Overhead satellite or aerial imagery
	- Output: Embeddings aligned with CLIP-style ground-level visual/textual representations
	- Task: Zero-shot mapping with free-form text queries
	- Paper: [Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images](https://doi.org/10.1109/CVPRW63382.2024.00058)
	- arXiv: [2307.15904](https://arxiv.org/abs/2307.15904)

	## Intended Use

	Sat2Cap can be used for research on:

	- text-based mapping from overhead imagery
	- weakly supervised remote-sensing representation learning
	- cross-view learning between overhead and ground-level imagery
	- geographic vision-language models
	- retrieval or scoring of locations using free-form textual concepts

	Example queries might include concepts such as seasonal activity, land use, visible human activity, scene ambience, or ground-level attributes that may be correlated with overhead appearance.

	## Out-of-Scope Use

	Sat2Cap should not be used as a standalone system for:

	- safety-critical or emergency-response decisions
	- legal, financial, insurance, or eligibility decisions
	- surveillance or individual-level tracking
	- definitive factual claims about a specific property, person, or event
	- applications where errors in geographic inference could cause harm

	The model predicts likely ground-level concepts from overhead imagery and learned correlations. It does not directly observe ground-level conditions at inference time.

	## How It Works

	Sat2Cap learns from paired overhead and ground-level imagery. For a given location and overhead image, the model predicts the expected CLIP embedding of the associated ground-level scenery. These predicted embeddings can then be compared with text embeddings to support free-form textual mapping.
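
The comparison step can be sketched as a cosine-similarity ranking. The example below is a minimal illustration, not the released inference code: the tile identifiers, the 3-dimensional toy vectors, and the helper names `l2_normalize` and `cosine_scores` are all hypothetical. In practice the location embeddings would come from Sat2Cap and the query embedding from a CLIP-style text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(location_embeddings, text_embedding):
    """Return {location_id: cosine similarity to the text query}."""
    query = l2_normalize(text_embedding)
    scores = {}
    for location_id, embedding in location_embeddings.items():
        unit = l2_normalize(embedding)
        scores[location_id] = sum(a * b for a, b in zip(unit, query))
    return scores


# Toy stand-ins for predicted embeddings of three overhead tiles (hypothetical values).
location_embeddings = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.1, 0.9, 0.1],
    "tile_c": [0.5, 0.5, 0.0],
}
# Toy stand-in for the embedding of a free-form text query (hypothetical values).
text_embedding = [1.0, 0.0, 0.0]

scores = cosine_scores(location_embeddings, text_embedding)
# Sort tile ids from most to least similar to the query.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Writing each score back to its tile's geographic position yields a map for that query.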

	The paper reports training on a large-scale weakly supervised dataset of 6.1M paired overhead and ground-level images. Sat2Cap can also incorporate temporal information, allowing it to model concepts that vary over time.
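
This card does not describe how temporal information is encoded. As one common option, and purely as an assumption for illustration, timestamps can be mapped to sine/cosine pairs so that December sits next to January and 23:00 sits next to 00:00. The function name `cyclical_time_features` and the choice of month and hour as inputs are hypothetical; consult the paper for the encoding Sat2Cap actually uses.

```python
import math
from datetime import datetime


def cyclical_time_features(timestamp):
    """Encode month and time of day as points on two unit circles."""
    # Months 1..12 are mapped to angles 0, 30, ..., 330 degrees.
    month_angle = 2.0 * math.pi * (timestamp.month - 1) / 12.0
    # Fractional hour of the day is mapped onto a 24-hour circle.
    hour_angle = 2.0 * math.pi * (timestamp.hour + timestamp.minute / 60.0) / 24.0
    return [
        math.sin(month_angle),
        math.cos(month_angle),
        math.sin(hour_angle),
        math.cos(hour_angle),
    ]


# January at midnight sits at angle 0 on both circles: [0.0, 1.0, 0.0, 1.0].
january_midnight = cyclical_time_features(datetime(2021, 1, 1, 0, 0))
print(january_midnight)
```

A vector like this could be concatenated with other metadata before being passed to a model that conditions on time.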

	## Training Data

	Sat2Cap is trained using weak supervision from paired overhead and ground-level imagery. The associated paper reports a dataset of 6.1M overhead/ground-level image pairs.
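
The card tags the model with contrastive learning but does not spell out the training objective. As an illustration only, the sketch below implements a generic one-directional InfoNCE loss that pulls each overhead embedding toward the embedding of its paired ground-level image and away from the other ground-level embeddings in the batch. The function name `info_nce_loss`, the temperature of 0.07, and the toy 2-dimensional vectors are assumptions, not values taken from the paper.

```python
import math


def info_nce_loss(overhead_embs, ground_embs, temperature=0.07):
    """Mean cross-entropy of matching overhead embedding i to ground embedding i."""
    if not overhead_embs or len(overhead_embs) != len(ground_embs):
        raise ValueError("expected two non-empty batches of equal size")

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    overhead = [normalize(v) for v in overhead_embs]
    ground = [normalize(v) for v in ground_embs]

    total = 0.0
    for i, anchor in enumerate(overhead):
        # Temperature-scaled cosine similarity to every ground embedding in the batch.
        logits = [sum(a * b for a, b in zip(anchor, g)) / temperature for g in ground]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true pair (index i).
        total += log_denominator - logits[i]
    return total / len(overhead)


# Correctly paired batch versus the same batch with the pairs swapped (toy values).
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is small when each overhead embedding is closest to its own pair and large when the pairs are mismatched, which is the behavior a contrastive objective rewards.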

	Because the model learns from naturally collected imagery, its behavior can reflect geographic coverage patterns, temporal sampling bias, camera/platform bias, and regional imbalances present in the training data.

	## Evaluation

	The Sat2Cap paper evaluates the model's ability to capture ground-level concepts and support large-scale mapping of fine-grained textual queries. Please see the paper for the full experimental protocol, baselines, metrics, and qualitative examples.

	## Limitations and Biases

	Sat2Cap has several important limitations:

	- It infers likely ground-level concepts from overhead imagery rather than directly observing ground-level conditions.
	- Predictions may be unreliable in regions underrepresented in the training data.
	- Seasonal, cultural, economic, and geographic correlations may introduce bias.
	- Fine-grained text queries may produce plausible but incorrect geographic patterns.
	- Temporal behavior depends on the temporal coverage and metadata available during training and inference.
	- The model should be validated on the target region and use case before deployment.

	Users should treat Sat2Cap outputs as research signals or hypotheses, not as authoritative observations.
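
One way to act on the validation advice above is to compare query scores against a small set of independently verified locations in the target region. The sketch below computes ROC AUC by pairwise comparison. It is a minimal example under stated assumptions: the scores and binary reference labels are invented for illustration, and `roc_auc` is a hypothetical helper rather than part of any Sat2Cap release.

```python
def roc_auc(scores, labels):
    """Probability that a random positive location outscores a random negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores for four locations and their verified labels
# (1 = the queried concept is present on the ground, 0 = it is absent).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]

auc = roc_auc(scores, labels)
print(auc)  # 0.75
```

An AUC near 0.5 on the target region would suggest that the query scores carry little usable signal there.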

	## License

	This model is currently marked research-use only; a final license is pending.

	Before assigning a standard open license to the model weights, please verify the licensing status of:

	- the Bing Maps overhead imagery used during training
	- the YFCC100M/Flickr ground-level imagery and its per-image Creative Commons licenses
	- any upstream CLIP or model initialization weights
	- the intended redistribution rights for the trained checkpoint

	Because the training data includes third-party imagery with its own terms, users should not assume that this checkpoint is approved for commercial use, redistribution, or deployment in production systems unless a separate license explicitly grants those rights.

	## Citation

	If you use this model, please cite:

	```bibtex
	@inproceedings{dhakal2024sat2cap,
	title={Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images},
	author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
	pages={533--542},
	year={2024},
	doi={10.1109/CVPRW63382.2024.00058}
	}
	```

	## Contact

	For questions, issues, or collaboration inquiries, please contact the Multimodal Vision Research Laboratory:

	- Website: https://mvrl.cse.wustl.edu/
	- Hugging Face: https://huggingface.co/MVRL
	- GitHub: https://github.com/mvrl