This repository provides a Hugging Face-format vision encoder (ViT-L-14) port of the original SkyCLIP model.
SkyCLIP is a vision-language foundation model for remote sensing, trained on a large-scale image-text dataset collected for aerial and satellite imagery. This repository provides only the ViT-L-14 vision encoder, converted from the original timm checkpoint to the Hugging Face CLIPVisionModel format.
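As a rough illustration of how a checkpoint in CLIPVisionModel format is typically used, the sketch below extracts one pooled, L2-normalized feature vector per image. The helper name encode_images, the repository id placeholder, and the tiny randomly initialized stand-in model are assumptions made for illustration and are not part of this repository. CLIPVisionModel returns pooled hidden states without CLIP's projection layer, so these features may not lie in the joint image-text embedding space.

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModel


def encode_images(model: CLIPVisionModel, pixel_values: torch.Tensor) -> torch.Tensor:
    """Return one L2-normalized pooled feature vector per image (hypothetical helper)."""
    model.eval()
    with torch.no_grad():
        outputs = model(pixel_values=pixel_values)
    pooled = outputs.pooler_output  # shape: (batch, hidden_size)
    return torch.nn.functional.normalize(pooled, dim=-1)


# With the real weights, the model would be loaded roughly like this.
# The repository id is a placeholder, not the actual name:
# model = CLIPVisionModel.from_pretrained("<this-repo-id>")

# Offline stand-in: a tiny, randomly initialized model (sizes are assumptions,
# chosen only so the example runs without downloading anything).
tiny_config = CLIPVisionConfig(
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    image_size=224,
    patch_size=14,
)
tiny_model = CLIPVisionModel(tiny_config)

# Two random "images" in place of preprocessed pixel values.
features = encode_images(tiny_model, torch.randn(2, 3, 224, 224))
print(features.shape)  # torch.Size([2, 32])
```

With the actual ViT-L-14 weights, the same helper should apply unchanged; only the feature dimension would differ from the stand-in.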
The recommended image transforms are as follows:
Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=True),
    CenterCrop(size=(224, 224)),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
If you use the SkyCLIP model or dataset, please cite the original work:
@article{wangSkyScriptLargeSemantically2024,
title = {{{SkyScript}}: {{A Large}} and {{Semantically Diverse Vision-Language Dataset}} for {{Remote Sensing}}},
shorttitle = {{{SkyScript}}},
author = {Wang, Zhecheng and Prabha, Rajanie and Huang, Tianyuan and Wu, Jiajun and Rajagopal, Ram},
year = 2024,
month = mar,
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {38},
number = {6},
pages = {5805--5813},
issn = {2374-3468},
doi = {10.1609/aaai.v38i6.28393},
urldate = {2024-07-06},
copyright = {Copyright (c) 2024 Association for the Advancement of Artificial Intelligence},
keywords = {ML: Multimodal Learning},
annotation = {CCF: A}
}