SkyCLIP-ViT-L-14

This repository provides a Hugging Face-format vision encoder (ViT-L-14) port of the original SkyCLIP model.

Original Repository and Links

Description

SkyCLIP is a vision-language foundation model for remote sensing, trained on a large-scale image-text dataset collected from aerial and satellite imagery. This repository provides only the ViT-L-14 vision encoder, converted from the original timm checkpoint to the Hugging Face CLIPVisionModel format.
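
Below is a minimal usage sketch for loading the converted encoder with the transformers library. It assumes the repository id is BiliSakura/SkyCLIP-ViT-L-14 (as named in this card), that the repo ships a compatible image-processor config, and that the input file name is a placeholder.

    # Minimal sketch: load the converted vision encoder with transformers.
    # Assumptions: repo id "BiliSakura/SkyCLIP-ViT-L-14" hosts both the model
    # weights and an image-processor config; "example_aerial_image.jpg" is a
    # hypothetical input path.
    import torch
    from PIL import Image
    from transformers import CLIPVisionModel, CLIPImageProcessor

    model_id = "BiliSakura/SkyCLIP-ViT-L-14"
    model = CLIPVisionModel.from_pretrained(model_id)
    processor = CLIPImageProcessor.from_pretrained(model_id)

    image = Image.open("example_aerial_image.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    pooled = outputs.pooler_output        # (1, hidden_size) global image embedding
    patches = outputs.last_hidden_state   # (1, num_patches + 1, hidden_size) per-token features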

Preprocessing

The recommended image transforms are as follows:

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=True),
    CenterCrop(size=(224, 224)),
    ToTensor(),
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)
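
For reference, the same pipeline can be reproduced with torchvision; this is a sketch using the values listed above, with a hypothetical image path.

    # Sketch: rebuild the recommended preprocessing with torchvision.
    from PIL import Image
    from torchvision import transforms
    from torchvision.transforms import InterpolationMode

    preprocess = transforms.Compose([
        transforms.Resize(224, interpolation=InterpolationMode.BICUBIC, antialias=True),
        transforms.CenterCrop((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                             std=(0.26862954, 0.26130258, 0.27577711)),
    ])

    image = Image.open("example_aerial_image.jpg").convert("RGB")  # hypothetical path
    pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)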

Citation

If you use the SkyCLIP model or dataset, please cite the original work:

@article{wangSkyScriptLargeSemantically2024,
  title   = {SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing},
  author  = {Wang, Zhecheng and Prabha, Rajanie and Huang, Tianyuan and Wu, Jiajun and Rajagopal, Ram},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year    = {2024},
  volume  = {38},
  number  = {6},
  pages   = {5805--5813},
  doi     = {10.1609/aaai.v38i6.28393}
}