<h1 align="center"> FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding </h1>

<p align="center">
<a href="https://huggingface.co/datasets/ZhenShiL/MGRS-200k">
<img alt="Hugging Face Dataset" src="https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-blue">
</a>
<a href="https://huggingface.co/ZhenShiL/FarSLIP">
<img alt="Hugging Face Model" src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow">
</a>
<a href="https://arxiv.org/abs/2511.14901">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2511.14901-b31b1b">
</a>
</p>
## Introduction

We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.

We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.

<figure>
<div align="center">
<img src="assets/model.png" width="60%">
</div>
</figure>
## Table of Contents

- [Introduction](#introduction)
- [Preparation](#preparation)
  - [Installation](#installation)
  - [Checkpoints](#checkpoints)
  - [Dataset](#dataset)
- [Training](#training)
- [Testing](#testing)
  - [Open-vocabulary semantic segmentation](#open-vocabulary-semantic-segmentation)
  - [Zero-shot scene classification](#zero-shot-scene-classification)
  - [Zero-shot image-text retrieval](#zero-shot-image-text-retrieval)
- [Acknowledgement](#acknowledgement)
- [Citing](#citing)
## Preparation

### Installation

1. Clone this repository.
~~~shell
git clone git@github.com:NJU-LHRS/FarSLIP.git
cd FarSLIP
~~~
2. Create a new virtual environment.
~~~shell
conda create -n farslip python=3.10
conda activate farslip
~~~
3. Install dependencies.
~~~shell
pip install -r requirements.txt
~~~
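4. (Optional) As a quick sanity check that the environment resolved correctly, you can verify the core imports. This assumes `requirements.txt` installs PyTorch and OpenCLIP; skip it if your setup vendors its own `open_clip` package.
~~~python
# Optional sanity check: confirm the core dependencies import cleanly.
import torch
import open_clip

print(torch.__version__, open_clip.__version__)
~~~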
### Checkpoints

You can download all our checkpoints from [Hugging Face](https://huggingface.co/ZhenShiL/FarSLIP), or download individual files via the links below.

| Model name | ViT arch. | Test encoder | OVSS mIoU (%) | ZSC top-1 acc. (%) | Download |
|------------|-----------|--------------|---------------|--------------------|----------|
| FarSLIP-s1 | ViT-B-32 | Vanilla | 29.87 | 58.64 | [FarSLIP1_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-32.pt?download=true) |
| FarSLIP-s1 | ViT-B-16 | LongCLIP | 35.44 | 61.89 | [FarSLIP1_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-16.pt?download=true) |
| FarSLIP-s2 | ViT-B-32 | Vanilla | 30.49 | 60.12 | [FarSLIP2_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-32.pt?download=true) |
| FarSLIP-s2 | ViT-B-16 | LongCLIP | 35.41 | 62.24 | [FarSLIP2_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-16.pt?download=true) |
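If you prefer scripted downloads, here is a minimal sketch using `huggingface_hub`. The file names follow the table above; the test commands later in this README expect checkpoints under `checkpoints/`.
~~~python
from huggingface_hub import hf_hub_download

# Fetch one checkpoint from the FarSLIP model repo into ./checkpoints.
path = hf_hub_download(
    repo_id="ZhenShiL/FarSLIP",
    filename="FarSLIP1_ViT-B-16.pt",
    local_dir="checkpoints",
)
print(path)
~~~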
### Dataset

FarSLIP is trained in two stages.
+ In the first stage, we use the [RS5M](https://github.com/om-ai-lab/RS5M) dataset, which is hosted on [Hugging Face](https://huggingface.co/datasets/omlab/RS5M).
+ In the second stage, we use our proposed MGRS-200k dataset, which is available on [Hugging Face](https://huggingface.co/datasets/ZhenShiL/MGRS-200k); a download sketch follows.
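A minimal sketch for pulling MGRS-200k locally with `huggingface_hub`. The directory layout implied by the Stage 2 command below (`global_imgs/` plus `text_info.json`) is our reading of the training flags, not a guarantee, so check the dataset card for the actual structure.
~~~python
from huggingface_hub import snapshot_download

# Download the full MGRS-200k dataset repository (images + text annotations).
local_path = snapshot_download(
    repo_id="ZhenShiL/MGRS-200k",
    repo_type="dataset",
    local_dir="data/MGRS-200k",
)
print(local_path)
~~~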
<p align="center">
<img src="assets/dataset.png" width="100%">
<br>
<em>Examples from MGRS-200k</em>
</p>
## Training

+ Validation data preparation
  + Replace `--root-val-img-dir` and `--val-data` in [config.py](./open_clip_train/config.py) with the paths to your [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download) validation dataset ('SkyScript_val_5K_filtered_by_CLIP_openai'); a hypothetical sketch of these entries follows.
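For illustration only, the two arguments might look like this once pointed at your local copy. The argument names come from the bullet above; the defaults are placeholders, and the actual definitions live in the repo's `config.py`.
~~~python
import argparse

# Hypothetical sketch of the validation-path arguments in config.py;
# set the defaults to your local SkyScript paths.
parser = argparse.ArgumentParser()
parser.add_argument("--root-val-img-dir", type=str,
                    default="/your/path/to/SkyScript/images/")
parser.add_argument("--val-data", type=str,
                    default="/your/path/to/SkyScript_val_5K_filtered_by_CLIP_openai")
~~~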
+ Stage 1
~~~shell
torchrun --nproc_per_node=4 -m open_clip_train.main \
    --train-dataset-name RS5M \
    --train-data '/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar' \
    --train-dataset-type webdataset \
    --train-num-samples 5070186 \
    --method farslip1 \
    --use-imagecrop-aug \
    --local-method randomcrops \
    --warmup 1000 \
    --batch-size 40 \
    --lr 1e-6 \
    --wd 1.0 \
    --epochs 1 \
    --model ViT-B-16 \
    --loss-type global_itc distill \
    --distill-align roi2pooled
~~~
+ Stage 2
~~~shell
torchrun --nproc_per_node=4 -m open_clip_train.main \
    --train-dataset-name MGRS \
    --root-train-img-dir '/your/path/to/mgrs/global_imgs/' \
    --train-data '/your/path/to/mgrs/text_info.json' \
    --train-dataset-type json \
    --method farslip2 \
    --warmup 250 \
    --batch-size 40 \
    --lr 4e-9 \
    --wd 1.0 \
    --epochs 10 \
    --model ViT-B-16 \
    --loss-type global_itc local_itc \
    --local-itc-align cls
~~~
## Testing

### Open-vocabulary semantic segmentation

+ Please check out [FarSLIP-OVSS](https://github.com/NJU-LHRS/FarSLIP-OVSS) for the evaluation of open-vocabulary semantic segmentation on RS images.

<p align="center">
<img src="assets/ovss.png" width="100%">
<br>
<em>
OVSS accuracy across RS benchmarks (mIoU, %). G denotes general-domain models, and RS denotes RS-specific models.
f. indicates models specifically designed with fine-grained optimization. All models use an input image size of 224, except TIPS (448).
</em>
</p>
### Zero-shot scene classification

+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', and 'nwpu'.
+ Replace `BENCHMARK_DATASET_ROOT_DIR` in [tests/test_scene_classification.py](./tests/test_scene_classification.py) with your own path.
+ Run testing (a minimal programmatic loading sketch follows the commands):
  + FarSLIP-s1
    ```shell
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    ```
  + FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    ```shell
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    ```
- `$VIT` options: `ViT-B-16`, `ViT-B-32`
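If you want to poke at a checkpoint outside the test scripts, here is a minimal zero-shot classification sketch using the OpenCLIP API. It assumes the checkpoint loads as a standard quick-GELU ViT-B-16; the class names, prompt template, and image path are placeholders, and the repo's test scripts remain the authoritative evaluation path.
```python
import torch
import open_clip
from PIL import Image

# Load the checkpoint as a quick-GELU ViT-B-16 (mirrors --force-quick-gelu).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16",
    pretrained="checkpoints/FarSLIP1_ViT-B-16.pt",
    force_quick_gelu=True,
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

classes = ["airport", "forest", "harbor"]  # placeholder class names
texts = tokenizer([f"a satellite image of a {c}" for c in classes])
image = preprocess(Image.open("example.png")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```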
<figure>
<div align="center">
<img src="assets/classification.png" width="100%">
</div>
<figcaption align="center">
<em>Comparison of zero-shot classification accuracy (top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.</em>
</figcaption>
</figure>
### Zero-shot image-text retrieval

+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for image-text retrieval dataset preparation, including 'RSICD', 'RSITMD', 'ucmcaptions', and ['SkyScript-retrieval'](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download) ('SkyScript_test_30K_filtered_by_CLIP_openai.csv').
+ Replace `DATA_CSV_PATH_DICT`, `SKYSCRIPT_IMAGE_DIR`, and `RETRIEVAL_IMAGE_DIR` in [tests/test_retrieval.py](./tests/test_retrieval.py) with your own paths.
+ Run testing (a recall@K sketch follows the commands):
  + FarSLIP-s1
    ```shell
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    ```
  + FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    ```shell
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    ```
- `$VIT` options: `ViT-B-16`, `ViT-B-32`
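To make the metric concrete, here is a small sketch of recall@K over precomputed, L2-normalized features. It assumes one caption per image with aligned indices; real benchmarks such as RSICD pair several captions with each image, which the repo's test script handles.
```python
import torch

def recall_at_k(img_feats: torch.Tensor, txt_feats: torch.Tensor, k: int = 1) -> float:
    """Text-to-image recall@K, assuming caption i matches image i."""
    sim = txt_feats @ img_feats.T                       # (N_txt, N_img) cosine similarities
    topk = sim.argsort(dim=-1, descending=True)[:, :k]  # top-k image indices per caption
    gt = torch.arange(txt_feats.shape[0]).unsqueeze(-1) # ground-truth image index per caption
    return (topk == gt).any(dim=-1).float().mean().item()

# Toy usage with random unit-norm features.
feats = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)
print(recall_at_k(feats, feats, k=1))  # identical features -> recall of 1.0
```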
<figure>
<div align="center">
<img src="assets/retrieval.png" width="50%">
</div>
<figcaption align="center">
<em>Comparison of cross-modal retrieval accuracy (%) of different RS-specific CLIP variants across multiple benchmarks. * indicates models trained with in-domain supervision.</em>
</figcaption>
</figure>
## Acknowledgement

+ We are grateful to the following repositories for their excellent work: [Open-CLIP](https://github.com/mlfoundations/open_clip), [CLIPSelf](https://github.com/wusize/CLIPSelf), [FineCLIP](https://github.com/Timsty1/FineCLIP), [Long-CLIP](https://github.com/beichenzbc/Long-CLIP), [SkyScript](https://github.com/wangzhecheng/SkyScript), and [SegEarth-OV](https://github.com/likyoo/SegEarth-OV).
## Citing

+ If you find our work useful, please give us a 🌟 on GitHub and consider citing our paper:
~~~tex
@article{li2025farslip,
  title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
  author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
  journal={arXiv preprint arXiv:2511.14901},
  year={2025}
}
~~~