# FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
## Introduction
We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.
We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.
## Table of Contents
- [Introduction](#introduction)
- [Preparation](#preparation)
  - [Installation](#installation)
  - [Checkpoints](#checkpoints)
  - [Dataset](#dataset)
- [Training](#training)
- [Testing](#testing)
  - [Open-vocabulary semantic segmentation](#open-vocabulary-semantic-segmentation)
  - [Zero-shot scene classification](#zero-shot-scene-classification)
  - [Zero-shot image-text retrieval](#zero-shot-image-text-retrieval)
- [Acknowledgement](#acknowledgement)
- [Citing](#citing)
## Preparation
### Installation
1. Clone this repository.
~~~shell
git clone git@github.com:NJU-LHRS/FarSLIP.git
cd FarSLIP
~~~
2. Create a new virtual environment.
~~~shell
conda create -n farslip python=3.10
conda activate farslip
~~~
3. Install dependencies.
~~~shell
pip install -r requirements.txt
~~~
### Checkpoints
You can download all our checkpoints from [Hugging Face](https://huggingface.co/ZhenShiL/FarSLIP), or download individual checkpoints through the links below.
| Model name | ViT-arch. | Text encoder | OVSS mIoU (%) | ZSC top-1 acc. (%) | Download |
|-------------|-----------|--------------|----------------|--------------------|----------------|
| FarSLIP-s1 | ViT-B-32 | Vanilla | 29.87 | 58.64 | [FarSLIP1_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-32.pt?download=true) |
| FarSLIP-s1 | ViT-B-16 | LongCLIP | 35.44 | 61.89 | [FarSLIP1_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-16.pt?download=true) |
| FarSLIP-s2 | ViT-B-32 | Vanilla | 30.49 | 60.12 | [FarSLIP2_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-32.pt?download=true) |
| FarSLIP-s2 | ViT-B-16 | LongCLIP | 35.41 | 62.24 | [FarSLIP2_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-16.pt?download=true) |
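
The download links in the table follow a regular `FarSLIP{stage}_{ViT}.pt` pattern, so fetching a checkpoint can be scripted. Below is a minimal sketch using only the Python standard library. The helper names (`checkpoint_name`, `checkpoint_url`, `download_checkpoint`) are hypothetical and not part of this repository, and the `checkpoints/` output directory is assumed from the `--pretrained` paths used in the testing commands.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL copied from the download links in the table above.
HF_BASE = "https://huggingface.co/ZhenShiL/FarSLIP/resolve/main"
VIT_OPTIONS = ("ViT-B-16", "ViT-B-32")


def checkpoint_name(stage: int, vit: str) -> str:
    """Return the file name, following the FarSLIP{stage}_{vit}.pt pattern."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    if vit not in VIT_OPTIONS:
        raise ValueError(f"vit must be one of {VIT_OPTIONS}")
    return f"FarSLIP{stage}_{vit}.pt"


def checkpoint_url(stage: int, vit: str) -> str:
    """Build the direct download URL for one checkpoint."""
    return f"{HF_BASE}/{checkpoint_name(stage, vit)}?download=true"


def download_checkpoint(stage: int, vit: str, out_dir: str = "checkpoints") -> Path:
    """Download a checkpoint into out_dir (hypothetical helper), skipping existing files."""
    target = Path(out_dir) / checkpoint_name(stage, vit)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(checkpoint_url(stage, vit), str(target))
    return target


# Example (requires network access):
# download_checkpoint(2, "ViT-B-16")
```

Keeping the files under `checkpoints/` means the testing commands can be run without editing their `--pretrained` argument.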
### Dataset
FarSLIP is trained in two stages.
+ In the first stage, we use the [RS5M](https://github.com/om-ai-lab/RS5M) dataset, which can also be downloaded directly from [Hugging Face](https://huggingface.co/datasets/omlab/RS5M).
+ In the second stage, we use the proposed MGRS-200k dataset, which is available on [Hugging Face](https://huggingface.co/datasets/ZhenShiL/MGRS-200k).
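
Both datasets are hosted as Hugging Face dataset repositories, so one possible way to fetch them is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed (it may not be listed in `requirements.txt`). The `data/` root and the helper names are hypothetical. A full RS5M snapshot may be very large, so check the available disk space first.

```python
from pathlib import Path

# Dataset repositories taken from the links in the list above.
DATASET_REPOS = {
    "stage1": "omlab/RS5M",
    "stage2": "ZhenShiL/MGRS-200k",
}


def local_dataset_dir(stage: str, root: str = "data") -> Path:
    """Local target directory for a stage (hypothetical layout: <root>/<repo name>)."""
    repo_id = DATASET_REPOS[stage]
    return Path(root) / repo_id.split("/")[-1]


def download_dataset(stage: str, root: str = "data") -> Path:
    """Download the dataset snapshot for 'stage1' or 'stage2'."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dataset_dir(stage, root)
    snapshot_download(
        repo_id=DATASET_REPOS[stage],
        repo_type="dataset",
        local_dir=str(target),
    )
    return target


# Example (requires network access and huggingface_hub):
# download_dataset("stage2")
```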
OVSS accuracies across RS benchmarks (mIoU, %). G denotes general-domain models, and RS refers to RS-specific models. f. indicates models specifically designed with fine-grained optimization. All models use an input image size of 224, except TIPS (448).
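
For readers unfamiliar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The following is a minimal pure-Python sketch of that definition, not the evaluation code used for the reported numbers. The function name and the `ignore_index=255` default are assumptions, and benchmark toolkits may accumulate statistics over a whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU (%) over classes that appear in the prediction or the ground truth.

    pred, target: flat sequences of integer class ids of equal length.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_area = [0] * num_classes
    target_area = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection[c] / union)

    return 100.0 * sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is about 58.33%.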
### Zero-shot scene classification
+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', 'nwpu'.
+ Set `BENCHMARK_DATASET_ROOT_DIR` in [tests/test_scene_classification.py](./tests/test_scene_classification.py) to your own path.
+ Run testing:
  + FarSLIP-s1
    ```
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    ```
  + FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    ```
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    ```
  - `$VIT` options: `ViT-B-16`, `ViT-B-32`
Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.
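
As background on how such accuracies are typically obtained with CLIP-style models: each class name is embedded by the text encoder, each image by the image encoder, and the predicted class is the one whose text embedding has the highest cosine similarity with the image embedding. The sketch below illustrates this with plain Python lists standing in for embeddings. It is a generic illustration, not the logic of `tests/test_scene_classification.py`, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Top-1 accuracy (%) of zero-shot predictions against integer labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)
```

In practice the embeddings come from the model's encoders and the computation is batched as a matrix product, but the ranking logic is the same.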
### Zero-shot image-text retrieval
+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for image-text retrieval dataset preparation, including 'RSICD', 'RSITMD', 'ucmcaptions', and ['SkyScript-retrieval'](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download) ('SkyScript_test_30K_filtered_by_CLIP_openai.csv').
+ Set `DATA_CSV_PATH_DICT`, `SKYSCRIPT_IMAGE_DIR`, and `RETRIEVAL_IMAGE_DIR` in [tests/test_retrieval.py](./tests/test_retrieval.py) to your own paths.
+ Run testing:
  + FarSLIP-s1
    ```
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    ```
  + FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    ```
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    ```
  - `$VIT` options: `ViT-B-16`, `ViT-B-32`
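
The scene classification and retrieval commands differ only in the test module, the stage, and the ViT architecture, so they can be generated in a loop. The sketch below is a hypothetical driver script, not part of this repository. It mirrors the commands shown above, including passing `--use-long-clip` only for FarSLIP-s2, and assumes the checkpoints are stored under `checkpoints/`. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

VIT_OPTIONS = ("ViT-B-16", "ViT-B-32")
TEST_MODULES = ("tests.test_scene_classification", "tests.test_retrieval")


def build_eval_command(module, stage, vit, ckpt_dir="checkpoints"):
    """Build the argument list for one evaluation run."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    cmd = [
        "python", "-m", module,
        "--model-arch", vit,
        "--model-name", f"FarSLIP{stage}",
        "--force-quick-gelu",
        "--pretrained", f"{ckpt_dir}/FarSLIP{stage}_{vit}.pt",
    ]
    # Mirrors the commands above: only the FarSLIP-s2 runs use the LongCLIP flag.
    if stage == 2:
        cmd.append("--use-long-clip")
    return cmd


def run_all(dry_run=True):
    """Print every command; execute them as well when dry_run is False."""
    commands = [
        build_eval_command(module, stage, vit)
        for module in TEST_MODULES
        for stage in (1, 2)
        for vit in VIT_OPTIONS
    ]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


# Example: run_all(dry_run=True) prints the eight commands without running them.
```

Run it from the repository root so that `python -m tests.…` resolves the test modules, and adjust the flag logic if a checkpoint should be evaluated with a different text encoder.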
Comparison of cross-modal retrieval accuracies (%) of different RS-specific CLIP variants across multiple benchmarks. \* indicates models trained with in-hold supervision.
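
Image-text retrieval is commonly scored with Recall@K: the percentage of queries for which at least one relevant item appears among the K highest-scoring candidates. The text above does not state which exact metric is reported, so the sketch below is an illustration under that assumption, not the logic of `tests/test_retrieval.py`. The function name is hypothetical.

```python
def recall_at_k(similarity, ground_truth, k):
    """Recall@K (%) for a set of queries.

    similarity[i][j]: similarity score between query i and candidate j.
    ground_truth[i]: set of candidate indices that are relevant to query i.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one query")

    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)
```

For image-to-text retrieval the queries are images and the candidates are captions. For text-to-image retrieval the roles are swapped, which corresponds to transposing the similarity matrix.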
## Acknowledgement
+ We are grateful to the following repositories for their wonderful work: [Open-CLIP](https://github.com/mlfoundations/open_clip), [CLIPSelf](https://github.com/wusize/CLIPSelf), [FineCLIP](https://github.com/Timsty1/FineCLIP), [Long-CLIP](https://github.com/beichenzbc/Long-CLIP), [SkyScript](https://github.com/wangzhecheng/SkyScript), [SegEarth](https://github.com/likyoo/SegEarth-OV).
## Citing
+ If you find our work useful, please give us a 🌟 on GitHub and consider citing our paper:
~~~tex
@article{li2025farslip,
title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
journal={arXiv preprint arXiv:2511.14901},
year={2025}
}
~~~