FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
Introduction
We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval. We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.
Preparation
Installation
Clone this repository.
    git clone git@github.com:NJU-LHRS/FarSLIP.git
    cd FarSLIP
Create a new virtual environment.
    conda create -n farslip python=3.10
    conda activate farslip
Install dependencies.
    pip install -r requirements.txt
Checkpoints
You can download all our checkpoints from Huggingface, or selectively download them through the links below.
| Model name | ViT-arch. | Test encoder | OVSS mIoU (%) | ZSC top-1 acc. (%) | Download |
|---|---|---|---|---|---|
| FarSLIP-s1 | ViT-B-32 | Vanilla | 29.87 | 58.64 | FarSLIP1_ViT-B-32 |
| FarSLIP-s1 | ViT-B-16 | LongCLIP | 35.44 | 61.89 | FarSLIP1_ViT-B-16 |
| FarSLIP-s2 | ViT-B-32 | Vanilla | 30.49 | 60.12 | FarSLIP2_ViT-B-32 |
| FarSLIP-s2 | ViT-B-16 | LongCLIP | 35.41 | 62.24 | FarSLIP2_ViT-B-16 |
Dataset
FarSLIP is trained in two stages.
- In the first stage, we use the RS5M dataset. A quick portal to the RS5M dataset: link.
- In the second stage, we use the proposed MGRS-200k dataset, which is available on Huggingface.
Examples from MGRS-200k
Training
Validation data preparation
Stage1
    torchrun --nproc_per_node=4 -m open_clip_train.main \
        --train-dataset-name RS5M \
        --train-data '/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar' \
        --train-dataset-type webdataset \
        --train-num-samples 5070186 \
        --method farslip1 \
        --use-imagecrop-aug \
        --local-method randomcrops \
        --warmup 1000 \
        --batch-size 40 \
        --lr 1e-6 \
        --wd 1.0 \
        --epochs 1 \
        --model ViT-B-16 \
        --loss-type global_itc distill \
        --distill-align roi2pooled
Stage2
    torchrun --nproc_per_node=4 -m open_clip_train.main \
        --train-dataset-name MGRS \
        --root-train-img-dir '/your/path/to/mgrs/global_imgs/' \
        --train-data '/your/path/to/mgrs/text_info.json' \
        --train-dataset-type json \
        --method farslip2 \
        --warmup 250 \
        --batch-size 40 \
        --lr 4e-9 \
        --wd 1.0 \
        --epochs 10 \
        --model ViT-B-16 \
        --loss-type global_itc local_itc \
        --local-itc-align cls
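As a quick sanity check on the Stage1 arguments, the sketch below expands the shard pattern passed to `--train-data` and computes the effective global batch size. It assumes the pattern follows shell-style brace expansion and that `--batch-size` is per process, as in upstream OpenCLIP. `expand_braces` and `global_batch_size` are hypothetical helpers written for illustration and are not part of this repository.

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} lists and {0000..0031} zero-padded ranges (no nesting)."""
    # Split into literal text and brace groups, keeping the groups.
    parts = re.split(r"(\{[^{}]*\})", pattern)
    choices = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            inner = part[1:-1]
            rng = re.fullmatch(r"(\d+)\.\.(\d+)", inner)
            if rng:
                lo, hi = rng.group(1), rng.group(2)
                width = len(lo)  # preserve zero padding
                choices.append(
                    [str(i).zfill(width) for i in range(int(lo), int(hi) + 1)]
                )
            else:
                choices.append(inner.split(","))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


def global_batch_size(num_processes: int, per_process_batch: int) -> int:
    """Effective batch size, assuming --batch-size is per process (assumption)."""
    return num_processes * per_process_batch


shards = expand_braces("/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar")
batch = global_batch_size(num_processes=4, per_process_batch=40)

print(len(shards), "shards, first:", shards[0], "last:", shards[-1])
print("global batch size:", batch)
# Rough optimizer-step estimate for one epoch; the trainer's exact rounding may differ.
print("approx. steps per epoch:", 5070186 // batch)
```

If the number of expanded shards does not match the number of `.tar` files on disk, the data path is the first thing to check before launching a multi-GPU run.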
Testing
Open-vocabulary semantic segmentation
- Please check out FarSLIP-OVSS for evaluation of open-vocabulary semantic segmentation on RS images.
OVSS accuracies across RS benchmarks (mIoU, %). G denotes general-domain models, and RS refers to RS-specific models.
f. indicates models specifically designed with fine-grained optimization. All models use an input image size of 224, except TIPS (448).
Zero-shot scene classification
Please refer to SkyScript for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', 'nwpu'.
Set BENCHMARK_DATASET_ROOT_DIR in tests/test_scene_classification.py to your own path.
Run testing:
- FarSLIP-s1
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
- FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
$VIT options: ViT-B-16, ViT-B-32
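For orientation, zero-shot classification with a CLIP-style model typically scores each image embedding against one text embedding per class and predicts the highest-scoring class. Top-1 accuracy is then the fraction of images whose prediction matches the label. The pure-Python sketch below illustrates this with toy vectors. It is not the code in tests/test_scene_classification.py; the function names and the toy embeddings are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def predict_class(image_emb, class_embs):
    """Return the index of the class embedding most similar to the image embedding."""
    image_emb = l2_normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(image_emb, l2_normalize(class_emb)))
        for class_emb in class_embs
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_embs):
    """Fraction of images whose top-scoring class equals the ground-truth label."""
    correct = sum(
        predict_class(emb, class_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy data (hypothetical): three class "text" embeddings and four image embeddings.
class_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5], [0.1, 0.9]]
labels = [0, 1, 2, 2]

accuracy = top1_accuracy(image_embs, labels, class_embs)
print(f"top-1 accuracy: {accuracy:.2%}")
```

In a real evaluation the class embeddings would come from the text encoder applied to prompts built from the class names, and the image embeddings from the vision encoder.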
Zero-shot image-text retrieval
Please refer to SkyScript for image-text retrieval dataset preparation, including 'RSICD', 'RSITMD', 'ucmcaptions', and 'SkyScript-retrieval' ('SkyScript_test_30K_filtered_by_CLIP_openai.csv').
Set DATA_CSV_PATH_DICT, SKYSCRIPT_IMAGE_DIR, and RETRIEVAL_IMAGE_DIR in tests/test_retrieval.py to your own paths.
Run testing:
- FarSLIP-s1
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
- FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
$VIT options: ViT-B-16, ViT-B-32
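To evaluate several checkpoints in one go, the test commands above can be generated programmatically. The sketch below only builds and prints the command lines (a dry run). The module names, flags, and checkpoint naming scheme are taken from the commands above, while `build_test_command` and its parameters are hypothetical helpers. Whether to pass `--use-long-clip` depends on the checkpoint being tested, so it is left as an explicit argument.

```python
import shlex

VIT_OPTIONS = ("ViT-B-16", "ViT-B-32")
TASKS = ("scene_classification", "retrieval")


def build_test_command(task, stage, vit, use_long_clip=False, ckpt_dir="checkpoints"):
    """Build the argv list for one evaluation run (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if stage not in (1, 2):
        raise ValueError(f"stage must be 1 or 2, got {stage!r}")
    if vit not in VIT_OPTIONS:
        raise ValueError(f"unsupported ViT option: {vit!r}")

    model_name = f"FarSLIP{stage}"
    cmd = [
        "python", "-m", f"tests.test_{task}",
        "--model-arch", vit,
        "--model-name", model_name,
        "--force-quick-gelu",
        "--pretrained", f"{ckpt_dir}/{model_name}_{vit}.pt",
    ]
    if use_long_clip:
        cmd.append("--use-long-clip")
    return cmd


# Dry run: print one command per (task, ViT) pair for the stage-2 checkpoints.
for task in TASKS:
    for vit in VIT_OPTIONS:
        cmd = build_test_command(task, stage=2, vit=vit, use_long_clip=True)
        print(shlex.join(cmd))
```

To actually launch a run, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root, after the dataset paths in the test scripts have been set.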
Acknowledgement
- We are grateful to the following repositories for their wonderful work: Open-CLIP, CLIPSelf, FineCLIP, Long-CLIP, SkyScript, SegEarth.
Citing
If you find our work useful, please give us a star on GitHub and consider citing our paper:
@article{li2025farslip,
  title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
  author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
  journal={arXiv preprint arXiv:2511.14901},
  year={2025}
}