FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

Introduction

We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval. We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.

Preparation

Installation

  1. Clone this repository.

     git clone git@github.com:NJU-LHRS/FarSLIP.git
     cd FarSLIP
    
  2. Create a new virtual environment.

     conda create -n farslip python=3.10
     conda activate farslip
    
  3. Install dependencies.

     pip install -r requirements.txt
    

Checkpoints

You can download all of our checkpoints from Hugging Face, or download them selectively through the links below.

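| Model name | ViT-arch. | Test encoder | OVSS mIoU (%) | ZSC top-1 acc. (%) | Download          |
|------------|-----------|--------------|---------------|--------------------|-------------------|
| FarSLIP-s1 | ViT-B-32  | Vanilla      | 29.87         | 58.64              | FarSLIP1_ViT-B-32 |
| FarSLIP-s1 | ViT-B-16  | LongCLIP     | 35.44         | 61.89              | FarSLIP1_ViT-B-16 |
| FarSLIP-s2 | ViT-B-32  | Vanilla      | 30.49         | 60.12              | FarSLIP2_ViT-B-32 |
| FarSLIP-s2 | ViT-B-16  | LongCLIP     | 35.41         | 62.24              | FarSLIP2_ViT-B-16 |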
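
The test commands later in this README load weights from `checkpoints/FarSLIP{stage}_{arch}.pt`. As a minimal sketch, assuming the downloaded files were saved under a local `checkpoints/` directory with those names, the helper below builds the expected path and rejects combinations that are not in the table. `checkpoint_path` and its `root` argument are hypothetical names introduced here for illustration; they are not part of the repository.

```python
from pathlib import PurePosixPath

# Stage/architecture combinations listed in the checkpoint table.
STAGES = (1, 2)
ARCHS = ("ViT-B-32", "ViT-B-16")


def checkpoint_path(stage: int, arch: str, root: str = "checkpoints") -> str:
    """Build the path pattern used by the test commands.

    `root` is an assumption: it is wherever you placed the downloaded files.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    if arch not in ARCHS:
        raise ValueError(f"unknown architecture {arch!r}; expected one of {ARCHS}")
    return str(PurePosixPath(root) / f"FarSLIP{stage}_{arch}.pt")


# List every path the test commands may ask for.
for stage in STAGES:
    for arch in ARCHS:
        print(checkpoint_path(stage, arch))
```

The printed path can be passed to `--pretrained` in the test commands.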

Dataset

FarSLIP is trained in two stages.

  • In the first stage, we use the RS5M dataset. A quick portal to the RS5M dataset: link.
  • In the second stage, we use the proposed MGRS-200k dataset, which is available on Hugging Face.


Examples from MGRS-200k

Training

  • Validation data preparation

    • Replace --root-val-img-dir and --val-data in config.py with the paths to your SkyScript validation dataset ('SkyScript_val_5K_filtered_by_CLIP_openai').
  • Stage1

      torchrun --nproc_per_node=4 -m open_clip_train.main \
      --train-dataset-name RS5M \
      --train-data '/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar' \
      --train-dataset-type webdataset \
      --train-num-samples 5070186 \
      --method farslip1 \
      --use-imagecrop-aug \
      --local-method randomcrops \
      --warmup 1000 \
      --batch-size 40 \
      --lr 1e-6 \
      --wd 1.0 \
      --epochs 1 \
      --model ViT-B-16 \
      --loss-type global_itc distill \
      --distill-align roi2pooled
    
  • Stage2

      torchrun --nproc_per_node=4 -m open_clip_train.main \
      --train-dataset-name MGRS \
      --root-train-img-dir '/your/path/to/mgrs/global_imgs/' \
      --train-data '/your/path/to/mgrs/text_info.json' \
      --train-dataset-type json \
      --method farslip2 \
      --warmup 250 \
      --batch-size 40 \
      --lr 4e-9 \
      --wd 1.0 \
      --epochs 10 \
      --model ViT-B-16 \
      --loss-type global_itc local_itc \
      --local-itc-align cls
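
To make the Stage1 numbers concrete, the sketch below expands the WebDataset-style brace pattern passed to `--train-data` and works out the global batch size and the approximate number of optimizer steps in the single epoch. It assumes `--batch-size` is the per-GPU batch size, as in upstream open_clip; this README does not state that explicitly. `expand_braces` is a hypothetical helper written for this example, and the exact step count used by the training loop may differ slightly because of how the data loader rounds batches.

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand {a,b} alternatives and {0000..0031} zero-padded ranges."""
    # re.split with a capture group alternates literal text and brace contents.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])  # literal text between braces
        elif ".." in part:
            lo, hi = part.split("..")
            width = len(lo)  # keep the zero padding of the lower bound
            choices.append([str(n).zfill(width) for n in range(int(lo), int(hi) + 1)])
        else:
            choices.append(part.split(","))
    return ["".join(combo) for combo in itertools.product(*choices)]


# Values copied from the Stage1 command.
shards = expand_braces("/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar")
gpus, per_gpu_batch, num_samples, warmup = 4, 40, 5_070_186, 1000

global_batch = gpus * per_gpu_batch           # assumption: --batch-size is per GPU
steps_per_epoch = num_samples // global_batch  # approximate; loader rounding may differ

print(len(shards), shards[0], shards[-1])
print(global_batch, steps_per_epoch, f"{warmup / steps_per_epoch:.1%}")
```

Under these assumptions the pattern covers 64 shard files, the global batch size is 160, and the 1000 warmup steps take up roughly 3% of the epoch.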
    

Testing

Open-vocabulary semantic segmentation

  • Please check out FarSLIP-OVSS for the evaluation of open-vocabulary semantic segmentation on RS images.


OVSS accuracies across RS benchmarks (mIoU, %). G denotes general-domain models, and RS refers to RS-specific models. f. indicates models specifically designed with fine-grained optimization. All models use an input image size of 224, except TIPS (448).
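
For readers unfamiliar with the metric, the sketch below shows one common way to compute mIoU from flattened per-pixel predictions and labels. It is an illustration only: the FarSLIP-OVSS evaluation code may handle ignored labels, class lists, or per-dataset averaging differently. The function name `mean_iou` and the `ignore_index` default of 255 are assumptions made for this example.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union (in %) over classes that appear at least once.

    `pred` and `target` are flat sequences of integer class ids of equal length.
    Predictions are assumed to lie in [0, num_classes).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # The pixel belongs to the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        return 0.0
    return 100.0 * sum(ious) / len(ious)


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 50%.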

Zero-shot scene classification

  • Please refer to SkyScript for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', 'nwpu'.

  • Replace BENCHMARK_DATASET_ROOT_DIR in tests/test_scene_classification.py with your own path.

  • Run testing:

    • FarSLIP-s1
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    
    • FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    
    • $VIT options: ViT-B-16, ViT-B-32

Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.
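
As background on what the top-1 accuracy measures, the sketch below shows the usual CLIP-style zero-shot procedure: each image embedding is compared with one text embedding per class by cosine similarity, and the best-scoring class is the prediction. This is a simplified illustration with hand-made toy vectors standing in for encoder outputs; it is not the code in `tests/test_scene_classification.py`, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = normalize(image_emb)
    scores = [sum(a * b for a, b in zip(img, normalize(t))) for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Percentage of images whose top-scoring class equals the ground-truth label."""
    correct = sum(
        zero_shot_predict(e, class_text_embs) == y for e, y in zip(image_embs, labels)
    )
    return 100.0 * correct / len(labels)


# Toy stand-ins for text embeddings of three class prompts and four image embeddings.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.9], [0.6, 0.5, 0.1]]
labels = [0, 1, 2, 1]
print(top1_accuracy(image_embs, labels, class_text_embs))
```

Three of the four toy images are assigned their labeled class, so the script prints 75.0.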

Zero-shot image-text retrieval

  • Please refer to SkyScript for image-text retrieval dataset preparation, including 'RSICD', 'RSITMD', 'ucmcaptions', and 'SkyScript-retrieval' ('SkyScript_test_30K_filtered_by_CLIP_openai.csv').

  • Replace DATA_CSV_PATH_DICT, SKYSCRIPT_IMAGE_DIR, and RETRIEVAL_IMAGE_DIR in tests/test_retrieval.py with your own paths.

  • Run testing:

    • FarSLIP-s1
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
    
    • FarSLIP-s2 with LongCLIP text encoder (supporting long text)
    python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
    
    • $VIT options: ViT-B-16, ViT-B-32

Comparison of cross-modal retrieval accuracies (%) of different RS-specific CLIP variants across multiple benchmarks. * indicates models trained with in-hold supervision.
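
Retrieval quality is commonly summarized with Recall@K: the share of queries whose correct match appears among the K highest-scoring candidates. The sketch below illustrates that idea under the simplifying assumption of exactly one correct candidate per query; captioning datasets often pair several captions with each image, and this README does not specify how `tests/test_retrieval.py` handles that case. `recall_at_k` is a hypothetical name used only here.

```python
def recall_at_k(similarity, gt_index, k):
    """Recall@K in % for a query-by-candidate similarity matrix.

    similarity[i][j] is the score between query i and candidate j;
    gt_index[i] is the index of the single correct candidate for query i.
    """
    hits = 0
    for row, gt in zip(similarity, gt_index):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += gt in ranked[:k]
    return 100.0 * hits / len(gt_index)


# Toy scores for three queries against three candidates.
similarity = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked third
]
gt_index = [0, 1, 2]
for k in (1, 2, 3):
    print(k, round(recall_at_k(similarity, gt_index, k), 2))
```

For image-to-text retrieval the rows would be images and the columns captions; transposing the matrix gives the text-to-image direction.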

Acknowledgement

Citing

  • If you find our work useful, please give us a 🌟 on GitHub and consider citing our paper:

      @article{li2025farslip,
      title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
      author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
      journal={arXiv preprint arXiv:2511.14901},
      year={2025}
      }