File size: 9,689 Bytes
eb1aec4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 |
<h1 align="center"> FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding </h1>
<p align="center">
<a href="https://huggingface.co/datasets/ZhenShiL/MGRS-200k">
<img alt="Hugging Face Dataset" src="https://img.shields.io/badge/๐ค%20Hugging%20Face-Dataset-blue">
</a>
<a href="https://huggingface.co/ZhenShiL/FarSLIP">
<img alt="Hugging Face Model" src="https://img.shields.io/badge/๐ค%20Hugging%20Face-Model-yellow">
</a>
<a href="https://arxiv.org/abs/2511.14901">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2511.14901-b31b1b">
</a>
</p>
## Introduction
We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.
We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.
<figure>
<div align="center">
<img src=assets/model.png width="60%">
</div>
</figure>
## Table of Contents
- [Introduction](#Introduction)
- [Preparation](#Preparation)
- [Installation](#Installation)
- [Checkpoints](#Checkpoints)
- [Dataset](#Dataset)
- [Training](#Training)
- [Testing](#Testing)
- [Open-vocabulary semantic segmentation](#open-vocabulary-semantic-segmentation)
- [Zero-shot scene classification](#zero-shot-scene-classification)
- [Zero-shot image-text retrieval](#zero-shot-image-text-retrieval)
- [Acknowledgement](#Acknowledgement)
- [Citing](#Citing)
## Preparation
### Installation
1. Clone this repository.
~~~shell
git clone git@github.com:NJU-LHRS/FarSLIP.git
cd FarSLIP
~~~
2. Create a new virtual environment.
~~~shell
conda create -n farslip python=3.10
conda activate farslip
~~~
3. Install dependences.
~~~shell
pip install -r requirements.txt
~~~
### Checkpoints
You can download all our checkpoints from [Huggingface](https://huggingface.co/ZhenShiL/FarSLIP), or selectively download them through the links below.
| Model name | ViT-arch. | Test encoder | OVSS mIoU (%) | ZSC top-1 acc. (%) | Download |
|-------------|-----------|--------------|----------------|--------------------|----------------|
| FarSLIP-s1 | ViT-B-32 | Vanilla | 29.87 | 58.64 | [FarSLIP1_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-32.pt?download=true) |
| FarSLIP-s1 | ViT-B-16 | LongCLIP | 35.44 | 61.89 | [FarSLIP1_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-16.pt?download=true) |
| FarSLIP-s2 | ViT-B-32 | Vanilla | 30.49 | 60.12 | [FarSLIP2_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-32.pt?download=true) |
| FarSLIP-s2 | ViT-B-16 | LongCLIP | 35.41 | 62.24 | [FarSLIP2_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-16.pt?download=true) |
### Dataset
FarSLIP is trained in two stages.
+ In the first stage, we use the [RS5M](https://github.com/om-ai-lab/RS5M) dataset. A quick portal to the RS5M dataset: [link](https://huggingface.co/datasets/omlab/RS5M).
+ In the second stage, we use the proposed MGRS-200k dataset, which is available on [Huggingface](https://huggingface.co/datasets/ZhenShiL/MGRS-200k).
[//]: # (<figure>)
[//]: # (<div align="center">)
[//]: # (<img src=assets/dataset.png width="80%">)
[//]: # (</div>)
[//]: # (<figcaption align="center"><em>Examples from MGRS-200k</em></figcaption>)
[//]: # (</figure>)
<p align="center">
<img src="assets/dataset.png" width="100%">
<br>
<em>Examples from MGRS-200k</em>
</p>
## Training
+ Validation data preparation
+ Replace --root-val-img-dir and --val-data in [config.py](./open_clip_train/config.py) with the paths to your [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download) validation dataset ('SkyScript_val_5K_filtered_by_CLIP_openai').
+ Stage1
~~~shell
torchrun --nproc_per_node=4 -m open_clip_train.main \
--train-dataset-name RS5M \
--train-data '/your/path/to/rs5m/{pub11,rs3}-train-{0000..0031}.tar' \
--train-dataset-type webdataset \
--train-num-samples 5070186 \
--method farslip1 \
--use-imagecrop-aug \
--local-method randomcrops \
--warmup 1000 \
--batch-size 40 \
--lr 1e-6 \
--wd 1.0 \
--epochs 1 \
--model ViT-B-16 \
--loss-type global_itc distill \
--distill-align roi2pooled
~~~
+ Stage2
~~~shell
torchrun --nproc_per_node=4 -m open_clip_train.main \
--train-dataset-name MGRS \
--root-train-img-dir '/your/path/to/mgrs/global_imgs/' \
--train-data '/your/path/to/mgrs/text_info.json' \
--train-dataset-type json \
--method farslip2 \
--warmup 250 \
--batch-size 40 \
--lr 4e-9 \
--wd 1.0 \
--epochs 10 \
--model ViT-B-16 \
--loss-type global_itc local_itc \
--local-itc-align cls
~~~
## Testing
### Open-vocabulary semantic segmentation
+ Please checkout [FarSLIP-OVSS](https://github.com/NJU-LHRS/FarSLIP-OVSS) for evaluation of open-vocabulary semantic segmentation in RS images.
<p align="center">
<img src="assets/ovss.png" width="100%">
<br>
<em>
OVSS accuracies across RS benchmarks (mIoU, %). G denotes general-domain models, and RS refers to RS-specific models.
f. indicates models specifically designed with fine-grained optimization. All models use an input image size of 224, except TIPS (448)
</em>
</p>
### Zero-shot scene classification
+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', 'nwpu'.
+ Replace the BENCHMARK_DATASET_ROOT_DIR in [tests/test_scene_classification.py](./tests/test_scene_classification.py) to your own path.
+ Run testing:
+ FarSLIP-s1
```
python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
```
<!-- + FarSLIP-s2 with vanilla CLIP text encoder
```
python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2_VC --force-quick-gelu --pretrained checkpoints/FarSLIP2_VC_$VIT.pt
``` -->
+ FarSLIP-s2 with LongCLIP text encoder (supporting long text)
```
python -m tests.test_scene_classification --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
```
- `$VIT` options: `ViT-B-16`, `ViT-B-32`
<figure>
<div align="center">
<img src=assets/classification.png width="100%">
</div>
<figcaption align="center">
<em>Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.</em>
</figcaption>
</figure>
### Zero-shot image-text retrieval
+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for image-text retrieval dataset preparation, including 'RSICD', 'RSITMD', 'ucmcaptions', and ['SkyScript-retrieval'](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download) ('SkyScript_test_30K_filtered_by_CLIP_openai.csv').
+ Replace the DATA_CSV_PATH_DICT, SKYSCRIPT_IMAGE_DIR, RETRIEVAL_IMAGE_DIR in [tests/test_retrieval.py](./tests/test_retrieval.py) to your own path.
+ Run testing:
+ FarSLIP-s1
```
python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_$VIT.pt
```
<!-- + FarSLIP-s2 with vanilla CLIP text encoder
```
python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2_VC --force-quick-gelu --pretrained checkpoints/FarSLIP2_VC_$VIT.pt
``` -->
+ FarSLIP-s2 with LongCLIP text encoder (supporting long text)
```
python -m tests.test_retrieval --model-arch $VIT --model-name FarSLIP2 --force-quick-gelu --pretrained checkpoints/FarSLIP2_$VIT.pt --use-long-clip
```
- `$VIT` options: `ViT-B-16`, `ViT-B-32`
<div align="center">
<img src=assets/retrieval.png width="50%">
</div>
<figcaption align="center">
<em>Comparison of cross-modal retrieval accuracies (%) of different RS-specific CLIP variants across multiple benchmarks. *
indicates models trained with in-hold supervision.</em>
</figcaption>
</figure>
## Acknowledgement
+ We gratitude to the following repositories for their wonderful works: [Open-CLIP](https://github.com/mlfoundations/open_clip), [CLIPSelf](https://github.com/wusize/CLIPSelf), [FineCLIP](https://github.com/Timsty1/FineCLIP), [Long-CLIP](https://github.com/beichenzbc/Long-CLIP), [SkyScript](https://github.com/wangzhecheng/SkyScript), [SegEarth](https://github.com/likyoo/SegEarth-OV).
## Citing
+ If you find our work is useful, please give us ๐ in GitHub and consider cite our paper:
~~~tex
@article{li2025farslip,
title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
journal={arXiv preprint arXiv:2511.14901},
year={2025}
}
~~~
|