---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
datasets:
- ZhenShiL/MGRS-200k
- omlab/RS5M
tags:
- remote-sensing
---

<h1 align="center"> FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding </h1> 

<p align="center">
    <a href="https://huggingface.co/datasets/ZhenShiL/MGRS-200k">
        <img alt="Hugging Face Dataset" src="https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-blue">
    </a>
    <a href="https://huggingface.co/ZhenShiL/FarSLIP">
        <img alt="Hugging Face Model" src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow">
    </a>
    <a href="https://huggingface.co/papers/2511.14901">
        <img alt="Hugging Face Paper" src="https://img.shields.io/badge/%F0%9F%97%92%20Paper-2511.14901-b31b1b">
    </a>
</p>

**Paper**: [FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding](https://huggingface.co/papers/2511.14901)

**Code**: [https://github.com/NJU-LHRS/FarSLIP](https://github.com/NJU-LHRS/FarSLIP)

## Introduction
We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.
We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.

<figure>
<div align="center">
<img src="https://github.com/NJU-LHRS/FarSLIP/raw/main/assets/model.png" width="60%">
</div>
</figure>

## Checkpoints
You can download all our checkpoints from [Hugging Face](https://huggingface.co/ZhenShiL/FarSLIP), or selectively download them through the links below.

| Model name  | Architecture | Open-vocabulary segmentation (OVSS) mIoU (%) | Zero-shot classification (ZSC) top-1 acc. (%) | Download |
|-------------|--------------|----------------------------------------------|-----------------------------------------------|----------|
| FarSLIP-s1  | ViT-B-32     | 29.87         | 58.64                  | [FarSLIP1_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-32.pt?download=true) |
| FarSLIP-s2  | ViT-B-32     | 30.49         | 60.12                  | [FarSLIP2_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-32.pt?download=true) |
| FarSLIP-s1  | ViT-B-16     | 35.44         | 61.89                  | [FarSLIP1_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-16.pt?download=true) |
| FarSLIP-s2  | ViT-B-16     | 35.41         | 62.24                  | [FarSLIP2_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-16.pt?download=true) |
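If you prefer a programmatic download, the checkpoints can also be fetched with `huggingface_hub`. This is a minimal sketch (not from the official repository); the file names follow the Download column above, and the `checkpoints/` target directory simply mirrors the path used in the test command further below.

```python
# Sketch: download a FarSLIP checkpoint from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="ZhenShiL/FarSLIP",
    filename="FarSLIP1_ViT-B-32.pt",   # any file name from the table above
    local_dir="checkpoints",           # optional: matches the test command's path
)
print(ckpt_path)
```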

## Dataset
FarSLIP is trained in two stages.
+ In the first stage, we use the [RS5M](https://github.com/om-ai-lab/RS5M) dataset, which is also available on [Hugging Face](https://huggingface.co/datasets/omlab/RS5M).
+ In the second stage, we use our proposed MGRS-200k dataset, which is available on [Hugging Face](https://huggingface.co/datasets/ZhenShiL/MGRS-200k) and can be inspected as sketched below.
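A quick way to look at MGRS-200k is through the `datasets` library. This is a hedged sketch: it assumes the repository is loadable with the generic `datasets` loader and a default configuration; consult the dataset card if the actual schema or loading procedure differs.

```python
# Sketch: inspect the MGRS-200k dataset (assumes the generic `datasets` loader works).
from datasets import load_dataset

ds = load_dataset("ZhenShiL/MGRS-200k")
print(ds)  # shows the available splits and column names
```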

<p align="center">
  <img src="https://github.com/NJU-LHRS/FarSLIP/raw/main/assets/dataset.png" width="100%">
  <br>
  <em>Examples from MGRS-200k</em>
</p>

## Usage / Testing

Below is a usage example for zero-shot scene classification, taken directly from the [official GitHub repository](https://github.com/NJU-LHRS/FarSLIP#zero-shot-scene-classification).

### Zero-shot scene classification
+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for scene classification dataset preparation, including `SkyScript_cls`, `aid`, `eurosat`, `fmow`, `millionaid`, `patternnet`, `rsicb`, and `nwpu`.
+ Set `BENCHMARK_DATASET_ROOT_DIR` in `tests/test_scene_classification.py` to your own dataset path.

+ Run testing (e.g. FarSLIP-s1 with ViT-B-32):
```bash
python -m tests.test_scene_classification --model-arch ViT-B-32 --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_ViT-B-32.pt
```
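For a rough idea of what happens under the hood, here is a minimal zero-shot inference sketch using the `open_clip` API. It assumes the checkpoint loads with the stock `ViT-B-32-quickgelu` architecture (mirroring `--force-quick-gelu` above); the image path and class labels are purely illustrative, and if the stock package rejects the checkpoint, use the model code bundled in the FarSLIP repository instead.

```python
# Sketch: zero-shot scene classification with the open_clip API (assumptions noted above).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="checkpoints/FarSLIP1_ViT-B-32.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")
model.eval()

classes = ["airport", "farmland", "forest", "harbor"]  # illustrative labels
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer([f"a satellite image of a {c}" for c in classes])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classes, probs[0].tolist())))
```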

<figure>
<div align="center">
<img src="https://github.com/NJU-LHRS/FarSLIP/raw/main/assets/classification.png" width="100%">
</div>
<figcaption align="center">
<em>Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.</em>
</figcaption>
</figure>

## Citation
If you find our work useful, please give us a ⭐ on GitHub and consider citing our paper:

```tex
@article{li2025farslip,
  title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
  author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
  journal={arXiv preprint arXiv:2511.14901},
  year={2025}
}
```