---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
datasets:
- ZhenShiL/MGRS-200k
- omlab/RS5M
tags:
- remote-sensing
---

# FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

**Paper**: [FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding](https://huggingface.co/papers/2511.14901)

**Code**: [https://github.com/NJU-LHRS/FarSLIP](https://github.com/NJU-LHRS/FarSLIP)

## Introduction

We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval.

We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.

## Checkpoints

You can download all our checkpoints from [Huggingface](https://huggingface.co/ZhenShiL/FarSLIP), or selectively download them through the links below.

| Model name | Architecture | OVSS mIoU (%) | ZSC top-1 accuracy (%) | Download |
|------------|--------------|---------------|------------------------|----------|
| FarSLIP-s1 | ViT-B-32 | 29.87 | 58.64 | [FarSLIP1_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-32.pt?download=true) |
| FarSLIP-s2 | ViT-B-32 | 30.49 | 60.12 | [FarSLIP2_ViT-B-32](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-32.pt?download=true) |
| FarSLIP-s1 | ViT-B-16 | 35.44 | 61.89 | [FarSLIP1_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP1_ViT-B-16.pt?download=true) |
| FarSLIP-s2 | ViT-B-16 | 35.41 | 62.24 | [FarSLIP2_ViT-B-16](https://huggingface.co/ZhenShiL/FarSLIP/resolve/main/FarSLIP2_ViT-B-16.pt?download=true) |

## Dataset

FarSLIP is trained in two stages.

+ In the first stage, we use the [RS5M](https://github.com/om-ai-lab/RS5M) dataset. A quick portal to the RS5M dataset: [link](https://huggingface.co/datasets/omlab/RS5M).
+ In the second stage, we use the proposed MGRS-200k dataset, which is available on [Huggingface](https://huggingface.co/datasets/ZhenShiL/MGRS-200k).
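The checkpoint files in the table follow a regular `FarSLIP{stage}_{architecture}.pt` naming pattern, so they can also be fetched programmatically. The following is a minimal sketch, not part of the official repository, assuming the `huggingface_hub` package is installed. The helper names (`checkpoint_filename`, `download_checkpoint`, `download_mgrs200k`) and the local directory names are hypothetical choices made for illustration.

```python
REPO_ID = "ZhenShiL/FarSLIP"        # model repository linked in the table
DATASET_ID = "ZhenShiL/MGRS-200k"   # stage-2 dataset repository

VALID_STAGES = (1, 2)
VALID_ARCHS = ("ViT-B-32", "ViT-B-16")


def checkpoint_filename(stage: int, arch: str) -> str:
    """Build a checkpoint file name such as 'FarSLIP1_ViT-B-32.pt'."""
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {VALID_STAGES}, got {stage!r}")
    if arch not in VALID_ARCHS:
        raise ValueError(f"arch must be one of {VALID_ARCHS}, got {arch!r}")
    return f"FarSLIP{stage}_{arch}.pt"


def download_checkpoint(stage: int, arch: str, local_dir: str = "checkpoints") -> str:
    """Download one checkpoint and return its local path.

    The default 'checkpoints' directory is an assumption chosen to match
    the path used in the test command of the Usage section.
    """
    from huggingface_hub import hf_hub_download  # imported lazily

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(stage, arch),
        local_dir=local_dir,
    )


def download_mgrs200k(local_dir: str = "data/MGRS-200k") -> str:
    """Download the whole MGRS-200k dataset repository (hypothetical target dir)."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(
        repo_id=DATASET_ID,
        repo_type="dataset",
        local_dir=local_dir,
    )
```

For example, `download_checkpoint(1, "ViT-B-32")` would place the FarSLIP-s1 ViT-B-32 weights at `checkpoints/FarSLIP1_ViT-B-32.pt`.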

*Examples from MGRS-200k.*

## Usage / Testing

Below is a sample usage for zero-shot scene classification, taken directly from the [official GitHub repository](https://github.com/NJU-LHRS/FarSLIP#zero-shot-scene-classification).

### Zero-shot scene classification

+ Please refer to [SkyScript](https://github.com/wangzhecheng/SkyScript?tab=readme-ov-file#download-benchmark-datasets) for scene classification dataset preparation, including `SkyScript_cls`, `aid`, `eurosat`, `fmow`, `millionaid`, `patternnet`, `rsicb`, and `nwpu`.
+ Replace `BENCHMARK_DATASET_ROOT_DIR` in `tests/test_scene_classification.py` with your own path.
+ Run the test (e.g. FarSLIP-s1 with ViT-B-32):

  ```
  python -m tests.test_scene_classification --model-arch ViT-B-32 --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_ViT-B-32.pt
  ```
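For classifying a single image outside the benchmark script, the sketch below mirrors the command's options (`ViT-B-32`, QuickGELU, a local checkpoint path) using the standard `open_clip` Python API. It is an illustration under assumptions, not the repository's own code: it assumes the checkpoint loads with the installed `open_clip` package, which has not been verified here. If loading fails, use the repository's test script instead. The prompt template and the helper names `build_prompts` and `classify_image` are hypothetical.

```python
def build_prompts(class_names, template="a satellite image of {}"):
    """Turn class names into text prompts (the template is an assumption)."""
    return [template.format(name) for name in class_names]


def classify_image(image_path, class_names,
                   checkpoint="checkpoints/FarSLIP1_ViT-B-32.pt",
                   arch="ViT-B-32"):
    """Return a {class_name: probability} dict for one image."""
    # Heavy dependencies are imported lazily so build_prompts works without them.
    import torch
    import open_clip
    from PIL import Image

    # force_quick_gelu=True corresponds to --force-quick-gelu in the command above.
    model, _, preprocess = open_clip.create_model_and_transforms(
        arch, pretrained=checkpoint, force_quick_gelu=True
    )
    model.eval()
    tokenizer = open_clip.get_tokenizer(arch)

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(build_prompts(class_names))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # The factor 100.0 follows the common open_clip usage example.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

    return dict(zip(class_names, probs.tolist()))
```

A call such as `classify_image("scene.jpg", ["airport", "forest", "harbor"])` (with a hypothetical image file) would return one probability per class name. Scores will not necessarily match the benchmark numbers, since the official script may use different prompts and preprocessing.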

*Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.*

## Citation

If you find our work useful, please give us a ⭐ on GitHub and consider citing our paper:

```tex
@article{li2025farslip,
  title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
  author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
  journal={arXiv preprint arXiv:2511.14901},
  year={2025}
}
```