---
license: apache-2.0
---


<h1 align="center">MF-RSVLM</h1>
<p align="center">
  <strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.24022" target="_blank">
    <img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
  </a>
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
  </a>
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
  </a>
  <img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>

<p align="center">
  <a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
  <a href="https://arxiv.org/abs/2512.24022">Paper</a> |
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>

> If this project helps you, please give us a star on GitHub.

## Overview
MF-RSVLM is a remote sensing vision-language model (VLM). It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages for modality alignment and instruction following.

- Visual Encoder: CLIP ViT-L/14 336px
- Projector: 2-layer MLP
- LLM: Vicuna-7B v1.5
- Training: Pretrain (VersaD 1.4M image-text pairs) + SFT (instruction tuning)
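The projector gluing the two towers together is small. As an illustration only (not the repository's actual implementation), here is a minimal NumPy sketch of a two-layer MLP mapping CLIP ViT-L/14 patch features (hidden size 1024; 576 patches at 336px) into Vicuna-7B's 4096-dimensional embedding space. The weights are random placeholders, and the GELU activation follows the common tanh approximation.

```python
import numpy as np

def mlp_projector(patch_features, w1, b1, w2, b2):
    """Two-layer MLP with GELU, in the style of LLaVA-1.5 projectors."""
    h = patch_features @ w1 + b1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

rng = np.random.default_rng(0)
d_vis, d_llm, n_patches = 1024, 4096, 576  # ViT-L/14 @ 336px -> 24x24 patches
feats = rng.normal(size=(n_patches, d_vis)).astype(np.float32)
w1 = rng.normal(scale=0.02, size=(d_vis, d_llm)).astype(np.float32)
b1 = np.zeros(d_llm, dtype=np.float32)
w2 = rng.normal(scale=0.02, size=(d_llm, d_llm)).astype(np.float32)
b2 = np.zeros(d_llm, dtype=np.float32)

tokens = mlp_projector(feats, w1, b1, w2, b2)
print(tokens.shape)  # (576, 4096): one LLM-space token per image patch
```

Pretraining aligns this projector (and any unfrozen components) on image-text pairs; SFT then tunes the stack on instruction data.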

## Contents
- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)


## Install
```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm python=3.10 -y  # adjust the Python version to match requirements.txt
conda activate mf-rsvlm
pip install -r requirements.txt
```
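After installing, a quick sanity check that the core dependencies resolved can save a confusing first run. The package names below are illustrative; consult `requirements.txt` for the authoritative list.

```python
import importlib.util

def check_deps(names):
    """Report which required packages are importable in the current env."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Illustrative names only; see requirements.txt for the real dependency list.
status = check_deps(["torch", "transformers", "deepspeed"])
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```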

## Repository Layout
```
MF-RSVLM/
β”œβ”€β”€ mfrsvlm/               # package code
β”‚   β”œβ”€β”€ model/             # deepstack, builder, consolidate
β”‚   β”œβ”€β”€ train/             # train_mem.py, train.py, trainer
β”‚   β”œβ”€β”€ conversation.py
β”‚   β”œβ”€β”€ constants.py
β”‚   β”œβ”€β”€ mm_utils.py
β”‚   └── utils.py
β”œβ”€β”€ scripts/               # inference/eval/data-prep helpers + ZeRO configs
β”‚   └── data/
β”œβ”€β”€ checkpoints/           # mf-rsvlm-7b_pretrained, mf-rsvlm-7b_sft
β”œβ”€β”€ models/                # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
β”œβ”€β”€ requirements.txt
└── README.md
```

## Downloads
### Models
| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretrain-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP ViT-L/14 336px | https://huggingface.co/openai/clip-vit-large-patch14-336 | Vision tower (pretraining stage) |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Language tower (pretraining stage) |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |

### Datasets
- Pretrain data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT


## Training
MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.

### Pretrain
Run the Slurm script below to start pretraining:
```bash
sh scripts/rs/slurm_pretrain.sh
```

### Supervised Fine-Tuning
Run the Slurm script below to start SFT:
```bash
sh scripts/rs/slurm_finetune.sh
```

## Inference Demos
### Single-Sample Inference (CLI)
Use the lightweight helper to test a single image-question pair. This script loads the model once and prints the response directly in the terminal.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
  --model-path checkpoints/mfrsvlm-7b_sft \
  --image-path /path/to/image.png \
  --prompt "What is shown in the image?"
```
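To run the same check over a folder of images, you can generate one invocation per file from Python. This is a sketch built around the command above (the directory name is a placeholder); note that each call reloads the 7B model, so for large batches a script that loads the model once is preferable.

```python
import pathlib
import subprocess

def build_commands(image_dir, prompt, model_path="checkpoints/mfrsvlm-7b_sft"):
    """Build one CLI invocation per image, mirroring the single-sample command."""
    cmds = []
    for img in sorted(pathlib.Path(image_dir).glob("*.png")):
        cmds.append([
            "python", "scripts/run_mfrsvlm_inference.py",
            "--model-path", model_path,
            "--image-path", str(img),
            "--prompt", prompt,
        ])
    return cmds

if __name__ == "__main__":
    # "demo_images" is a placeholder directory of .png files.
    for cmd in build_commands("demo_images", "What is shown in the image?"):
        subprocess.run(cmd, check=True)
```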


### Web Demo (Full-Model UI)
Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
  --model-path checkpoints/mfrsvlm-7b_sft \
  --host 0.0.0.0 \
  --port 7860
```

Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model response.

**Web UI Result**
![Web UI Result](asserts/result.png)

## Evaluation
We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).

```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval python=3.10 -y  # adjust the Python version as needed
conda activate rseval
pip install -r requirements.txt
```

Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.


## Citation
```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```

## Acknowledgement
We gratefully acknowledge these wonderful works:
- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)