---
license: apache-2.0
---
<h1 align="center">MF-RSVLM</h1>
<p align="center">
<strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>
<p align="center">
<a href="https://arxiv.org/abs/2512.24022" target="_blank">
<img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
</a>
<a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
</a>
<a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
<img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
</a>
<img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>
<p align="center">
<a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
<a href="https://arxiv.org/abs/2512.24022">Paper</a> |
<a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
<a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>
> If this project helps you, please give us a star on GitHub.
## Overview
MF-RSVLM is a vision-language model (VLM) for remote sensing. It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages: pretraining for modality alignment and supervised fine-tuning for instruction following (a minimal dataflow sketch follows the component list below).
- Visual Encoder: CLIP ViT-L/14 336px
- Projector: 2-layer MLP
- LLM: Vicuna-7B v1.5
- Training: Pretrain (VersaD, 1.4M image-text pairs) + SFT (instruction tuning)
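At inference time the dataflow is: CLIP encodes the image into patch features, the MLP projects them into the LLM embedding space, and Vicuna consumes them as prefix tokens alongside the text prompt. Below is a minimal PyTorch sketch of the projector wiring; the module names and dimensions are illustrative assumptions, not the actual `mfrsvlm` code.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Illustrative 2-layer MLP mapping CLIP features into the LLM space.

    Dimensions are assumptions: CLIP ViT-L/14 emits 1024-d patch features,
    Vicuna-7B uses a 4096-d hidden size.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):      # (B, num_patches, vision_dim)
        return self.mlp(patch_features)     # (B, num_patches, llm_dim)
```

The projected patch tokens are concatenated with the text embeddings and fed to the LLM as a single sequence (LLaVA-style visual prefix).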
## Contents
- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)
## Install
```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm python=3.10 -y  # Python 3.10 is an assumption; pin a version so pip installs into the env
conda activate mf-rsvlm
pip install -r requirements.txt
```
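Before downloading multi-gigabyte checkpoints, a quick sanity check confirms that PyTorch can see your GPU (this assumes `torch` is pulled in by `requirements.txt`):

```python
# sanity_check.py -- verify the environment after installation
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```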
## Repository Layout
```
MF-RSVLM/
├── mfrsvlm/             # package code
│   ├── model/           # deepstack, builder, consolidate
│   ├── train/           # train_mem.py, train.py, trainer
│   ├── conversation.py
│   ├── constants.py
│   ├── mm_utils.py
│   └── utils.py
├── scripts/             # inference/eval/data-prep helpers + ZeRO configs
│   └── data/
├── checkpoints/         # mfrsvlm-7b_pretrained, mfrsvlm-7b_sft
├── models/              # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
├── requirements.txt
└── README.md
```
## Downloads
### Models
| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretraining-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP ViT-L/14 336px | https://huggingface.co/openai/clip-vit-large-patch14-336 | Vision tower (pretraining stage) |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Language tower (pretraining stage) |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |
### Datasets
- Pretrain data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT
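The weights can also be fetched programmatically with `huggingface_hub`; the sketch below mirrors the repository layout above (the local target paths are a suggestion, adjust to your setup):

```python
# download_weights.py -- pull checkpoints into the expected local layout
from huggingface_hub import snapshot_download

# SFT checkpoint used by the inference demos below
snapshot_download("FelixKAI/mfrsvlm-7b_sft",
                  local_dir="checkpoints/mfrsvlm-7b_sft")

# Base towers needed to reproduce training
snapshot_download("openai/clip-vit-large-patch14-336",
                  local_dir="models/clip-vit-large-patch14-336")
snapshot_download("lmsys/vicuna-7b-v1.5",
                  local_dir="models/vicuna-7b-v1.5")
```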
## Training
MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.
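Assuming the standard LLaVA-style recipe that the components above suggest, pretraining freezes both towers and updates only the projector, while SFT additionally unfreezes the LLM. A minimal sketch of that stage-wise freezing (the attribute names are illustrative; the actual logic lives in `mfrsvlm/train/`):

```python
# Illustrative stage-wise freezing, assuming the LLaVA-1.5 recipe;
# model.vision_tower / model.projector / model.llm are placeholder names.
def set_trainable(model, stage: str):
    for p in model.vision_tower.parameters():
        p.requires_grad = False                # CLIP stays frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True                 # projector trains in both stages
    for p in model.llm.parameters():
        p.requires_grad = (stage == "sft")     # LLM updates only during SFT
```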
### Pretrain
Run the Slurm script below to start pretraining:
```bash
sh scripts/rs/slurm_pretrain.sh
```
### Supervised Fine-Tuning
Run the Slurm script below to start SFT:
```bash
sh scripts/rs/slurm_finetune.sh
```
## Inference Demos
### Single-Sample Inference (CLI)
Use the lightweight helper to test a single image-question pair. This script loads the model once and prints the response directly in the terminal.
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
--model-path checkpoints/mfrsvlm-7b_sft \
--image-path /path/to/image.png \
--prompt "What is shown in the image?"
```
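For batch or programmatic use, the same checkpoint can be driven from Python. The sketch below assumes a LLaVA-style `load_pretrained_model` helper in `mfrsvlm.model.builder`; the function name and signature are assumptions based on the repository layout, so check the actual module before use.

```python
# Hypothetical programmatic loading; names assumed from the LLaVA-style layout.
from mfrsvlm.model.builder import load_pretrained_model  # assumed helper

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="checkpoints/mfrsvlm-7b_sft",
    model_base=None,
    model_name="mfrsvlm-7b_sft",
)
# Build the image+prompt inputs and call model.generate(), as done in
# scripts/run_mfrsvlm_inference.py.
```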
### Web Demo (Full-Model UI)
Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
--model-path checkpoints/mfrsvlm-7b_sft \
--host 0.0.0.0 \
--port 7860
```
Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model response.
**Web UI Result**

## Evaluation
We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).
```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval python=3.10 -y  # Python version is an assumption; see the RSEvalKit README
conda activate rseval
pip install -r requirements.txt
```
Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.
## Citation
```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```
## Acknowledgement
We gratefully acknowledge these wonderful works:
- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)