---
license: apache-2.0
---

<h1 align="center">MF-RSVLM</h1>

<p align="center">
  <strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.24022" target="_blank">
    <img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
  </a>
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
  </a>
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
  </a>
  <img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>

<p align="center">
  <a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
  <a href="https://arxiv.org/abs/2512.24022">Paper</a> |
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>

> If this project helps you, please give us a star on GitHub.

## Overview

MF-RSVLM is a vision-language model (VLM) for remote sensing. It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages: pretraining for modality alignment, then supervised fine-tuning for instruction following. A minimal, illustrative sketch of this pipeline follows the component list below.

- Visual Encoder: CLIP ViT-L/14, 336 px input
- Projector: 2-layer MLP
- LLM: Vicuna-7B v1.5
- Training: Pretrain (VersaD, 1.4M image-text pairs) + SFT (instruction tuning)
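
The wiring follows the LLaVA recipe: CLIP encodes the image into patch features, the MLP projects them into the LLM embedding space, and the resulting visual tokens are prepended to the text embeddings. The Python sketch below is illustrative only, not code from this repository; the class name `MLPProjector` and the random stand-in tensors are assumptions, while the dimensions (576 patch tokens of size 1024 from CLIP ViT-L/14 at 336 px, 4096-d Vicuna-7B embeddings) follow the components listed above.

```python
# Illustrative sketch of the MF-RSVLM forward pipeline (NOT the repo's actual code).
# Assumed sizes: CLIP ViT-L/14 hidden dim 1024, Vicuna-7B hidden dim 4096.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """2-layer MLP mapping CLIP patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_feats)

# A 336x336 input with 14x14 patches yields 24*24 = 576 patch tokens per image.
patch_feats = torch.randn(1, 576, 1024)   # stand-in for CLIP ViT-L/14 features
visual_tokens = MLPProjector()(patch_feats)
text_embeds = torch.randn(1, 32, 4096)    # stand-in for Vicuna token embeddings

# Visual tokens are prepended to the text sequence and fed to the LLM.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```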

## Contents

- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Install

```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm python=3.10 -y
conda activate mf-rsvlm
pip install -r requirements.txt
```
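
As an optional sanity check before training or inference, confirm that the CUDA build of PyTorch pulled in by `requirements.txt` can see your GPU; this uses only standard PyTorch calls:

```python
# Optional environment check: verify PyTorch is installed and CUDA is visible.
import torch

print(torch.__version__)          # version installed from requirements.txt
print(torch.cuda.is_available())  # should print True on a GPU machine
```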

## Repository Layout

```
MF-RSVLM/
├── mfrsvlm/              # package code
│   ├── model/            # deepstack, builder, consolidate
│   ├── train/            # train_mem.py, train.py, trainer
│   ├── conversation.py
│   ├── constants.py
│   ├── mm_utils.py
│   └── utils.py
├── scripts/              # inference/eval/data-prep helpers + ZeRO configs
│   └── data/
├── checkpoints/          # mf-rsvlm-7b_pretrained, mf-rsvlm-7b_sft
├── models/               # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
├── requirements.txt
└── README.md
```

## Downloads

### Models

| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretraining-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP ViT-L/14-336 | https://huggingface.co/openai/clip-vit-large-patch14-336 | Pretraining-stage vision tower |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Pretraining-stage language tower |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |

### Datasets

- Pretraining data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT
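
All of the above can be fetched with `huggingface_hub` into the directories shown in the repository layout. The sketch below uses the real `snapshot_download` API; the target paths are one reasonable choice that matches the layout, not something the repo mandates:

```python
# Download checkpoints and data with huggingface_hub (pip install huggingface_hub).
# local_dir targets mirror the repository layout above; adjust to taste.
from huggingface_hub import snapshot_download

snapshot_download("FelixKAI/mfrsvlm-7b_sft", local_dir="checkpoints/mfrsvlm-7b_sft")
snapshot_download("lmsys/vicuna-7b-v1.5", local_dir="models/vicuna-7b-v1.5")
snapshot_download("openai/clip-vit-large-patch14-336",
                  local_dir="models/clip-vit-large-patch14-336")
snapshot_download("FelixKAI/RSVLM-SFT", repo_type="dataset",
                  local_dir="scripts/data/RSVLM-SFT")
```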

## Training

MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.

### Pretrain

Run the Slurm script below to start pretraining:

```bash
sh scripts/rs/slurm_pretrain.sh
```

### Supervised Fine-Tuning

Run the Slurm script below to start SFT:

```bash
sh scripts/rs/slurm_finetune.sh
```

## Inference Demos

### Single-Sample Inference (CLI)

Use the lightweight helper to test a single image-question pair. This script loads the model once and prints the response directly in the terminal.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --image-path /path/to/image.png \
    --prompt "What is shown in the image?"
```
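
If you would rather call the model from Python, the package builder can be used directly. The snippet below is a sketch that assumes `mfrsvlm.model.builder` and `mfrsvlm.mm_utils` mirror LLaVA's `load_pretrained_model`, `process_images`, and `tokenizer_image_token` interfaces (the repo is LLaVA-derived); check `scripts/run_mfrsvlm_inference.py` for the exact calls.

```python
# Sketch of programmatic inference, assuming a LLaVA-style API in mfrsvlm.
# The function and constant names below follow LLaVA conventions and may
# differ in this repo; treat them as assumptions, not documented API.
import torch
from PIL import Image

from mfrsvlm.model.builder import load_pretrained_model
from mfrsvlm.mm_utils import process_images, tokenizer_image_token
from mfrsvlm.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN

tokenizer, model, image_processor, _ = load_pretrained_model(
    "checkpoints/mfrsvlm-7b_sft", model_base=None, model_name="mfrsvlm-7b_sft"
)

image = Image.open("/path/to/image.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# For best results, wrap the prompt in the model's conversation template
# (see mfrsvlm/conversation.py); a bare prompt keeps the sketch short.
prompt = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in the image?"
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```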

### Web Demo (Full-Model UI)

Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --host 0.0.0.0 \
    --port 7860
```

Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model's response.
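
For scripted access, the Flask server can also be queried over HTTP. The route and field names below (`/generate`, `image`, `prompt`) are purely hypothetical placeholders; inspect `scripts/run_mf-rsvlm_web_server.py` for the routes it actually registers before using this pattern.

```python
# Hypothetical HTTP client for the web demo; the endpoint and form fields are
# placeholders, NOT the script's documented API (check the script first).
import requests

with open("/path/to/image.png", "rb") as f:
    resp = requests.post(
        "http://localhost:7860/generate",   # placeholder route
        files={"image": f},
        data={"prompt": "What is shown in the image?"},
        timeout=120,
    )
print(resp.text)
```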

**Web UI Result**

![Web UI demo result](images/demo.png)

## Evaluation

We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).

```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval python=3.10 -y
conda activate rseval
pip install -r requirements.txt
```

Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.

## Citation

```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```

## Acknowledgement

We gratefully acknowledge these wonderful works:

- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)