<h1 align="center">MF-RSVLM</h1>
<p align="center">
  <strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.24022" target="_blank">
    <img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
  </a>
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
  </a>
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
  </a>
  <img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>

<p align="center">
  <a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
  <a href="https://arxiv.org/abs/2512.24022">Paper</a> |
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>

> If this project helps you, please give us a star on GitHub.

## Overview
MF-RSVLM is a vision-language model (VLM) for remote sensing. It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages: pretraining for modality alignment, then supervised fine-tuning for instruction following.

- Visual encoder: CLIP ViT-L/14 (336px)
- Projector: 2-layer MLP (sketched below)
- LLM: Vicuna-7B v1.5
- Training: pretraining on VersaD (1.4M image-text pairs) + SFT (instruction tuning)
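
For intuition, here is a minimal PyTorch sketch of a LLaVA-style two-layer MLP projector. The hidden sizes (1024 for CLIP ViT-L/14, 4096 for Vicuna-7B) come from those models' public configs, but this is an illustration only; the actual module in `mfrsvlm/model/` may differ.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Sketch of a LLaVA-style 2-layer MLP (mlp2x_gelu) projector.

    Maps CLIP ViT-L/14 patch features (dim 1024) into the Vicuna-7B
    embedding space (dim 4096). Illustrative, not the repo's exact module.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# A 336px image at patch size 14 yields (336/14)^2 = 576 visual tokens.
tokens = VisionProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```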

## Contents
- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)

## Install
```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm python=3.10 -y  # pin a Python so pip installs into this env
conda activate mf-rsvlm
pip install -r requirements.txt
```
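
After installation, a quick sanity check confirms the environment sees your GPU. This assumes PyTorch is pulled in by `requirements.txt`, as is typical for LLaVA-style repos:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```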

## Repository Layout
```
MF-RSVLM/
├── mfrsvlm/             # package code
│   ├── model/           # deepstack, builder, consolidate
│   ├── train/           # train_mem.py, train.py, trainer
│   ├── conversation.py
│   ├── constants.py
│   ├── mm_utils.py
│   └── utils.py
├── scripts/             # inference/eval/data-prep helpers + ZeRO configs
│   └── data/
├── checkpoints/         # mf-rsvlm-7b_pretrained, mf-rsvlm-7b_sft
├── models/              # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
├── requirements.txt
└── README.md
```

## Downloads
### Models
| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretraining-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP ViT-L/14 (336px) | https://huggingface.co/openai/clip-vit-large-patch14-336 | Vision tower for the pretraining stage |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Language tower for the pretraining stage |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |

### Datasets
- Pretrain data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT
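
For convenience, the weights and data above can be fetched with `huggingface-cli` (install via `pip install -U huggingface_hub`). The target directories below simply mirror the repository layout and are an assumption, not something the scripts mandate:

```bash
# Fetch checkpoints and base towers into the layout shown above.
huggingface-cli download FelixKAI/mfrsvlm-7b_sft --local-dir checkpoints/mfrsvlm-7b_sft
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir models/vicuna-7b-v1.5
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir models/clip-vit-large-patch14-336
# Datasets need --repo-type dataset.
huggingface-cli download --repo-type dataset FelixKAI/RSVLM-SFT --local-dir data/RSVLM-SFT
```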

## Training
MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.

### Pretrain
Run the Slurm script below to start pretraining:
```bash
sh scripts/rs/slurm_pretrain.sh
```

### Supervised Fine-Tuning
Run the Slurm script below to start SFT:
```bash
sh scripts/rs/slurm_finetune.sh
```
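
If you are not on a Slurm cluster, the scripts can usually be adapted to a plain `deepspeed` launch. The sketch below is hypothetical: `train_mem.py` and the ZeRO configs under `scripts/` exist in this repo, but the flag names follow LLaVA conventions and the config filename is an assumption, so read `scripts/rs/slurm_finetune.sh` for the real arguments:

```bash
# Hypothetical non-Slurm SFT launch; flag names are LLaVA-convention assumptions.
deepspeed mfrsvlm/train/train_mem.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path models/vicuna-7b-v1.5 \
    --vision_tower models/clip-vit-large-patch14-336 \
    --output_dir checkpoints/mfrsvlm-7b_sft
```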

## Inference Demos
### Single-Sample Inference (CLI)
Use the lightweight helper to test a single image-question pair. The script loads the model, answers one question, and prints the response directly in the terminal.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --image-path /path/to/image.png \
    --prompt "What is shown in the image?"
```
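
To spot-check several images, a plain shell loop over the same script works. Note that the checkpoint is reloaded on every call, so prefer the web demo below for repeated queries:

```bash
# Run the single-sample CLI over every PNG in a folder. Each iteration
# reloads the model, so this is only sensible for a handful of images.
for img in /path/to/images/*.png; do
  CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
      --model-path checkpoints/mfrsvlm-7b_sft \
      --image-path "$img" \
      --prompt "What is shown in the image?"
done
```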

### Web Demo (Full-Model UI)
Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --host 0.0.0.0 \
    --port 7860
```

Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model's response.
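
If the server runs on a remote GPU machine, standard SSH port forwarding (nothing repo-specific) lets you use the UI from your own browser; `user@remote-host` is a placeholder:

```bash
# Forward the remote demo port locally, then browse to http://localhost:7860.
ssh -N -L 7860:localhost:7860 user@remote-host
```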

**Web UI Result**

![Web UI Result](asserts/result.png)

## Evaluation
We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).

```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval python=3.10 -y  # pin a Python so pip installs into this env
conda activate rseval
pip install -r requirements.txt
```

Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.

## Citation
```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```

## Acknowledgement
We gratefully acknowledge these wonderful works:
- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)