---
license: apache-2.0
---

<h1 align="center">MF-RSVLM</h1>
<p align="center">
  <strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.24022" target="_blank">
    <img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
  </a>
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
  </a>
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
  </a>
  <img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>

<p align="center">
  <a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
  <a href="https://arxiv.org/abs/2512.24022">Paper</a> |
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>

> If this project helps you, please give us a star on GitHub.

## Overview
MF-RSVLM is a remote sensing vision-language model (VLM). It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages: pretraining for modality alignment, then supervised fine-tuning for instruction following.

- Visual Encoder: CLIP ViT-L/14, 336 px
- Projector: 2-layer MLP
- LLM: Vicuna-7B v1.5
- Training: Pretrain (VersaD, 1.4M image-text pairs) + SFT (instruction tuning)

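These components wire together in the standard LLaVA-1.5 fashion: CLIP patch features are mapped into the LLM's embedding space by the two-layer MLP projector. Below is a minimal sketch of that projector, with dimensions assumed from CLIP ViT-L/14-336 (1024-d features, 576 patches) and Vicuna-7B (4096-d hidden size); the actual module lives in `mfrsvlm/model/`.

```python
import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 emits 1024-d patch features; a 336px
# image at patch size 14 gives a 24x24 = 576-token grid; Vicuna-7B's
# hidden size is 4096.
CLIP_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 576

# LLaVA-1.5-style two-layer MLP projector (Linear -> GELU -> Linear).
projector = nn.Sequential(
    nn.Linear(CLIP_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# One image's worth of CLIP patch features -> LLM-space visual tokens.
clip_features = torch.randn(1, NUM_PATCHES, CLIP_DIM)
visual_tokens = projector(clip_features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

The resulting visual tokens are concatenated with the text embeddings and consumed by the LLM like ordinary tokens.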
## Contents
- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Install
```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm
conda activate mf-rsvlm
pip install -r requirements.txt
```

## Repository Layout
```
MF-RSVLM/
├── mfrsvlm/              # package code
│   ├── model/            # deepstack, builder, consolidate
│   ├── train/            # train_mem.py, train.py, trainer
│   ├── conversation.py
│   ├── constants.py
│   ├── mm_utils.py
│   └── utils.py
├── scripts/              # inference/eval/data-prep helpers + ZeRO configs
│   └── data/
├── checkpoints/          # mf-rsvlm-7b_pretrained, mf-rsvlm-7b_sft
├── models/               # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
├── requirements.txt
└── README.md
```

## Downloads
### Models
| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretrain-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP Pretrain | https://huggingface.co/openai/clip-vit-large-patch14-336 | Vision tower for the pretraining stage |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Language tower for the pretraining stage |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |

### Datasets
- Pretrain data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT

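The repos in the table can be fetched with `huggingface_hub`. A sketch (repo IDs are taken from the table above; the local directory names are assumptions matching the repository layout):

```python
# Repo IDs from the Downloads table, keyed by an assumed local directory.
REPOS = {
    "checkpoints/mf-rsvlm-7b_pretrained": "FelixKAI/mf_rsvlm_7b_pretrained",
    "checkpoints/mfrsvlm-7b_sft": "FelixKAI/mfrsvlm-7b_sft",
    "models/clip-vit-large-patch14-336": "openai/clip-vit-large-patch14-336",
    "models/vicuna-7b-v1.5": "lmsys/vicuna-7b-v1.5",
}

def download_all():
    # Deferred import so the mapping above is usable without huggingface_hub.
    from huggingface_hub import snapshot_download
    for local_dir, repo_id in REPOS.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)

# download_all()  # uncomment to fetch the weights (tens of GB)
```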

## Training
MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.

### Pretrain
Run the Slurm script below to start pretraining:
```bash
sh scripts/rs/slurm_pretrain.sh
```

### Supervised Fine-Tuning
Run the Slurm script below to start SFT:
```bash
sh scripts/rs/slurm_finetune.sh
```

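The two stages differ mainly in which parameters receive gradients. In the usual LLaVA-style recipe (an assumption here; consult the Slurm scripts for the exact settings), stage 1 trains only the projector while the vision tower and LLM stay frozen, and stage 2 additionally unfreezes the LLM:

```python
import torch.nn as nn

# Tiny stand-ins for the real towers, just to show the freezing pattern.
vision_tower = nn.Linear(8, 8)
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 8)

def set_stage(stage: int):
    # Stage 1 (pretrain): train the projector only.
    # Stage 2 (SFT): train projector + LLM; the vision tower stays frozen.
    for p in vision_tower.parameters():
        p.requires_grad = False
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = stage == 2

set_stage(1)  # modality alignment: only the projector is trainable
```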

## Inference Demos
### Single-Sample Inference (CLI)
Use the lightweight helper to test a single image-question pair. The script loads the model once and prints the response directly in the terminal.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --image-path /path/to/image.png \
    --prompt "What is shown in the image?"
```

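Internally, the prompt is wrapped in a Vicuna v1.5 conversation template with an image placeholder, following LLaVA-1.5 conventions. The sketch below is a hypothetical reconstruction; the actual template is defined in `mfrsvlm/conversation.py`.

```python
# Hypothetical reconstruction of the Vicuna-v1.5-style prompt the CLI builds.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(question: str) -> str:
    # "<image>" marks where the projected visual tokens are spliced in.
    return f"{SYSTEM} USER: <image>\n{question} ASSISTANT:"

print(build_prompt("What is shown in the image?"))
```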
### Web Demo (Full-Model UI)
Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
    --model-path checkpoints/mfrsvlm-7b_sft \
    --host 0.0.0.0 \
    --port 7860
```

Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model's response.

**Web UI Result**
![Web UI Result](asserts/result.png)

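The server's endpoints are not documented here, but its load-once/serve-many pattern can be sketched with the standard library alone. Everything below, including the JSON request/response schema and the stub model, is hypothetical illustration, not the actual script.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the model, which the real server loads exactly once at startup.
def answer(prompt: str) -> str:
    return f"(stub response to: {prompt!r})"

class InferHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical JSON contract: {"prompt": ...} -> {"answer": ...}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"answer": answer(payload["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def start_server() -> HTTPServer:
    # Port 0 lets the OS pick a free port; a daemon thread serves requests.
    server = HTTPServer(("127.0.0.1", 0), InferHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The same pattern scales to the real model: the expensive load happens once before the server starts, and each request reuses the resident weights.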
## Evaluation
We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).

```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval
conda activate rseval
pip install -r requirements.txt
```

Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.

## Citation
```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```

## Acknowledgement
We gratefully acknowledge these wonderful works:
- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)