---
license: apache-2.0
---

<h1 align="center">MF-RSVLM</h1>
<p align="center">
  <strong>FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing</strong>
</p>

<p align="center">
  <a href="https://arxiv.org/abs/2512.24022" target="_blank">
    <img src="https://img.shields.io/badge/arXiv-2512.24022-B31B1B.svg" alt="arXiv Badge"/>
  </a>
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow" alt="Hugging Face Model"/>
  </a>
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT" target="_blank">
    <img src="https://img.shields.io/badge/HuggingFace-Dataset-yellow" alt="Hugging Face Dataset"/>
  </a>
  <img src="https://komarev.com/ghpvc/?username=Yunkaidang&color=blue" alt="GitHub Views"/>
</p>

<p align="center">
  <a href="https://github.com/Yunkaidang/RSVLM">Project Page</a> |
  <a href="https://arxiv.org/abs/2512.24022">Paper</a> |
  <a href="https://huggingface.co/FelixKAI/mfrsvlm-7b_sft">Model</a> |
  <a href="https://huggingface.co/datasets/FelixKAI/RSVLM-SFT">Dataset</a>
</p>

> If this project helps you, please give us a star on GitHub.

## Overview
MF-RSVLM is a remote sensing vision-language model (VLM). It combines a CLIP vision encoder, a two-layer MLP projector, and a Vicuna-7B LLM, and is trained in two stages for modality alignment and instruction following.

- Visual Encoder: CLIP ViT-L/14 336px
- Projector: 2-layer MLP
- LLM: Vicuna-7B v1.5
- Training: Pretrain (VersaD 1.4M image-text pairs) + SFT (instruction tuning)

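For readers new to LLaVA-style designs, the sketch below shows how these pieces connect: patch features from the CLIP tower are mapped by the 2-layer MLP projector into the LLM's embedding space and concatenated with the text embeddings. This is a minimal illustration, not the actual `mfrsvlm` code; the dimensions reflect CLIP ViT-L/14 (1024-d features, 576 patches at 336px) and Vicuna-7B (4096-d embeddings).

```python
import torch
import torch.nn as nn

# Minimal sketch of the MF-RSVLM wiring (illustrative only, not the repo's code).
class ProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # 2-layer MLP projector, in the style of LLaVA-1.5's mlp2x projector
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP tower
        return self.mlp(patch_features)  # -> (batch, num_patches, llm_dim)

proj = ProjectorSketch()
image_tokens = proj(torch.randn(1, 576, 1024))  # 576 patches for a 336px ViT-L/14
text_embeds = torch.randn(1, 32, 4096)          # embedded prompt tokens
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096]) -> fed to the LLM
```
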
## Contents
- [Install](#install)
- [Repository Layout](#repository-layout)
- [Downloads](#downloads)
- [Training](#training)
- [Inference Demos](#inference-demos)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Install
```bash
git clone git@github.com:opendatalab/MF-RSVLM.git
cd MF-RSVLM
conda create -n mf-rsvlm python=3.10  # Python version is not pinned upstream; 3.10 is an assumption
conda activate mf-rsvlm
pip install -r requirements.txt
```

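After installing, a quick sanity check confirms that PyTorch can see your GPU. This assumes `requirements.txt` installs `torch` and `transformers`, which is standard for LLaVA-style repos; adjust if your dependency list differs.

```python
# Post-install sanity check (assumes torch and transformers come from requirements.txt).
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```
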
## Repository Layout
```
MF-RSVLM/
├── mfrsvlm/             # package code
│   ├── model/           # deepstack, builder, consolidate
│   ├── train/           # train_mem.py, train.py, trainer
│   ├── conversation.py
│   ├── constants.py
│   ├── mm_utils.py
│   └── utils.py
├── scripts/             # inference/eval/data-prep helpers + ZeRO configs
│   └── data/
├── checkpoints/         # mf-rsvlm-7b_pretrained, mf-rsvlm-7b_sft
├── models/              # vicuna-7b-v1.5, clip-vit-large-patch14-336, llava-mlp2x
├── requirements.txt
└── README.md
```

## Downloads
### Models
| Name | Link | Description |
|---|---|---|
| MF-RSVLM Pretrain | https://huggingface.co/FelixKAI/mf_rsvlm_7b_pretrained | Pretrain-stage checkpoint |
| MF-RSVLM SFT | https://huggingface.co/FelixKAI/mfrsvlm-7b_sft | SFT-stage checkpoint |
| CLIP Pretrain | https://huggingface.co/openai/clip-vit-large-patch14-336 | Vision tower for the pretraining stage |
| Vicuna-7B | https://huggingface.co/lmsys/vicuna-7b-v1.5 | Language tower for the pretraining stage |
| LLaVA-1.5 MLP Projector | https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/tree/main | MLP projector weights |

### Datasets
- Pretrain data: https://huggingface.co/datasets/FitzPC/VHM_VersaD
- SFT data: https://huggingface.co/datasets/FelixKAI/RSVLM-SFT

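One way to place these assets where the repository layout expects them is `huggingface_hub.snapshot_download`. The local directory names below mirror the `models/` and `checkpoints/` folders shown above, but they are assumptions; match them to the paths your training and inference scripts expect.

```python
# Fetch base models and checkpoints into the layout shown in "Repository Layout".
# Local directory names are assumptions; align them with your scripts' paths.
from huggingface_hub import snapshot_download

snapshot_download("lmsys/vicuna-7b-v1.5", local_dir="models/vicuna-7b-v1.5")
snapshot_download("openai/clip-vit-large-patch14-336", local_dir="models/clip-vit-large-patch14-336")
snapshot_download("FelixKAI/mfrsvlm-7b_sft", local_dir="checkpoints/mfrsvlm-7b_sft")
snapshot_download("FelixKAI/RSVLM-SFT", repo_type="dataset", local_dir="data/rsvlm-sft")
```
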
## Training
MF-RSVLM training has two stages: pretraining for modality alignment, and supervised fine-tuning (SFT) for instruction following.

### Pretrain
Run the Slurm script below to start pretraining:
```bash
sh scripts/rs/slurm_pretrain.sh
```

### Supervised Fine-Tuning
Run the Slurm script below to start SFT:
```bash
sh scripts/rs/slurm_finetune.sh
```

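Conceptually, the two stages differ mainly in which parameters are trainable: in the LLaVA-style recipe, stage 1 updates only the projector to align modalities, while stage 2 also unfreezes the LLM for instruction tuning. The sketch below illustrates that pattern with toy stand-in modules; it is not the repo's actual trainer (see `mfrsvlm/train/` and the Slurm scripts for the real configuration).

```python
import torch.nn as nn

# Toy stand-ins for the three components; see mfrsvlm/ for the real modules.
class ToyVLM(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.vision_tower = nn.Linear(dim, dim)  # stand-in for CLIP ViT-L/14
        self.projector = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.llm = nn.Linear(dim, dim)           # stand-in for Vicuna-7B

def configure_stage(model: ToyVLM, stage: int) -> None:
    """Stage 1: train projector only. Stage 2: projector + LLM; vision tower stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.llm.parameters():
            p.requires_grad = True

model = ToyVLM()
configure_stage(model, stage=1)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, "trainable params in stage 1")
```
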
## Inference Demos
### Single-Sample Inference (CLI)
Use the lightweight helper to test a single image-question pair. This script loads the model once and prints the response directly in the terminal.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mfrsvlm_inference.py \
  --model-path checkpoints/mfrsvlm-7b_sft \
  --image-path /path/to/image.png \
  --prompt "What is shown in the image?"
```

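To run the same CLI over a folder of images, a small wrapper like the one below works; it only reuses the flags documented above, and the `demo_images/` folder name is a placeholder. Note that each invocation reloads the 7B checkpoint, so for many repeated queries the web demo in the next section is the faster option.

```python
# Hypothetical batch driver for the documented CLI; each call reloads the model,
# so prefer the web demo for interactive, repeated queries.
import os
import pathlib
import subprocess

for image in sorted(pathlib.Path("demo_images").glob("*.png")):  # placeholder folder
    result = subprocess.run(
        [
            "python", "scripts/run_mfrsvlm_inference.py",
            "--model-path", "checkpoints/mfrsvlm-7b_sft",
            "--image-path", str(image),
            "--prompt", "What is shown in the image?",
        ],
        capture_output=True,
        text=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
    )
    print(image.name, "->", result.stdout.strip())
```
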
### Web Demo (Full-Model UI)
Start a simple Flask web interface for interactive evaluation. The server loads the checkpoint once, then serves a browser UI for repeated queries.

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_mf-rsvlm_web_server.py \
  --model-path checkpoints/mfrsvlm-7b_sft \
  --host 0.0.0.0 \
  --port 7860
```

Open `http://localhost:7860` in your browser, upload an image, and enter a question to get the model's response.

**Web UI Result**
![Web UI Result](asserts/result.png)

## Evaluation
We provide a dedicated evaluation toolkit: [RSEvalKit](https://github.com/fitzpchao/RSEvalKit).

```bash
git clone https://github.com/fitzpchao/RSEvalKit
cd RSEvalKit
conda create -n rseval python=3.10  # Python version is not pinned upstream; 3.10 is an assumption
conda activate rseval
pip install -r requirements.txt
```

Download the [model weights and datasets](#downloads), then follow the RSEvalKit README for one-click evaluation.

## Citation
```bibtex
@article{dang2025fuse,
  title={FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing},
  author={Dang, Yunkai and Wang, Donghao and Yang, Jiacheng and Jiang, Yifan and Zhu, Meiyi and Yang, Yuekun and Wang, Cong and Fan, Qi and Li, Wenbin and Gao, Yang},
  journal={arXiv preprint arXiv:2512.24022},
  year={2025}
}
```

## Acknowledgement
We gratefully acknowledge these wonderful works:
- [Vicuna](https://github.com/lm-sys/FastChat#vicuna-weights)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [ShareGPT4V](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V)
- [LLaMA](https://github.com/facebookresearch/llama)
- [VHM](https://github.com/opendatalab/VHM)