Improve model card: Add pipeline tag, library name, paper, code, and project page links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +151 -120
README.md CHANGED
@@ -1,120 +1,151 @@
1
- ---
2
- license: cc-by-4.0
3
- ---
4
-
5
- # RSCCM: Remote Sensing Change Captioning Model
6
-
7
- ## Overview
8
-
9
- RSCCM is a supervised full-tuning version of [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) that specializes for remote sensing change captioning, which is trained on [RSCC](https://huggingface.co/datasets/BiliSakura/RSCC) dataset. The training details is shown in our [paper](https://github.com/Bili-Sakura/RSCC).
10
-
11
- ## Installation
12
-
13
- Follow Qwen2.5-VL official huggingface repo (see [here](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)).
14
-
15
- ```bash
16
- pip install transformers accelerate # the latest stable version already integrate Qwen2.5-VL
17
- pip install qwen-vl-utils[decord]==0.0.8
18
- ```
19
-
20
- ## Inference
21
-
22
- For more implement details, refer to Qwen-VL official GitHub repo (see [here](https://github.com/QwenLM/Qwen2.5-VL)).
23
-
24
- 1. Load model (the same as Qwen2.5-VL)
25
-
26
- ```python
27
- from transformers import (
28
- Qwen2_5_VLForConditionalGeneration,
29
- AutoProcessor
30
- )
31
- import torch
32
- model_id = "BiliSakura/RSCCM"
33
- model_path = model_id # download from huggingface.co automatically or you can specify as path/to/your/model/folder
34
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
35
- model_path,
36
- torch_dtype=torch.bfloat16,
37
- attn_implementation="flash_attention_2",
38
- ).to("cuda")
39
- processor = AutoProcessor.from_pretrained(model_path)
40
- ```
41
-
42
- 2. Get image pairs
43
-
44
- ```python
45
- from PIL import image
46
- pre_img_path = "path/to/pre/event/image"
47
- post_img_path = "path/to/post/event/image"
48
- text_prompt ="""
49
- Give change description between two satellite images.
50
- Output answer in a news style with a few sentences using precise phrases separated by commas.
51
- """
52
- pre_image = Image.open(pre_img_path)
53
- post_image = Image.open(post_img_path)
54
- ```
55
-
56
- 3. Inference
57
-
58
-
59
- ```python
60
- from qwen_vl_utils import process_vision_info
61
- import torch
62
- messages = [
63
- {
64
- "role": "user",
65
- "content": [
66
- {"type": "image", "image": pre_image},
67
- {"type": "image", "image": post_image},
68
- {
69
- "type": "text",
70
- "text": text_prompt,
71
- },
72
- ],
73
- }
74
- ]
75
-
76
- text = processor.apply_chat_template(
77
- messages, tokenize=False, add_generation_prompt=True
78
- )
79
- image_inputs, _ = process_vision_info(messages)
80
- inputs = processor(
81
- text=[text], images=image_inputs, padding=True, return_tensors="pt"
82
- ).to("cuda", torch.bfloat16)
83
- # Generate captions for the input image pair
84
- generated_ids = model.generate(
85
- **inputs,
86
- max_new_tokens=512,
87
- # temperature=TEMPERATURE
88
- )
89
- generated_ids_trimmed = [
90
- out_ids[len(in_ids) :]
91
- for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
92
- ]
93
- captions = processor.batch_decode(
94
- generated_ids_trimmed,
95
- skip_special_tokens=True,
96
- clean_up_tokenization_spaces=False,
97
- )
98
- change_caption = captions[0]
99
- ```
100
-
101
-
102
- ## πŸ“œ Citation
103
-
104
- ```bibtex
105
- @article{rscc_chen_2025,
106
- title={RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events},
107
- author={Zhenyuan Chen},
108
- year={2025},
109
- howpublished={\url{https://github.com/Bili-Sakura/RSCC}}
110
- }
111
- @article{qwen2.5vl,
112
- title={Qwen2.5-VL Technical Report},
113
- url={http://arxiv.org/abs/2502.13923},
114
- DOI={10.48550/arXiv.2502.13923},
115
- author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
116
- year={2025},
117
- month=feb
118
- }
119
-
120
- ```
1
+ ---
2
+ license: cc-by-4.0
3
+ pipeline_tag: image-to-text
4
+ library_name: transformers
5
+ tags:
6
+ - remote-sensing
7
+ - change-detection
8
+ - image-captioning
9
+ ---
10
+
11
+ # RSCCM: Remote Sensing Change Captioning Model
12
+
13
+ This model (`RSCCM`) is presented in the paper [RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events](https://huggingface.co/papers/2509.01907).
14
+
15
+ - πŸ“„ [Paper](https://huggingface.co/papers/2509.01907)
16
+ - 🌐 [Project Page](https://bili-sakura.github.io/RSCC/)
17
+ - πŸ’» [Code on GitHub](https://github.com/Bili-Sakura/RSCC)
18
+
19
+ ## Overview
20
+
21
+ RSCCM is a supervised, fully fine-tuned version of [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) specialized for remote sensing change captioning, trained on the [RSCC](https://huggingface.co/datasets/BiliSakura/RSCC) dataset. Training details are provided in our [paper](https://huggingface.co/papers/2509.01907).
22
+
23
+ ## Installation
24
+
25
+ Follow the official Qwen2.5-VL Hugging Face repository (see [here](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)).
26
+
27
+ ```bash
28
+ pip install transformers accelerate  # the latest stable release already integrates Qwen2.5-VL
29
+ pip install qwen-vl-utils[decord]==0.0.8
30
+ ```
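+
+ If you want to confirm that your installed `transformers` build already ships the Qwen2.5-VL classes, a quick optional check (a sketch, not part of the official setup) is:
+
+ ```python
+ # Optional sanity check: confirm the installed transformers release includes Qwen2.5-VL.
+ import transformers
+
+ print("transformers version:", transformers.__version__)
+ try:
+     from transformers import Qwen2_5_VLForConditionalGeneration  # noqa: F401
+     print("Qwen2.5-VL classes are available.")
+ except ImportError:
+     print("Qwen2.5-VL classes not found; try `pip install -U transformers`.")
+ ```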
31
+
32
+ ## Inference
33
+
34
+ For more implementation details, refer to the official Qwen2.5-VL GitHub repository (see [here](https://github.com/QwenLM/Qwen2.5-VL)).
35
+
36
+ 1. Load the model (same as for Qwen2.5-VL)
37
+
38
+ ```python
39
+ from transformers import (
40
+ Qwen2_5_VLForConditionalGeneration,
41
+ AutoProcessor
42
+ )
43
+ import torch
44
+ model_id = "BiliSakura/RSCCM"
45
+ model_path = model_id  # downloaded automatically from huggingface.co, or set to path/to/your/model/folder
46
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
47
+ model_path,
48
+ torch_dtype=torch.bfloat16,
49
+ attn_implementation="flash_attention_2",
50
+ ).to("cuda")
51
+ processor = AutoProcessor.from_pretrained(model_path)
52
+ ```
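+
+ The loading call above assumes FlashAttention 2 is installed. If `flash-attn` is not available in your environment, a minimal fallback sketch is to load the model with PyTorch's SDPA attention backend instead (or simply omit the `attn_implementation` argument):
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration
+
+ # Fallback sketch for environments without flash-attn: use the SDPA attention backend.
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "BiliSakura/RSCCM",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="sdpa",  # or drop this argument to use the default backend
+ ).to("cuda")
+ ```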
53
+
54
+ 2. Get image pairs
55
+
56
+ ```python
57
+ from PIL import Image
58
+ pre_img_path = "path/to/pre/event/image"
59
+ post_img_path = "path/to/post/event/image"
60
+ text_prompt = """
61
+ Give change description between two satellite images.
62
+ Output answer in a news style with a few sentences using precise phrases separated by commas.
63
+ """
64
+ pre_image = Image.open(pre_img_path)
65
+ post_image = Image.open(post_img_path)
66
+ ```
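+
+ Satellite tiles are sometimes stored as single-band or RGBA rasters. If that applies to your data, an optional preprocessing step is to force both images to 3-channel RGB before passing them to the processor (the `load_rgb` helper below is a hypothetical sketch, not part of the released code):
+
+ ```python
+ from PIL import Image
+
+ def load_rgb(path: str) -> Image.Image:
+     # Hypothetical helper: convert any source mode (e.g. L, RGBA) to 3-channel RGB.
+     return Image.open(path).convert("RGB")
+
+ pre_image = load_rgb(pre_img_path)
+ post_image = load_rgb(post_img_path)
+ ```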
67
+
68
+ 3. Inference
69
+
70
+
71
+ ```python
72
+ from qwen_vl_utils import process_vision_info
73
+ import torch
74
+ messages = [
75
+ {
76
+ "role": "user",
77
+ "content": [
78
+ {"type": "image", "image": pre_image},
79
+ {"type": "image", "image": post_image},
80
+ {
81
+ "type": "text",
82
+ "text": text_prompt,
83
+ },
84
+ ],
85
+ }
86
+ ]
87
+
88
+ text = processor.apply_chat_template(
89
+ messages, tokenize=False, add_generation_prompt=True
90
+ )
91
+ image_inputs, _ = process_vision_info(messages)
92
+ inputs = processor(
93
+ text=[text], images=image_inputs, padding=True, return_tensors="pt"
94
+ ).to("cuda", torch.bfloat16)
95
+ # Generate captions for the input image pair
96
+ generated_ids = model.generate(
97
+ **inputs,
98
+ max_new_tokens=512,
99
+ # temperature=TEMPERATURE
100
+ )
101
+ generated_ids_trimmed = [
102
+ out_ids[len(in_ids) :]
103
+ for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
104
+ ]
105
+ captions = processor.batch_decode(
106
+ generated_ids_trimmed,
107
+ skip_special_tokens=True,
108
+ clean_up_tokenization_spaces=False,
109
+ )
110
+ change_caption = captions[0]
111
+ ```
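+
+ To caption several pre/post pairs in a loop, the steps above can be wrapped in a small helper. This is a sketch that reuses the `model`, `processor`, and `text_prompt` objects created earlier; `caption_pair` is a hypothetical name, not part of the released code:
+
+ ```python
+ def caption_pair(pre_image, post_image, prompt=text_prompt, max_new_tokens=512):
+     # Hypothetical wrapper around the inference steps above; assumes `model` and `processor` are loaded.
+     messages = [{
+         "role": "user",
+         "content": [
+             {"type": "image", "image": pre_image},
+             {"type": "image", "image": post_image},
+             {"type": "text", "text": prompt},
+         ],
+     }]
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     image_inputs, _ = process_vision_info(messages)
+     inputs = processor(
+         text=[text], images=image_inputs, padding=True, return_tensors="pt"
+     ).to("cuda", torch.bfloat16)
+     generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+     trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+     return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
+
+ change_caption = caption_pair(pre_image, post_image)
+ print(change_caption)
+ ```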
112
+
113
+
114
+ ## πŸ“œ Citation
115
+
116
+ If you find this repository helpful, feel free to cite our paper:
117
+
118
+ ```bibtex
119
+ @misc{chen2025rscclargescaleremotesensing,
120
+ title={RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events},
121
+ author={Zhenyuan Chen and Chenxi Wang and Ningyu Zhang and Feng Zhang},
122
+ year={2025},
123
+ eprint={2509.01907},
124
+ archivePrefix={arXiv},
125
+ primaryClass={cs.CV},
126
+ url={https://arxiv.org/abs/2509.01907},
127
+ }
128
+ @article{qwen2.5vl,
129
+ title={Qwen2.5-VL Technical Report},
130
+ url={http://arxiv.org/abs/2502.13923},
131
+ DOI={10.48550/arXiv.2502.13923},
132
+ author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
133
+ year={2025},
134
+ month=feb
135
+ }
136
+
137
+ ```
138
+
139
+ ## Licensing Information
140
+
141
+ The dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
142
+
143
+ ## πŸ™ Acknowledgement
144
+
145
+ Our RSCC dataset is built on the [xBD](https://www.xview2.org/) and [EBD](https://figshare.com/articles/figure/An_Extended_Building_Damage_EBD_dataset_constructed_from_disaster-related_bi-temporal_remote_sensing_images_/25285009) datasets.
146
+
147
+ We thank [Kimi-VL](https://hf-mirror.com/moonshotai/Kimi-VL-A3B-Instruct), [BLIP-3](https://hf-mirror.com/Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5), [Phi-4-Multimodal](https://hf-mirror.com/microsoft/Phi-4-multimodal-instruct), [Qwen2-VL](https://hf-mirror.com/Qwen/Qwen2-VL-7B-Instruct), [Qwen2.5-VL](https://hf-mirror.com/Qwen/Qwen2.5-VL-72B-Instruct), [LLaVA-NeXT-Interleave](https://hf-mirror.com/llava-hf/llava-interleave-qwen-7b-hf), [LLaVA-OneVision](https://hf-mirror.com/llava-hf/llava-onevision-qwen2-7b-ov-hf), [InternVL 3](https://hf-mirror.com/OpenGVLab/InternVL3-8B), [Pixtral](https://hf-mirror.com/mistralai/Pixtral-12B-2409), [TEOChat](https://github.com/ermongroup/TEOChat), and [CCExpert](https://github.com/Meize0729/CCExpert) for releasing their models and code as open-source contributions.
148
+
149
+ The metric implementations are derived from [huggingface/evaluate](https://github.com/huggingface/evaluate).
150
+
151
+ The training implementation is derived from [QwenLM/Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL).