---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>🚀 CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation</h1>

<p>
<b>Mingzhu Xu</b><sup>1</sup>&nbsp;
<b>Tianxiang Xiao</b><sup>1</sup>&nbsp;
<b>Yutong Liu</b><sup>1</sup>&nbsp;
<b>Haoyu Tang</b><sup>1</sup>&nbsp;
<b>Yupeng Hu</b><sup>1✉</sup>&nbsp;
<b>Liqiang Nie</b><sup>1</sup>
</p>

<p>
<sup>1</sup>Affiliation (Please update if needed)
</p>
</div>

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).

🔗 **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
🔗 **Task:** Referring Image Segmentation (RIS)
🔗 **Framework:** PyTorch

---

## 📌 Model Information

### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the target object in an image based on a natural language description. The key challenges lie in **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a cross-modal interactive reasoning framework that:

- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex object relationships
- Improves robustness under ambiguous or complex referring expressions

+ ---
68
+
69
+ ### 4. Training Data Source
70
+
71
+ The model is trained and evaluated on:
72
+
73
+ - RefCOCO
74
+ - RefCOCO+
75
+ - RefCOCOg
76
+ - RefCLEF
77
+
78
+ Image data is based on:
79
+
80
+ - MS COCO 2014 Train Set (83K images)
81
+
82
+ ---
83
+
84
+ ## πŸš€ Usage & Basic Inference
85
+
86
+ ### Step 1: Prepare Pre-trained Weights
87
+
88
+ Download backbone weights:
89
+
90
+ - ResNet-50
91
+ - ResNet-101
92
+ - Swin-Transformer-Base
93
+ - Swin-Transformer-Large
94
+
---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images.
2. Extract them to:

   ```
   ./data/images/
   ```

3. Download the referring expression annotations from:

   ```
   https://github.com/lichengunc/refer
   ```

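After both downloads, the data directory should roughly match the layout checked below. The folder names are assumptions: `images` holds the COCO 2014 train images, and the annotation folders follow the naming convention of the lichengunc/refer repository, which may differ in your checkout.

```python
from pathlib import Path

def check_layout(root="./data"):
    """Return the expected dataset folders that are missing under `root`.

    Hypothetical layout: `images` for COCO 2014 train images, plus one
    annotation folder per dataset as named in the lichengunc/refer repo.
    """
    root = Path(root)
    expected = ["images", "refcoco", "refcoco+", "refcocog", "refclef"]
    return [name for name in expected if not (root / name).is_dir()]

print(check_layout())  # lists whatever has not been downloaded/extracted yet
```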
---

### Step 3: Training

#### ResNet-based Training
```bash
# RefCOCO (default dataset)
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0

# RefCOCO+
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+

# RefCOCOg (UMD split)
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training
```bash
# RefCOCO (default dataset)
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0

# RefCOCO+
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+

# RefCOCOg (UMD split)
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing
```bash
# RefCOCO (default dataset)
python test_resnet.py --device cuda:0 --resume path/to/weights

# RefCOCO+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+

# RefCOCOg (UMD split)
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing
```bash
# RefCOCO (default dataset)
python test_swin.py --device cuda:0 --resume path/to/weights --window12

# RefCOCO+
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12

# RefCOCOg (UMD split)
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```

---

## ⚠️ Limitations & Notes

- Intended for academic research use only
- Performance depends on dataset quality and the clarity of the referring expressions
- Accuracy may degrade with:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources

---

## 📝 Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross-Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## ⭐ Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## 📬 Contact

For questions or collaboration, please contact the corresponding author.