---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---
<a id="top"></a>
<div align="center">
<h1>CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation</h1>
<p>
<b>Mingzhu Xu</b><sup>1</sup>
<b>Tianxiang Xiao</b><sup>1</sup>
<b>Yutong Liu</b><sup>1</sup>
<b>Haoyu Tang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1</sup>
<b>Liqiang Nie</b><sup>1</sup>
</p>
<p>
<sup>1</sup>Affiliation (Please update if needed)
</p>
</div>

This repository provides the official implementation and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).

- **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
- **Task:** Referring Image Segmentation (RIS)
- **Framework:** PyTorch

---
## Model Information
### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)
---
### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries
---
### 3. Project Introduction
Referring Image Segmentation (RIS) aims to segment target objects in an image based on natural language descriptions. The key challenge lies in **fine-grained cross-modal alignment** and **complex reasoning between visual and linguistic modalities**.
**CMIRNet** proposes a Cross-Modal Interactive Reasoning framework, which:
- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex relationships
- Improves robustness under ambiguous or complex referring expressions
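
To make the idea of interaction between visual and textual features concrete, here is a minimal, generic sketch of vision-to-language cross-attention in PyTorch. It is not the CMIRNet architecture: the class name `CrossModalAttention`, the feature width of 64, the head count, and the toy tensor shapes are all assumptions chosen for illustration. Each spatial location of the image feature map queries the word features, and padded word positions are masked out.

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Generic vision-to-language cross-attention (illustrative, not CMIRNet's module)."""

    def __init__(self, dim: int = 64, num_heads: int = 4) -> None:
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, words, word_pad_mask=None):
        # visual: (B, H*W, C) flattened image features, used as queries
        # words:  (B, T, C) word features, used as keys and values
        # word_pad_mask: (B, T) boolean, True marks padding tokens to ignore
        attended, weights = self.attn(
            visual, words, words, key_padding_mask=word_pad_mask
        )
        # Residual connection keeps the original visual content
        return self.norm(visual + attended), weights


torch.manual_seed(0)
visual = torch.randn(2, 10 * 10, 64)        # hypothetical 10x10 feature map
words = torch.randn(2, 7, 64)               # hypothetical 7-token expression
pad = torch.zeros(2, 7, dtype=torch.bool)
pad[:, 5:] = True                           # last two tokens are padding

fused, weights = CrossModalAttention()(visual, words, pad)
print(fused.shape, weights.shape)
```

The returned attention weights have one row per spatial location and one column per word, so they can be inspected to see which words a region responds to.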
---
### 4. Training Data Source
The model is trained and evaluated on:
- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF
Image data is based on:
- MS COCO 2014 Train Set (83K images)
---
## Usage & Basic Inference
### Step 1: Prepare Pre-trained Weights
Download backbone weights:
- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large
---
### Step 2: Dataset Preparation
1. Download COCO 2014 training images
2. Extract to:
```
./data/images/
```
3. Download referring datasets:
```
https://github.com/lichengunc/refer
```
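Before training, it can help to confirm that the images were extracted where the scripts expect them. The sketch below counts COCO 2014 training images under `./data/images/`. The helper name `count_coco_train_images` is hypothetical, and the search is recursive because this README does not say whether the images sit directly in that folder or in a subfolder. The expected total of 82,783 is the size of the official COCO 2014 train split.

```python
from pathlib import Path

EXPECTED_TRAIN_IMAGES = 82_783  # size of the official COCO 2014 train split


def count_coco_train_images(image_root: str = "./data/images") -> int:
    """Count files named COCO_train2014_*.jpg anywhere under image_root."""
    root = Path(image_root)
    if not root.is_dir():
        return 0  # directory missing: nothing has been extracted yet
    return sum(1 for _ in root.rglob("COCO_train2014_*.jpg"))


found = count_coco_train_images()
status = "complete" if found == EXPECTED_TRAIN_IMAGES else "incomplete"
print(f"{found}/{EXPECTED_TRAIN_IMAGES} COCO 2014 train images found ({status})")
```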
---
### Step 3: Training
#### ResNet-based Training
```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```
#### Swin-Transformer-based Training
```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```
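The six training commands above follow one pattern, so they can be generated rather than typed. The following is a small sketch that rebuilds them from a table. It uses only the script names and flags shown above; the helper `build_train_commands` and the `CONFIGS` table are hypothetical names. It prints the commands instead of running them.

```python
import shlex

# (dataset, tag used inside --model_id, extra flags), copied from the commands above
CONFIGS = [
    ("refcoco", "refcoco", []),
    ("refcoco+", "refcocop", []),
    ("refcocog", "refcocog", ["--splitBy", "umd"]),
]

SCRIPTS = {"res": "train_resnet.py", "swin": "train_swin.py"}


def build_train_commands(backbone: str, device: str = "cuda:0") -> list[list[str]]:
    """Return one argument list per dataset for the chosen backbone ('res' or 'swin')."""
    commands = []
    for dataset, tag, extra in CONFIGS:
        cmd = ["python", SCRIPTS[backbone],
               "--model_id", f"cmirnet_{tag}_{backbone}",
               "--device", device]
        if dataset != "refcoco":
            # The RefCOCO commands above omit --dataset, so it is omitted here too
            cmd += ["--dataset", dataset]
        commands.append(cmd + extra)
    return commands


for cmd in build_train_commands("swin"):
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.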
---
### Step 4: Testing / Inference
#### ResNet-based Testing
```bash
python test_resnet.py --device cuda:0 --resume path/to/weights
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```
#### Swin-Transformer-based Testing
```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```
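This README does not state which metrics the test scripts report. Referring image segmentation results are commonly summarized with IoU-based scores, so the sketch below assumes two of them: mean IoU (the average of per-sample IoU) and overall IoU (total intersection divided by total union). The function names are hypothetical, and masks are plain nested lists of 0/1 so the example stays dependency-free.

```python
def intersection_and_union(pred, target):
    """Count intersecting and united foreground pixels of two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def summarize_iou(pairs):
    """Return mean IoU and overall IoU for (prediction, ground truth) mask pairs."""
    total_inter = total_union = 0
    per_sample = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks score 0.0 rather than raising an error
        per_sample.append(inter / union if union else 0.0)
    return {
        "mIoU": sum(per_sample) / len(per_sample) if per_sample else 0.0,
        "oIoU": total_inter / total_union if total_union else 0.0,
    }


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # IoU = 4/4
]
print(summarize_iou(pairs))
```

The two scores differ because overall IoU weights each sample by its mask size, while mean IoU treats every sample equally.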
---
## Limitations & Notes
- For academic research use only
- Performance depends on dataset quality and referring expression clarity
- May degrade under:
  - ambiguous language
  - complex scenes
  - domain shift
- Requires substantial GPU resources for training
---
## Citation
```bibtex
@ARTICLE{CMIRNet,
author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
year={2024},
pages={1-1},
keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
doi={10.1109/TCSVT.2024.3508752}
}
```
---
## Acknowledgement
This work builds upon advances in:
- Vision-language modeling
- Transformer architectures
- Graph neural networks
---
## Contact
For questions or collaboration, please contact the corresponding author.