---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

# 🚀 CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

Mingzhu Xu¹, Tianxiang Xiao¹, Yutong Liu¹, Haoyu Tang¹, Yupeng Hu¹✉, Liqiang Nie¹

¹Affiliation (please update if needed)

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network designed for Referring Image Segmentation (RIS).

🔗 **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
🔗 **Task:** Referring Image Segmentation (RIS)
🔗 **Framework:** PyTorch

---

## 📌 Model Information

### 1. Model Name

**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks

- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the target object in an image described by a natural language expression. The key challenges are **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a Cross-Modal Interactive Reasoning framework, which:

- introduces interactive reasoning mechanisms between visual and textual features;
- enhances semantic alignment via multi-stage cross-modal fusion;
- incorporates graph-based reasoning to capture complex relationships;
- improves robustness under ambiguous or complex referring expressions.

---

### 4. Training Data Source

The model is trained and evaluated on:

- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF

Image data comes from the MS COCO 2014 train set (83K images).

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare Pre-trained Weights

Download backbone weights for:

- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large

---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images.
2. Extract them to:

   ```
   ./data/images/
   ```

3. Download the referring expression datasets from:

   ```
   https://github.com/lichengunc/refer
   ```

---

### Step 3: Training

#### ResNet-based Training

```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training

```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing

```bash
python test_resnet.py --device cuda:0 --resume path/to/weights
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing

```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```

---

## ⚠️ Limitations & Notes

- For academic research use only.
- Performance depends on dataset quality and the clarity of referring expressions.
- Accuracy may degrade under:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources.

---

## 📝⭐️ Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## ⭐ Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## 📬 Contact

For questions or collaboration, please contact the corresponding author.
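
---

## 🧪 Illustrative Sketch: Cross-Modal Fusion

The cross-modal fusion described in the Project Introduction can be illustrated as scaled dot-product cross-attention, where linguistic token features query visual region features. The sketch below is a minimal, hedged NumPy illustration of that general idea; the function name, feature dimensions, and shapes are assumptions for demonstration only and do not reflect CMIRNet's actual layers or API.

```python
import numpy as np

def cross_modal_attention(text_feats, visual_feats):
    """Text tokens (queries) attend over visual regions (keys/values).

    text_feats:   (T, d) linguistic token features
    visual_feats: (N, d) flattened visual region features
    Returns:      (T, d) language features enriched with visual context.
    Note: illustrative sketch only, not CMIRNet's actual fusion module.
    """
    d = text_feats.shape[-1]
    # Scaled dot-product attention scores, shape (T, N)
    scores = text_feats @ visual_feats.T / np.sqrt(d)
    # Softmax over visual regions (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate visual features per text token
    return weights @ visual_feats

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))      # 5 word tokens
visual = rng.standard_normal((196, 64))  # 14x14 feature map, flattened
fused = cross_modal_attention(text, visual)
print(fused.shape)  # (5, 64)
```

In CMIRNet this kind of interaction happens at multiple stages of the network rather than in a single step, so each fusion stage refines the alignment produced by the previous one.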
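
---

## 🧪 Illustrative Sketch: Graph-Based Reasoning

The graph-based reasoning component mentioned in the Project Introduction can likewise be illustrated with a single graph-convolution step over region nodes, using a symmetrically normalized adjacency matrix. Again, this is a generic sketch under stated assumptions (node count, feature size, and the `gcn_step` helper are hypothetical), not CMIRNet's actual reasoning layer.

```python
import numpy as np

def gcn_step(node_feats, adj, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    Illustrative sketch only; CMIRNet's graph reasoning may differ.
    """
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degree^-1/2 (>0 due to self-loops)
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ node_feats @ weight, 0.0)  # ReLU activation

rng = np.random.default_rng(1)
nodes = rng.standard_normal((8, 64))            # 8 region nodes
adj = (rng.random((8, 8)) > 0.5).astype(float)  # random relations
adj = np.maximum(adj, adj.T)                    # make the graph undirected
w = rng.standard_normal((64, 64)) * 0.1
out = gcn_step(nodes, adj, w)
print(out.shape)  # (8, 64)
```

Propagating features along such a relation graph is what lets a model resolve relational expressions like "the cup left of the laptop", where the target depends on other objects in the scene.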