---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---
# 🚀 CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation
Mingzhu Xu¹, Tianxiang Xiao¹, Yutong Liu¹, Haoyu Tang¹, Yupeng Hu¹✉, Liqiang Nie¹

¹Affiliation (please update if needed)
This repository provides the official implementation and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).
🔗 **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
🔗 **Task:** Referring Image Segmentation (RIS)
🔗 **Framework:** PyTorch
---
## 📌 Model Information
### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)
---
### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
- Language-guided object segmentation
- Cross-modal reasoning
- Vision-language alignment
- Scene understanding with textual queries
---
### 3. Project Introduction
Referring Image Segmentation (RIS) aims to segment target objects in an image based on natural language descriptions. The key challenge lies in **fine-grained cross-modal alignment** and **complex reasoning between visual and linguistic modalities**.
**CMIRNet** is a cross-modal interactive reasoning framework that:
- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex relationships
- Improves robustness under ambiguous or complex referring expressions
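As a rough illustration of the interactive-reasoning idea (a minimal sketch, not the exact CMIRNet module — dimensions, head counts, and the residual fusion strategy here are assumptions), bidirectional cross-attention between visual and textual features can be written in PyTorch as:

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Illustrative bidirectional cross-attention between vision and language.

    Simplified sketch only: the real CMIRNet interaction and graph-based
    reasoning modules are more elaborate (see the paper).
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Vision attends to text, and text attends to vision.
        self.txt_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, HW, C) flattened visual tokens; txt: (B, L, C) word features.
        vis2, _ = self.txt_to_vis(query=vis, key=txt, value=txt)  # language-aware visual tokens
        txt2, _ = self.vis_to_txt(query=txt, key=vis, value=vis)  # vision-aware word features
        # Residual connection + normalization keeps both streams stable.
        return self.norm_v(vis + vis2), self.norm_t(txt + txt2)

vis = torch.randn(2, 64, 256)  # e.g. an 8x8 feature map, flattened
txt = torch.randn(2, 12, 256)  # 12 word tokens
v_out, t_out = CrossModalInteraction()(vis, txt)
print(v_out.shape, t_out.shape)  # torch.Size([2, 64, 256]) torch.Size([2, 12, 256])
```

Each modality is updated with information from the other while keeping its own token layout, so the block can be stacked for multi-stage fusion.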
---
### 4. Training Data Source
The model is trained and evaluated on:
- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF
Image data is based on:
- MS COCO 2014 Train Set (83K images)
---
## 🚀 Usage & Basic Inference
### Step 1: Prepare Pre-trained Weights
Download backbone weights:
- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large
---
### Step 2: Dataset Preparation
1. Download COCO 2014 training images
2. Extract to:
```
./data/images/
```
3. Download referring datasets:
```
https://github.com/lichengunc/refer
```
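A quick sanity check before training can catch a misplaced dataset. The helper below verifies the layout sketched above; only `./data/images/` is stated in this README, so the `refer` annotation location is an assumption and may differ in the actual repository:

```python
from pathlib import Path

def check_data_layout(root: str = "./data") -> list[str]:
    """Return the expected data paths that are missing under `root`.

    Note: only `images/` is documented here; the `refer/` subfolder
    name is an assumption for the downloaded annotations.
    """
    base = Path(root)
    expected = [
        base / "images",  # COCO 2014 train images (from this README)
        base / "refer",   # referring-expression annotations (assumed location)
    ]
    return [str(p) for p in expected if not p.exists()]

missing = check_data_layout()
if missing:
    print("Missing paths:", ", ".join(missing))
```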
---
### Step 3: Training
#### ResNet-based Training
```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```
#### Swin-Transformer-based Training
```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```
---
### Step 4: Testing / Inference
#### ResNet-based Testing
```bash
python test_resnet.py --device cuda:0 --resume path/to/weights
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```
#### Swin-Transformer-based Testing
```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```
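RIS results are conventionally reported as overall IoU (cumulative intersection over cumulative union) and mean IoU (averaged per sample). Whether `test_resnet.py` / `test_swin.py` report exactly these is an assumption based on common practice; the sketch below shows how the two metrics differ:

```python
import numpy as np

def ris_iou_metrics(preds, gts):
    """Compute (overall IoU, mean IoU) over paired binary masks.

    Standard RIS evaluation sketch; not taken from this repo's code.
    """
    inter_sum, union_sum, ious = 0, 0, []
    for pred, gt in zip(preds, gts):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
        inter_sum += inter
        union_sum += union
    oiou = inter_sum / union_sum if union_sum > 0 else 1.0
    return float(oiou), float(np.mean(ious))

# Prediction covers half of the ground-truth region:
pred = np.zeros((4, 4)); pred[:2, :2] = 1
gt = np.zeros((4, 4)); gt[:2, :] = 1
print(ris_iou_metrics([pred], [gt]))  # (0.5, 0.5)
```

Overall IoU is dominated by large objects, while mean IoU weights every referred object equally, which is why both are usually reported side by side.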
---
## ⚠️ Limitations & Notes
- For academic research use only
- Performance depends on dataset quality and referring expression clarity
- May degrade under:
- ambiguous language
- complex scenes
- domain shift
- Requires substantial GPU resources for training
---
## 📝⭐️ Citation
```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```
---
## ⭐ Acknowledgement
This work builds upon advances in:
- Vision-language modeling
- Transformer architectures
- Graph neural networks
---
## 📬 Contact
For questions or collaboration, please contact the corresponding author.