---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>🚀 CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation</h1>

<p>
<b>Mingzhu Xu</b><sup>1</sup>&nbsp;
<b>Tianxiang Xiao</b><sup>1</sup>&nbsp;
<b>Yutong Liu</b><sup>1</sup>&nbsp;
<b>Haoyu Tang</b><sup>1</sup>&nbsp;
<b>Yupeng Hu</b><sup>1✉</sup>&nbsp;
<b>Liqiang Nie</b><sup>1</sup>
</p>

<p>
<sup>1</sup>Affiliation (Please update if needed)
</p>
</div>

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network for Referring Image Segmentation (RIS).

🔗 **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
🔗 **Task:** Referring Image Segmentation (RIS)
🔗 **Framework:** PyTorch

---

## 📌 Model Information

### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the target object in an image based on a natural language description. The key challenges lie in **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a cross-modal interactive reasoning framework that:

- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex object relationships
- Improves robustness under ambiguous or complex referring expressions

+ ---
68
+
69
+ ### 4. Training Data Source
70
+
71
+ The model is trained and evaluated on:
72
+
73
+ - RefCOCO
74
+ - RefCOCO+
75
+ - RefCOCOg
76
+ - RefCLEF
77
+
78
+ Image data is based on:
79
+
80
+ - MS COCO 2014 Train Set (83K images)
81
+
82
+ ---
83
+
84
+ ## πŸš€ Usage & Basic Inference
85
+
86
+ ### Step 1: Prepare Pre-trained Weights
87
+
88
+ Download backbone weights:
89
+
90
+ - ResNet-50
91
+ - ResNet-101
92
+ - Swin-Transformer-Base
93
+ - Swin-Transformer-Large
94
+
---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images.
2. Extract them to:

   ```
   ./data/images/
   ```

3. Download the referring expression annotations from:

   ```
   https://github.com/lichengunc/refer
   ```

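After both downloads, the data directory should roughly match the layout checked below. The folder names are assumptions: `images` holds the COCO 2014 train images, and the annotation folders follow the naming convention of the lichengunc/refer repository, which may differ in your checkout.

```python
from pathlib import Path

def check_layout(root="./data"):
    """Return the expected dataset folders that are missing under `root`.

    Hypothetical layout: `images` for COCO 2014 train images, plus one
    annotation folder per dataset as named in the lichengunc/refer repo.
    """
    root = Path(root)
    expected = ["images", "refcoco", "refcoco+", "refcocog", "refclef"]
    return [name for name in expected if not (root / name).is_dir()]

print(check_layout())  # lists whatever has not been downloaded/extracted yet
```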
---

### Step 3: Training

#### ResNet-based Training
```bash
# RefCOCO (default dataset)
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0

# RefCOCO+
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+

# RefCOCOg (UMD split)
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training
```bash
# RefCOCO (default dataset)
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0

# RefCOCO+
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+

# RefCOCOg (UMD split)
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing
```bash
# RefCOCO (default dataset)
python test_resnet.py --device cuda:0 --resume path/to/weights

# RefCOCO+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+

# RefCOCOg (UMD split)
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing
```bash
# RefCOCO (default dataset)
python test_swin.py --device cuda:0 --resume path/to/weights --window12

# RefCOCO+
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12

# RefCOCOg (UMD split)
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```

---

## ⚠️ Limitations & Notes

- Intended for academic research use only
- Performance depends on dataset quality and the clarity of the referring expressions
- Accuracy may degrade with:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources

---

## 📝 Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross-Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## ⭐ Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## 📬 Contact

For questions or collaboration, please contact the corresponding author.