---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation</h1>

<p>
<b>Mingzhu Xu</b><sup>1</sup>
<b>Tianxiang Xiao</b><sup>1</sup>
<b>Yutong Liu</b><sup>1</sup>
<b>Haoyu Tang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1*</sup>
<b>Liqiang Nie</b><sup>1</sup>
</p>

<p>
<sup>1</sup>Affiliation (Please update if needed)
</p>
</div>

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network designed for Referring Image Segmentation (RIS).

- **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
- **Task:** Referring Image Segmentation (RIS)
- **Framework:** PyTorch

---

## Model Information

### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the object in an image that a natural language description refers to. The key challenge lies in **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a Cross-Modal Interactive Reasoning framework that:

- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex relationships
- Improves robustness under ambiguous or complex referring expressions
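The interactive-reasoning idea can be illustrated with a minimal language-to-vision cross-attention sketch in PyTorch. This is a hypothetical module for intuition only; the class name, dimensions, and single-block structure are illustrative assumptions, not CMIRNet's actual implementation:

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Sketch: word features attend over flattened visual features.

    Illustrative only -- CMIRNet's real reasoning blocks (multi-stage
    fusion, graph reasoning) are defined in the repository code.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, L, C) word embeddings
        # vis_feats:  (B, H*W, C) flattened visual feature map
        fused, _ = self.attn(query=text_feats, key=vis_feats, value=vis_feats)
        return self.norm(text_feats + fused)  # residual + norm, shape (B, L, C)

block = CrossModalInteraction()
out = block(torch.randn(2, 20, 256), torch.randn(2, 196, 256))
print(tuple(out.shape))  # (2, 20, 256)
```

Each text token gathers the visual regions it is most compatible with, which is the basic operation that multi-stage fusion stacks and refines.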

---

### 4. Training Data Source

The model is trained and evaluated on:

- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF

Image data is based on:

- MS COCO 2014 train set (83K images)

---

## Usage & Basic Inference

### Step 1: Prepare Pre-trained Weights

Download backbone weights:

- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large
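Downloaded checkpoints are typically loaded with `torch.load` followed by `load_state_dict`. A self-contained sketch of that flow (the layer names and file name are illustrative, not the repository's actual backbone definition):

```python
import torch
from torch import nn

# Stand-in for a backbone stem; the real backbones are ResNet/Swin models.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# Save then reload a checkpoint, mimicking the download-and-load flow.
torch.save(stem.state_dict(), "backbone_stem.pth")
state = torch.load("backbone_stem.pth", map_location="cpu")

# strict=False reports any key mismatches instead of raising.
missing, unexpected = stem.load_state_dict(state, strict=False)
print(len(missing), len(unexpected))  # 0 0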

---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images
2. Extract them to:

```
./data/images/
```

3. Download the referring datasets from:

```
https://github.com/lichengunc/refer
```
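A small script can assemble and verify the directory layout the steps above assume. The subdirectory names under `./data/` are an assumption for illustration; check the repository's data-loading code for the exact paths it expects:

```python
from pathlib import Path

# Assumed layout: COCO images plus one folder per referring dataset.
SUBDIRS = ("images", "refcoco", "refcoco+", "refcocog", "refclef")

root = Path("data")
for sub in SUBDIRS:
    (root / sub).mkdir(parents=True, exist_ok=True)

missing = [sub for sub in SUBDIRS if not (root / sub).is_dir()]
print("layout ok" if not missing else f"missing: {missing}")  # layout ok
```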

---

### Step 3: Training

#### ResNet-based Training
```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0

python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+

python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training
```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0

python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+

python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing
```bash
python test_resnet.py --device cuda:0 --resume path/to/weights

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing
```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```
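RIS methods are conventionally scored with IoU-style metrics between the predicted and ground-truth masks; the test scripts print the metrics reported in the paper. As a reference, IoU for a single pair of binary masks can be computed as follows (a minimal sketch, not the repository's evaluation code):

```python
import torch

def mask_iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Intersection-over-union between two binary segmentation masks."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union else 1.0  # empty-vs-empty counts as a match

pred = torch.zeros(4, 4); pred[:2] = 1  # predicted mask: top two rows
gt = torch.zeros(4, 4);   gt[1:3] = 1   # ground truth: middle two rows
print(round(mask_iou(pred, gt), 4))  # 0.3333
```

Overall IoU accumulates the intersections and unions over the whole test set before dividing, while mean IoU averages per-sample ratios like the one above.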

---

## Limitations & Notes

- For academic research use only
- Performance depends on dataset quality and the clarity of the referring expression
- Accuracy may degrade under:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources

---

## Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## Contact

For questions or collaboration, please contact the corresponding author.