---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

# 🚀 CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

Mingzhu Xu¹, Tianxiang Xiao¹, Yutong Liu¹, Haoyu Tang¹, Yupeng Hu¹✉, Liqiang Nie¹

¹Affiliation (please update if needed)

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network designed for Referring Image Segmentation (RIS).

🔗 **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
🔗 **Task:** Referring Image Segmentation (RIS)
🔗 **Framework:** PyTorch

---

## 📌 Model Information

### 1. Model Name

**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks

- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the target object in an image described by a natural language expression. The key challenges are **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a Cross-Modal Interactive Reasoning framework, which:

- introduces interactive reasoning mechanisms between visual and textual features;
- enhances semantic alignment via multi-stage cross-modal fusion;
- incorporates graph-based reasoning to capture complex relationships;
- improves robustness under ambiguous or complex referring expressions.

---

### 4. Training Data Source

The model is trained and evaluated on:

- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF

Image data comes from the MS COCO 2014 train set (83K images).

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare Pre-trained Weights

Download backbone weights for:

- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large

---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images.
2. Extract them to:

   ```
   ./data/images/
   ```

3. Download the referring expression datasets from:

   ```
   https://github.com/lichengunc/refer
   ```

---

### Step 3: Training

#### ResNet-based Training

```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training

```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing

```bash
python test_resnet.py --device cuda:0 --resume path/to/weights
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+
python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing

```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12
python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```

---

## ⚠️ Limitations & Notes

- For academic research use only.
- Performance depends on dataset quality and the clarity of referring expressions.
- Accuracy may degrade under:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources.

---

## 📝⭐️ Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## ⭐ Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## 📬 Contact

For questions or collaboration, please contact the corresponding author.
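
---

## 🧪 Illustrative Sketch: Cross-Modal Fusion

The cross-modal fusion described in the Project Introduction can be illustrated as scaled dot-product cross-attention, where linguistic token features query visual region features. The sketch below is a minimal, hedged NumPy illustration of that general idea; the function name, feature dimensions, and shapes are assumptions for demonstration only and do not reflect CMIRNet's actual layers or API.

```python
import numpy as np

def cross_modal_attention(text_feats, visual_feats):
    """Text tokens (queries) attend over visual regions (keys/values).

    text_feats:   (T, d) linguistic token features
    visual_feats: (N, d) flattened visual region features
    Returns:      (T, d) language features enriched with visual context.
    Note: illustrative sketch only, not CMIRNet's actual fusion module.
    """
    d = text_feats.shape[-1]
    # Scaled dot-product attention scores, shape (T, N)
    scores = text_feats @ visual_feats.T / np.sqrt(d)
    # Softmax over visual regions (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate visual features per text token
    return weights @ visual_feats

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))      # 5 word tokens
visual = rng.standard_normal((196, 64))  # 14x14 feature map, flattened
fused = cross_modal_attention(text, visual)
print(fused.shape)  # (5, 64)
```

In CMIRNet this kind of interaction happens at multiple stages of the network rather than in a single step, so each fusion stage refines the alignment produced by the previous one.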
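
---

## 🧪 Illustrative Sketch: Graph-Based Reasoning

The graph-based reasoning component mentioned in the Project Introduction can likewise be illustrated with a single graph-convolution step over region nodes, using a symmetrically normalized adjacency matrix. Again, this is a generic sketch under stated assumptions (node count, feature size, and the `gcn_step` helper are hypothetical), not CMIRNet's actual reasoning layer.

```python
import numpy as np

def gcn_step(node_feats, adj, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    Illustrative sketch only; CMIRNet's graph reasoning may differ.
    """
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degree^-1/2 (>0 due to self-loops)
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ node_feats @ weight, 0.0)  # ReLU activation

rng = np.random.default_rng(1)
nodes = rng.standard_normal((8, 64))            # 8 region nodes
adj = (rng.random((8, 8)) > 0.5).astype(float)  # random relations
adj = np.maximum(adj, adj.T)                    # make the graph undirected
w = rng.standard_normal((64, 64)) * 0.1
out = gcn_step(nodes, adj, w)
print(out.shape)  # (8, 64)
```

Propagating features along such a relation graph is what lets a model resolve relational expressions like "the cup left of the laptop", where the target depends on other objects in the scene.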