---
license: apache-2.0
tags:
- referring-image-segmentation
- vision-language
- multimodal
- cross-modal-reasoning
- graph-neural-network
- pytorch
---

<a id="top"></a>
<div align="center">
<h1>CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation</h1>

<p>
<b>Mingzhu Xu</b><sup>1</sup>
<b>Tianxiang Xiao</b><sup>1</sup>
<b>Yutong Liu</b><sup>1</sup>
<b>Haoyu Tang</b><sup>1</sup>
<b>Yupeng Hu</b><sup>1*</sup>
<b>Liqiang Nie</b><sup>1</sup>
</p>

<p>
<sup>1</sup>Affiliation (Please update if needed)
</p>
</div>

This repository provides the official implementation details and pre-trained models for **CMIRNet**, a Cross-Modal Interactive Reasoning Network designed for Referring Image Segmentation (RIS).

- **Paper:** IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2024
- **Task:** Referring Image Segmentation (RIS)
- **Framework:** PyTorch

---

## Model Information

### 1. Model Name
**CMIRNet** (Cross-Modal Interactive Reasoning Network)

---

### 2. Task Type & Applicable Tasks
- **Task Type:** Vision-Language / Multimodal Learning
- **Core Task:** Referring Image Segmentation (RIS)
- **Applicable Scenarios:**
  - Language-guided object segmentation
  - Cross-modal reasoning
  - Vision-language alignment
  - Scene understanding with textual queries

---

### 3. Project Introduction

Referring Image Segmentation (RIS) aims to segment the object in an image that a natural language description refers to. The key challenge lies in **fine-grained cross-modal alignment** and **complex reasoning between the visual and linguistic modalities**.

**CMIRNet** proposes a Cross-Modal Interactive Reasoning framework that:

- Introduces interactive reasoning mechanisms between visual and textual features
- Enhances semantic alignment via multi-stage cross-modal fusion
- Incorporates graph-based reasoning to capture complex relationships
- Improves robustness under ambiguous or complex referring expressions
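The interactive-reasoning idea can be illustrated with a minimal language-to-vision cross-attention sketch in PyTorch. This is a hypothetical module for intuition only; the class name, dimensions, and single-block structure are illustrative assumptions, not CMIRNet's actual implementation:

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Sketch: word features attend over flattened visual features.

    Illustrative only -- CMIRNet's real reasoning blocks (multi-stage
    fusion, graph reasoning) are defined in the repository code.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, L, C) word embeddings
        # vis_feats:  (B, H*W, C) flattened visual feature map
        fused, _ = self.attn(query=text_feats, key=vis_feats, value=vis_feats)
        return self.norm(text_feats + fused)  # residual + norm, shape (B, L, C)

block = CrossModalInteraction()
out = block(torch.randn(2, 20, 256), torch.randn(2, 196, 256))
print(tuple(out.shape))  # (2, 20, 256)
```

Each text token gathers the visual regions it is most compatible with, which is the basic operation that multi-stage fusion stacks and refines.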

---

### 4. Training Data Source

The model is trained and evaluated on:

- RefCOCO
- RefCOCO+
- RefCOCOg
- RefCLEF

Image data is based on:

- MS COCO 2014 train set (83K images)

---

## Usage & Basic Inference

### Step 1: Prepare Pre-trained Weights

Download backbone weights:

- ResNet-50
- ResNet-101
- Swin-Transformer-Base
- Swin-Transformer-Large
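Downloaded checkpoints are typically loaded with `torch.load` followed by `load_state_dict`. A self-contained sketch of that flow (the layer names and file name are illustrative, not the repository's actual backbone definition):

```python
import torch
from torch import nn

# Stand-in for a backbone stem; the real backbones are ResNet/Swin models.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# Save then reload a checkpoint, mimicking the download-and-load flow.
torch.save(stem.state_dict(), "backbone_stem.pth")
state = torch.load("backbone_stem.pth", map_location="cpu")

# strict=False reports any key mismatches instead of raising.
missing, unexpected = stem.load_state_dict(state, strict=False)
print(len(missing), len(unexpected))  # 0 0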

---

### Step 2: Dataset Preparation

1. Download the COCO 2014 training images
2. Extract them to:

```
./data/images/
```

3. Download the referring datasets from:

```
https://github.com/lichengunc/refer
```
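A small script can assemble and verify the directory layout the steps above assume. The subdirectory names under `./data/` are an assumption for illustration; check the repository's data-loading code for the exact paths it expects:

```python
from pathlib import Path

# Assumed layout: COCO images plus one folder per referring dataset.
SUBDIRS = ("images", "refcoco", "refcoco+", "refcocog", "refclef")

root = Path("data")
for sub in SUBDIRS:
    (root / sub).mkdir(parents=True, exist_ok=True)

missing = [sub for sub in SUBDIRS if not (root / sub).is_dir()]
print("layout ok" if not missing else f"missing: {missing}")  # layout ok
```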

---

### Step 3: Training

#### ResNet-based Training
```bash
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0

python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+

python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Training
```bash
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0

python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+

python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```

---

### Step 4: Testing / Inference

#### ResNet-based Testing
```bash
python test_resnet.py --device cuda:0 --resume path/to/weights

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcoco+

python test_resnet.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd
```

#### Swin-Transformer-based Testing
```bash
python test_swin.py --device cuda:0 --resume path/to/weights --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcoco+ --window12

python test_swin.py --device cuda:0 --resume path/to/weights --dataset refcocog --splitBy umd --window12
```
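RIS methods are conventionally scored with IoU-style metrics between the predicted and ground-truth masks; the test scripts print the metrics reported in the paper. As a reference, IoU for a single pair of binary masks can be computed as follows (a minimal sketch, not the repository's evaluation code):

```python
import torch

def mask_iou(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Intersection-over-union between two binary segmentation masks."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union else 1.0  # empty-vs-empty counts as a match

pred = torch.zeros(4, 4); pred[:2] = 1  # predicted mask: top two rows
gt = torch.zeros(4, 4);   gt[1:3] = 1   # ground truth: middle two rows
print(round(mask_iou(pred, gt), 4))  # 0.3333
```

Overall IoU accumulates the intersections and unions over the whole test set before dividing, while mean IoU averages per-sample ratios like the one above.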

---

## Limitations & Notes

- For academic research use only
- Performance depends on dataset quality and the clarity of the referring expression
- Accuracy may degrade under:
  - ambiguous language
  - complex scenes
  - domain shift
- Training requires substantial GPU resources

---

## Citation

```bibtex
@ARTICLE{CMIRNet,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2024},
  pages={1-1},
  keywords={Referring Image Segmentation; Vision-Language; Cross Modal Reasoning; Graph Neural Network},
  doi={10.1109/TCSVT.2024.3508752}
}
```

---

## Acknowledgement

This work builds upon advances in:

- Vision-language modeling
- Transformer architectures
- Graph neural networks

---

## Contact

For questions or collaboration, please contact the corresponding author.