# Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
[PyTorch](https://pytorch.org/) | [Python](https://www.python.org/downloads/) | [PapersWithCode: SOTA on MeViS](https://paperswithcode.com/sota/referring-video-object-segmentation-on-mevis?p=decoupling-static-and-hierarchical-motion)
**[📄[arXiv]](https://arxiv.org/abs/2404.03645v1)** **[📄[PDF]](https://openaccess.thecvf.com/content/CVPR2024/papers/He_Decoupling_Static_and_Hierarchical_Motion_Perception_for_Referring_Video_Segmentation_CVPR_2024_paper.pdf)**
This repository contains the code for the **CVPR 2024** paper:
> [Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation](https://arxiv.org/abs/2404.03645v1)
> Shuting He, Henghui Ding
> CVPR 2024
## Installation
Please see [INSTALL.md](https://github.com/henghuiding/MeViS/blob/main/INSTALL.md). Then install the remaining dependencies:
```
pip install -r requirements.txt
python3 -m spacy download en_core_web_sm
```
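Before running inference or training, it can help to confirm the key packages are importable. A minimal sketch (the package list is an assumption based on a typical Detectron2/Mask2Former-style setup; adjust to match `requirements.txt`):

```python
# Check that the expected packages are importable without actually
# importing them (so a single missing package does not abort the check).
import importlib.util

for pkg in ("torch", "detectron2", "spacy", "numpy"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'OK' if found else 'MISSING'}")
```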
## Inference
### 1. Val<sup>u</sup> set
Obtain the output masks on the Val<sup>u</sup> set:
```
python train_net_dshmp.py \
--config-file configs/dshmp_swin_tiny.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```
Obtain the J&F results on the Val<sup>u</sup> set:
```
python tools/eval_mevis.py
```
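For reference, J&F is the standard video-segmentation metric: the average of the region similarity J (mask IoU) and the contour accuracy F (an F-measure over boundary precision/recall). A minimal sketch of the J term, assuming binary NumPy masks; this is an illustration, not the official evaluator:

```python
import numpy as np

def j_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Both masks empty -> perfect agreement by convention.
    return float(inter / union) if union > 0 else 1.0

# Toy example: a 2x2 prediction inside a 3x3 ground-truth region.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt   = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True
print(j_score(pred, gt))  # 4 overlapping pixels / 9 in the union
```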
### 2. Val set
Obtain the output masks on the Val set for [CodaLab](https://codalab.lisn.upsaclay.fr/competitions/15094) online evaluation:
```
python train_net_dshmp.py \
--config-file configs/dshmp_swin_tiny.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir] DATASETS.TEST '("mevis_test",)'
```
## Training
First, download the Mask2Former Swin-Tiny backbone weights (`model_final_86143f.pkl`) and convert them with the provided scripts:
```
wget https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_swin_tiny_bs16_50ep/model_final_86143f.pkl
python tools/process_ckpt.py
python tools/get_refer_id.py
```
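For intuition, a checkpoint-conversion step like this typically loads the pickled Mask2Former weights and remaps or strips parameter keys so they match the new model's module names. A hypothetical sketch of that idea; the prefixes below are illustrative, not the actual ones used by `tools/process_ckpt.py`:

```python
def remap_keys(state: dict, old_prefix: str, new_prefix: str) -> dict:
    """Return a copy of `state` with `old_prefix` swapped for `new_prefix`
    in every matching key; non-matching keys are kept unchanged."""
    return {
        (new_prefix + k[len(old_prefix):]) if k.startswith(old_prefix) else k: v
        for k, v in state.items()
    }

# Toy dict standing in for the real checkpoint's weight mapping.
ckpt = {"backbone.layer1.weight": 1, "sem_seg_head.predictor.bias": 2}
print(remap_keys(ckpt, "sem_seg_head.", "dshmp_head."))
```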
Then start training:
```
python train_net_dshmp.py \
--config-file configs/dshmp_swin_tiny.yaml \
--num-gpus 8 --dist-url auto \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
```
Note: We train on 8 NVIDIA RTX 3090 GPUs with 1 sample per GPU; training takes about 17 hours.
## Models
☁️ [Google Drive](https://drive.google.com/file/d/1YLnRUsANuPVfLo1jrgK05EGUJglrwA9H/view?usp=drive_link)
## Acknowledgement
This project is based on [MeViS](https://github.com/henghuiding/MeViS). Many thanks to the authors for their great work!
## BibTeX
Please consider citing DsHmp if it helps your research.
```bibtex
@inproceedings{DsHmp,
title={Decoupling static and hierarchical motion perception for referring video segmentation},
author={He, Shuting and Ding, Henghui},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13332--13341},
year={2024}
}
```