# SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
[![arXiv](https://img.shields.io/badge/Paper-SimToken-red?logo=arXiv)](https://arxiv.org/abs/2509.17537)

---
## 📰 News

[//]: # (🔥**2026.1.18**: Code are released now!)

🔥**2026.1.18**: Our paper got accepted to **ICASSP 2026**! Thanks to all co-authors and the anonymous reviewers🎉🎉

---
## ⚙️ Setup

### Datasets

Download the official Ref-AVSBench dataset from [here](https://github.com/GeWu-Lab/Ref-AVS) and organize the dataset as follows:
```
./REFAVS/data 
    - /media 
    - /gt_mask 
    - /metadata.csv 
```
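A quick way to catch a misplaced download is to verify the layout before running anything. The helper below is our own illustrative sketch (not part of the repo) and only checks that the three entries listed above exist:

```python
# Hypothetical sanity check for the Ref-AVSBench layout shown above.
import os

def check_dataset_layout(root):
    """Return the list of expected entries missing under `root`."""
    expected = ["media", "gt_mask", "metadata.csv"]
    return [name for name in expected if not os.path.exists(os.path.join(root, name))]
```

For example, `check_dataset_layout("./REFAVS/data")` returns an empty list when the dataset is organized correctly.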

### Pretrained Backbones
Download the SAM checkpoint `sam_vit_h_4b8939.pth` and place it in `./models/segment_anything`.
### Checkpoints
Download our pretrained **[SimToken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)** checkpoint.
### Core Requirements
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution:

- `numpy`, `pandas`, `matplotlib`, `opencv`
- `einops`, `timm`
- `sentencepiece`
- `transformers`, `peft`

Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts (e.g., in custom model/config registration) that can trigger runtime errors in this project. To avoid such compatibility issues, we recommend **not using overly recent versions** and instead pinning the two packages to the versions used during our development:

- `transformers==4.30.2`
- `peft==0.2.0`

We also provide a complete `requirements.txt` for reference and easier reproduction:
```
pip install -r requirements.txt
```
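Because version drift is the most common source of breakage here, a small check at startup can save a confusing stack trace later. This is our own sketch, not part of the repo; it assumes the packages expose `__version__` and silently skips ones that are not installed:

```python
# Minimal sketch: report installed versions that drift from the pins above.
import importlib

PINS = {"transformers": "4.30.2", "peft": "0.2.0"}

def version_mismatches(pins=PINS):
    """Return {package: installed_version} for pinned packages that mismatch."""
    bad = {}
    for name, wanted in pins.items():
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue  # not installed; pip will handle it
        installed = getattr(mod, "__version__", None)
        if installed != wanted:
            bad[name] = installed
    return bad
```

Calling `version_mismatches()` after installation returns an empty dict when the environment matches the pins.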
---
## 📌 Getting Started
### Preparation
We recommend running the following scripts to pre-extract audio features and SAM-compatible visual features:
```
python save_audio_feats.py --data_dir 'path/to/data'
python save_sam_feats.py --data_dir 'path/to/data'
```
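Pre-extracting features trades disk space for much faster training, since the audio and SAM encoders are not re-run every epoch. As a hypothetical illustration of consuming such a cache (the actual file names and formats used by the scripts above may differ), assuming one `.npy` array is saved per clip:

```python
# Illustrative helper: iterate over pre-extracted per-clip feature files.
# The directory layout and file naming here are assumptions, not the repo's spec.
import os
import numpy as np

def iter_features(feat_dir):
    """Yield (clip_id, array) pairs for every .npy file in `feat_dir`."""
    for fname in sorted(os.listdir(feat_dir)):
        if fname.endswith(".npy"):
            clip_id = os.path.splitext(fname)[0]
            yield clip_id, np.load(os.path.join(feat_dir, fname))
```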
### Train 
To train our model on Ref-AVS Bench:
```
python -W ignore train.py --name 'xxx' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --log_root 'path/to/log_root' \
    --checkpoint_root 'path/to/checkpoints_root'
```
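For readers scripting experiments on top of `train.py`, the flags above could be parsed as sketched below. This is our own illustration of the command-line interface shown in the snippet; the real `train.py` likely defines additional options and defaults:

```python
# Sketch of an argument parser matching the training command above.
# Defaults mirror the values used in the example invocation.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="SimToken training (illustrative sketch)")
    p.add_argument("--name", required=True, help="experiment name")
    p.add_argument("--vision_pretrained", required=True, help="path to SAM checkpoint")
    p.add_argument("--vision_tower", default="openai/clip-vit-large-patch14")
    p.add_argument("--mllm", default="Chat-UniVi/Chat-UniVi-7B-v1.5")
    p.add_argument("--data_dir", required=True)
    p.add_argument("--log_root", required=True)
    p.add_argument("--checkpoint_root", required=True)
    return p
```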
### Test
To test our pretrained SimToken model:
```
python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
    --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
    --vision_tower 'openai/clip-vit-large-patch14' \
    --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
    --data_dir 'path/to/data' \
    --visualization_root 'path/to/visualization_root'
```
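The `--visualization_root` flag writes qualitative results. If you want to render predicted masks yourself, a standard overlay like the one below works; this is a generic sketch of mask visualization, not the repo's actual rendering code:

```python
# Hypothetical visualization helper: blend a solid color into an RGB frame
# wherever a binary segmentation mask is set.
import numpy as np

def overlay_mask(frame, mask, color=(255, 0, 0), alpha=0.5):
    """Return `frame` with `color` alpha-blended over nonzero `mask` pixels."""
    out = frame.astype(np.float32).copy()
    sel = mask.astype(bool)
    out[sel] = (1.0 - alpha) * out[sel] + alpha * np.asarray(color, dtype=np.float32)
    return out.astype(np.uint8)
```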