# SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
[arXiv](https://arxiv.org/abs/2509.17537)
---
## 📰 News
[//]: # (🔥**2026.1.18**: Code are released now!)
🔥**2026.1.18**: Our paper was accepted to **ICASSP 2026**! Thanks to all co-authors and the anonymous reviewers🎉🎉
---
## ⚙️ Setup
### Datasets
Download the official Ref-AVSBench dataset from [here](https://github.com/GeWu-Lab/Ref-AVS) and organize the dataset as follows:
```
./REFAVS/data
- /media
- /gt_mask
- /metadata.csv
```
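As a quick sanity check, a small helper like the following can confirm the expected files are in place (the `check_dataset_layout` function is illustrative, not part of this repo; the names match the directory tree above):

```python
from pathlib import Path

def check_dataset_layout(root: str) -> list[str]:
    """Return entries from the expected Ref-AVSBench layout that are missing.

    The expected names (media/, gt_mask/, metadata.csv) follow the
    directory tree shown above.
    """
    expected = ["media", "gt_mask", "metadata.csv"]
    base = Path(root)
    return [name for name in expected if not (base / name).exists()]

# An empty list means the dataset is organized as expected.
print(check_dataset_layout("./REFAVS/data"))
```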
### Pretrained Backbones
Download the SAM ViT-H checkpoint `sam_vit_h_4b8939.pth` and place it in `./models/segment_anything`.
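The checkpoint is distributed with the official Segment Anything release, so one way to fetch it (assuming `wget` is available) is:

```shell
# Create the expected directory and download the SAM ViT-H checkpoint
# (~2.4 GB) from the official segment-anything release.
mkdir -p ./models/segment_anything
wget -P ./models/segment_anything \
  https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```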
### Checkpoints
Download our pretrained **[SimToken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)** checkpoint.
### Core Requirements
This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution.
- `numpy`, `pandas`, `matplotlib`, `opencv-python`
- `einops`, `timm`
- `sentencepiece`
- `transformers`, `peft`
Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., in custom model/config registration).
To avoid such compatibility issues, we recommend **not using overly recent versions** and pinning these two packages to the versions used during our development:
- `transformers==4.30.2`
- `peft==0.2.0`
We also provide a complete `requirements.txt` for reference and easier reproduction:
```
pip install -r requirements.txt
```
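Because version mismatches only surface at runtime, a quick startup check can catch them early. A minimal sketch (the `check_pin` helper is hypothetical; the pins come from the list above):

```python
from importlib import metadata

def check_pin(package: str, pinned: str) -> bool:
    """Return True if `package` is installed at exactly the pinned version."""
    try:
        return metadata.version(package) == pinned
    except metadata.PackageNotFoundError:
        return False

# Pins taken from the compatibility note above.
PINS = {"transformers": "4.30.2", "peft": "0.2.0"}
mismatched = [pkg for pkg, ver in PINS.items() if not check_pin(pkg, ver)]
if mismatched:
    print(f"Warning: version mismatch or missing package: {mismatched}")
```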
---
## 📌 Getting Started
### Preparation
We recommend running the following code to pre-extract audio features and visual features compatible with SAM:
```
python save_audio_feats.py --data_dir 'path/to/data'
python save_sam_feats.py --data_dir 'path/to/data'
```
### Train
To train our model on Ref-AVS Bench:
```
python -W ignore train.py --name 'xxx' \
--vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
--vision_tower 'openai/clip-vit-large-patch14' \
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
--data_dir 'path/to/data' \
--log_root 'path/to/log_root' \
--checkpoint_root 'path/to/checkpoints_root'
```
### Test
To evaluate with our pretrained SimToken checkpoint:
```
python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \
--vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \
--vision_tower 'openai/clip-vit-large-patch14' \
--mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \
--data_dir 'path/to/data' \
--visualization_root 'path/to/visualization_root'
``` |