| # SimToken: A Simple Baseline for Referring Audio-Visual Segmentation |
[arXiv](https://arxiv.org/abs/2509.17537)
|
|
| --- |
| ## 📰 News |
|
|
| [//]: # (🔥**2026.1.18**: Code are released now!) |
|
|
🔥**2026.1.18**: Our paper was accepted to **ICASSP 2026**! Thanks to all co-authors and the anonymous reviewers 🎉🎉
|
|
| --- |
| ## ⚙️ Setup |
|
|
| ### Datasets |
|
|
| Download the official Ref-AVSBench dataset from [here](https://github.com/GeWu-Lab/Ref-AVS) and organize the dataset as follows: |
| ``` |
| ./REFAVS/data |
| - /media |
| - /gt_mask |
| - /metadata.csv |
| ``` |
|
|
| ### Pretrained Backbones |
Download `sam_vit_h_4b8939.pth` (the SAM ViT-H checkpoint) and place it in `./models/segment_anything`.
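The checkpoint can also be fetched programmatically. A minimal sketch: the URL below is the ViT-H link published in the official `facebookresearch/segment-anything` README, and `fetch_sam_checkpoint` is a hypothetical helper, not part of this repo:

```python
import os
import urllib.request

# Assumption: official SAM ViT-H checkpoint URL from the
# facebookresearch/segment-anything repository README.
SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"

def fetch_sam_checkpoint(url=SAM_URL, target_dir="./models/segment_anything"):
    """Download the SAM checkpoint into target_dir unless already present."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)  # large download (~2.4 GB)
    return target
```

Calling `fetch_sam_checkpoint()` once creates the directory and downloads the file only if it is missing, so repeated calls are cheap.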
| ### Checkpoints |
Download our pretrained **[SimToken](https://drive.google.com/file/d/1pargYfFy93rymCANuWV0nt6Lx3Ri406l/view?usp=sharing)** checkpoint.
| ### Core Requirements |
| This project depends on a small set of core packages. The configuration below has been tested and is recommended for stable execution. |
- `numpy`, `pandas`, `matplotlib`, `opencv-python`
| - `einops`, `timm` |
| - `sentencepiece` |
- `transformers`, `peft`

Newer versions of `transformers` and `peft` may introduce API changes or naming/registration conflicts that can trigger runtime errors in this project (e.g., in custom model/config registration).
To avoid such compatibility issues, we recommend **not using overly recent versions** and pinning the two packages to the versions used during our development:

- `transformers==4.30.2`
- `peft==0.2.0`

We also provide a complete `requirements.txt` for reference and easier reproduction:
| ``` |
| pip install -r requirements.txt |
| ``` |
| --- |
| ## 📌 Getting Started |
| ### Preparation |
We recommend running the following scripts to pre-extract audio features and SAM-compatible visual features:
| ``` |
| python save_audio_feats.py --data_dir 'path/to/data' |
| python save_sam_feats.py --data_dir 'path/to/data' |
| ``` |
| ### Train |
| To train our model on Ref-AVS Bench: |
| ``` |
| python -W ignore train.py --name 'xxx' \ |
| --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \ |
| --vision_tower 'openai/clip-vit-large-patch14' \ |
| --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \ |
--data_dir 'path/to/data' \
--log_root 'path/to/log_root' \
| --checkpoint_root 'path/to/checkpoints_root' |
| ``` |
| ### Test |
To test with our pretrained SimToken checkpoint:
| ``` |
| python -W ignore load_model.py --saved_model 'path/to/checkpoint.pth' \ |
| --vision_pretrained 'path/to/segment_anything/sam_vit_h_4b8939.pth' \ |
| --vision_tower 'openai/clip-vit-large-patch14' \ |
| --mllm 'Chat-UniVi/Chat-UniVi-7B-v1.5' \ |
| --data_dir 'path/to/data' \ |
| --visualization_root 'path/to/visualization_root' |
| ``` |