Installation

The project is based on Python and PyTorch. We usually run experiments with multi-GPU training.

Tested runtime:

Python 3.12.3
PyTorch 2.8.0+cu128

📥 Clone the Git repo

$ git clone https://github.com/yyliu01/AuralSAM2
$ cd AuralSAM2

🧩 Install dependencies

Create the conda environment from the YAML file:

$ conda env create -f docs/auralsam2.yml

Activate the environment:

$ conda activate auralsam2

Install PyTorch (recommended: match the tested runtime):

# CUDA 12.8 (tested):
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Install the remaining Python packages (if needed):

$ pip install -r docs/requirements.txt
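Once the dependencies are installed, it can help to compare the active environment against the tested runtime listed above. The following is a minimal sketch using only the standard library; the helper names `strip_local` and `runtime_notes` are illustrative and not part of the repository.

```python
import sys
from importlib import metadata

# Versions copied from the "Tested runtime" list above.
TESTED_PYTHON = (3, 12)
TESTED_TORCH = "2.8.0"


def strip_local(version: str) -> str:
    """Drop a local build tag such as '+cu128' from a version string."""
    return version.split("+", 1)[0]


def runtime_notes() -> list:
    """Return human-readable notes about differences from the tested runtime."""
    notes = []
    if sys.version_info[:2] != TESTED_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        notes.append(f"Python {found} differs from tested 3.12.x")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        notes.append("torch is not installed in this environment")
    else:
        if strip_local(torch_version) != TESTED_TORCH:
            notes.append(f"torch {torch_version} differs from tested {TESTED_TORCH}")
    return notes


if __name__ == "__main__":
    for note in runtime_notes() or ["environment matches the tested runtime"]:
        print(note)
```

A mismatch is not necessarily a problem; the notes only flag where the environment departs from the configuration that was tested.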

🗂️ Prepare dataset

AVSBench (`avs.code`)

Download and prepare AVSBench under the repository root.
Ensure the following paths exist:
- AVSBench/
- AVSBench/avss_index/metadata.csv (and subset folders v1s/, v1m/, v2/)
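The layout above can be checked before training with a short script. This is a sketch only: the function name `missing_avsbench_entries` is hypothetical, and it checks just the entries named in this section, not the contents of each subset.

```python
from pathlib import Path

# Entries taken from the layout described above, relative to AVSBench/.
EXPECTED_ENTRIES = [
    "avss_index/metadata.csv",
    "v1s",
    "v1m",
    "v2",
]


def missing_avsbench_entries(root: Path) -> list:
    """Return the expected entries that are absent under `root`."""
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_avsbench_entries(Path("AVSBench"))
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("AVSBench layout looks complete")
```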

Ref-AVS (`ref-avs.code`)

Download and prepare the Ref-AVS (REFAVS) dataset under the repository root.
Ensure the following paths exist:
- REFAVS/
- REFAVS/metadata.csv (splits: train, test_s, test_u, test_n)
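To confirm that all four splits are present in the metadata file, the rows can be counted per split. The sketch below assumes the CSV has a header row with a column named `split`; that column name is a guess, not something this document states, so adjust `split_column` to match the actual file.

```python
import csv
from collections import Counter
from pathlib import Path

# Split names taken from the list above.
EXPECTED_SPLITS = ("train", "test_s", "test_u", "test_n")


def count_splits(metadata_csv: Path, split_column: str = "split") -> Counter:
    """Count rows per split. `split_column` is an assumed column name."""
    with metadata_csv.open(newline="", encoding="utf-8") as handle:
        return Counter(row[split_column] for row in csv.DictReader(handle))


def absent_splits(counts: Counter) -> list:
    """Return the expected split names that have no rows."""
    return [name for name in EXPECTED_SPLITS if counts[name] == 0]


if __name__ == "__main__":
    path = Path("REFAVS/metadata.csv")
    if path.is_file():
        counts = count_splits(path)
        print(dict(counts), "absent:", absent_splits(counts))
    else:
        print(f"{path} not found")
```

If the assumed column does not exist, `count_splits` raises a `KeyError`, which is a signal to inspect the CSV header.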

Checkpoints (shared)

Place the following checkpoints under the repository root:

ckpts/sam_ckpts/sam2_hiera_large.pt
ckpts/vggish-10086976.pth
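A quick way to catch a missing or truncated download is to list each checkpoint with its size. This is a minimal sketch; `checkpoint_status` is a hypothetical helper, and it does not verify file contents or checksums.

```python
from pathlib import Path
from typing import Dict, Optional

# Paths copied from the list above, relative to the repository root.
CHECKPOINTS = [
    "ckpts/sam_ckpts/sam2_hiera_large.pt",
    "ckpts/vggish-10086976.pth",
]


def checkpoint_status(repo_root: Path) -> Dict[str, Optional[int]]:
    """Map each checkpoint path to its size in bytes, or None if it is missing."""
    status = {}
    for rel in CHECKPOINTS:
        path = repo_root / rel
        status[rel] = path.stat().st_size if path.is_file() else None
    return status


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for rel, size in checkpoint_status(Path(".")).items():
        print(f"{rel}: {'MISSING' if size is None else f'{size} bytes'}")
```

A size of 0 bytes usually points to an interrupted download.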

🏗️ Workspace structure

AuralSAM2/
├── avs.code/
│   ├── v1s.code/
│   ├── v1m.code/
│   └── v2.code/
├── ref-avs.code/
├── scripts/
│   ├── run_avs_train.sh
│   └── run_ref_train.sh
├── AVSBench/
│   ├── avss_index
│   │   ├── metadata.csv
│   │   ├── metadata_v1m_man.csv
│   │   └── metadata_v2_man.csv
│   ├── v1m
│   │   ├── 01uIJMwnUvA_0
│   │   ├── 0WxgIKuetYI_0
│   │   ... (419 more)
│   ├── v1s
│   │   ├── --FenyW2i_4_5000_10000
│   │   ├── --ZHUMfueO0_5000_10000
│   │   ... (4927 more)
│   └── v2
│       ├── --KCIeTv6PM_14000_24000
│       ├── --iSerV5DbY_68000_78000
│       ... (5995 more)
├── REFAVS/
│   ├── gt_mask
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4000 more)
│   ├── media
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4300 more)
│   └── metadata.csv
├── ckpts/
│   ├── sam_ckpts/
│   │   └── sam2_hiera_large.pt
│   └── vggish-10086976.pth
└── docs/
    ├── installation.md
    ├── before_start.md
    ├── requirements.txt
    └── auralsam2.yml

📝 Notes

See docs/before_start.md for training and inference commands.
If wandb is not needed, disable online logging in your config.
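Independently of the project config, the wandb client also honors the `WANDB_MODE` environment variable. The sketch below sets it from Python before any run is started; the helper `use_offline_wandb` is illustrative, and whether the training scripts need any further change is not covered here.

```python
import os


def use_offline_wandb(env=None) -> str:
    """Set WANDB_MODE to 'offline' unless a mode is already chosen.

    'offline' keeps run data on local disk; 'disabled' turns logging into a no-op.
    """
    env = os.environ if env is None else env
    env.setdefault("WANDB_MODE", "offline")
    return env["WANDB_MODE"]


if __name__ == "__main__":
    # Must run before wandb.init() is called for the setting to take effect.
    print("WANDB_MODE =", use_offline_wandb())
```

Using `setdefault` means a mode already exported in the shell, such as `WANDB_MODE=disabled`, is left untouched.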
