Installation

The project is based on Python and PyTorch. We usually run experiments with multi-GPU training.

Tested runtime:

  • Python 3.12.3
  • PyTorch 2.8.0+cu128

📥 Clone the Git repo

$ git clone https://github.com/yyliu01/AuralSAM2
$ cd AuralSAM2

🧩 Install dependencies

  1. create conda env from yaml
$ conda env create -f docs/auralsam2.yml
  2. activate env
$ conda activate auralsam2
  3. install PyTorch (recommended: match tested runtime)
# CUDA 12.8 (tested):
$ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  4. install python packages (if needed)
$ pip install -r docs/requirements.txt
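
After the install steps, it can help to confirm that the environment matches the tested runtime. The snippet below is a minimal sketch, not part of the repository: the function name `report_runtime` is hypothetical, and it reads the installed `torch` version from package metadata so that it still runs when PyTorch is missing.

```python
import platform
from importlib import metadata

# Versions listed under "Tested runtime" in this document.
TESTED_PYTHON = "3.12.3"
TESTED_TORCH = "2.8.0+cu128"


def report_runtime() -> dict:
    """Collect the installed Python and PyTorch versions without importing torch."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # torch is not installed in this environment
    return {"python": platform.python_version(), "torch": torch_version}


if __name__ == "__main__":
    found = report_runtime()
    print(f"python: {found['python']} (tested: {TESTED_PYTHON})")
    print(f"torch:  {found['torch']} (tested: {TESTED_TORCH})")
```

A mismatch does not necessarily mean the setup is broken; it only tells you that you are outside the combination the authors report testing.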

πŸ—‚οΈ Prepare dataset

AVSBench (avs.code)

  1. download and prepare AVSBench under repository root.
  2. ensure the dataset root path is:
    • AVSBench/
    • AVSBench/avss_index/metadata.csv (and subset folders v1s/, v1m/, v2/)

Ref-AVS (ref-avs.code)

  1. download and prepare the Ref-AVS (REFAVS) dataset under repository root.
  2. ensure the dataset root path is:
    • REFAVS/
    • REFAVS/metadata.csv (splits: train, test_s, test_u, test_n)
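
To check that the prepared metadata covers the expected splits, one option is to count rows per split. This is a sketch under a stated assumption: the header name `split` is a guess about the CSV layout, not something this document specifies, so pass the real column name if it differs. The function name `count_splits` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_splits(metadata_csv: Path, split_column: str = "split") -> Counter:
    """Count rows per split value; `split_column` is an assumed header name."""
    with open(metadata_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if split_column not in (reader.fieldnames or []):
            raise KeyError(f"column {split_column!r} not found in {metadata_csv}")
        return Counter(row[split_column] for row in reader)
```

Calling `count_splits(Path("REFAVS/metadata.csv"))` from the repository root should then list train, test_s, test_u, and test_n if the assumed column name is right.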

Checkpoints (shared)

Prepare under repository root:

  • ckpts/sam_ckpts/sam2_hiera_large.pt
  • ckpts/vggish-10086976.pth
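
The dataset and checkpoint locations listed above can be verified before launching a run. The following is a minimal sketch, assuming it is executed from the repository root; `REQUIRED_PATHS` and `missing_paths` are hypothetical names, and the list contains only the paths named in this document.

```python
from pathlib import Path

# Relative paths named in the dataset and checkpoint sections of this document.
REQUIRED_PATHS = [
    "AVSBench/avss_index/metadata.csv",
    "AVSBench/v1s",
    "AVSBench/v1m",
    "AVSBench/v2",
    "REFAVS/metadata.csv",
    "ckpts/sam_ckpts/sam2_hiera_large.pt",
    "ckpts/vggish-10086976.pth",
]


def missing_paths(repo_root: Path) -> list[str]:
    """Return the required relative paths that do not exist under repo_root."""
    return [rel for rel in REQUIRED_PATHS if not (repo_root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(Path.cwd())
    for rel in missing:
        print(f"missing: {rel}")
    if not missing:
        print("all required paths found")
```

The check only tests existence; it does not validate file contents or checkpoint integrity.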

πŸ—οΈ Workspace structure

AuralSAM2/
├── avs.code/
│   ├── v1s.code/
│   ├── v1m.code/
│   └── v2.code/
├── ref-avs.code/
├── scripts/
│   ├── run_avs_train.sh
│   └── run_ref_train.sh
├── AVSBench/
│   ├── avss_index
│   │   ├── metadata.csv
│   │   ├── metadata_v1m_man.csv
│   │   └── metadata_v2_man.csv
│   ├── v1m
│   │   ├── 01uIJMwnUvA_0
│   │   ├── 0WxgIKuetYI_0
│   │   ... (419 more)
│   ├── v1s
│   │   ├── --FenyW2i_4_5000_10000
│   │   ├── --ZHUMfueO0_5000_10000
│   │   ... (4927 more)
│   └── v2
│       ├── --KCIeTv6PM_14000_24000
│       ├── --iSerV5DbY_68000_78000
│       ... (5995 more)
├── REFAVS/
│   ├── gt_mask
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4000 more)
│   ├── media
│   │   ├── --KCIeTv6PM_14000_24000
│   │   ├── --iSerV5DbY_68000_78000
│   │   ... (~4300 more)
│   └── metadata.csv
├── ckpts/
│   ├── sam_ckpts/
│   │   └── sam2_hiera_large.pt
│   └── vggish-10086976.pth
└── docs/
    ├── installation.md
    ├── before_start.md
    ├── requirements.txt
    └── auralsam2.yml

πŸ“ Notes

  • use docs/before_start.md for training and inference commands.
  • if wandb is not needed, disable online logging in your config.
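
As an alternative to editing the config, the wandb library itself reads the `WANDB_MODE` environment variable. The sketch below sets it from Python before any run is initialised. Whether this repository's training scripts pass an explicit mode that would override the variable is not stated here, so treat this as an assumption to verify; `disable_wandb_online` is a hypothetical helper name.

```python
import os


def disable_wandb_online(mode: str = "offline") -> None:
    """Set WANDB_MODE so wandb does not sync to the server.

    "offline" keeps logs on local disk; "disabled" turns logging into a no-op.
    Must run before wandb.init() is called.
    """
    if mode not in {"offline", "disabled"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    os.environ["WANDB_MODE"] = mode


if __name__ == "__main__":
    disable_wandb_online()
    print(f"WANDB_MODE={os.environ['WANDB_MODE']}")
```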