| # Installation |
|
|
| The project is based on Python and PyTorch. We usually run experiments with multi-GPU training. |
|
|
| Tested runtime: |
| - Python `3.12.3` |
| - PyTorch `2.8.0+cu128` |
|
|
| ## 📥 Clone the Git repo |
|
|
| ```shell |
| $ git clone https://github.com/yyliu01/AuralSAM2 |
| $ cd AuralSAM2 |
| ``` |
|
|
| ## 🧩 Install dependencies |
|
|
| 1) Create the conda environment from the YAML file: |
| ```shell |
| $ conda env create -f docs/auralsam2.yml |
| ``` |
|
|
| 2) Activate the environment: |
| ```shell |
| $ conda activate auralsam2 |
| ``` |
|
|
| 3) Install PyTorch (recommended: match the tested runtime): |
| ```shell |
| # CUDA 12.8 (tested): |
| $ pip install torch==2.8.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 |
| ``` |
|
|
| 4) Install the Python packages (if needed): |
| ```shell |
| $ pip install -r docs/requirements.txt |
| ``` |
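Once the steps above are done, a quick check can confirm that the active environment matches the tested runtime. The sketch below is illustrative only: `report_version` and `check_runtime` are hypothetical helper names, not part of the repository, and it assumes that `python` on your `PATH` is the interpreter from the `auralsam2` environment.

```shell
# Tested versions copied from the "Tested runtime" list in this guide.
TESTED_PYTHON="3.12.3"
TESTED_TORCH="2.8.0+cu128"

# Hypothetical helper: compare an installed version with the tested one.
# Arguments: <name> <installed> <tested>. Returns 1 on a mismatch.
report_version() {
  if [ "$2" = "$3" ]; then
    echo "ok: $1 $2"
  else
    echo "warning: $1 is $2, tested with $3"
    return 1
  fi
}

# Hypothetical helper: query the active environment and report both versions.
# Run it after `conda activate auralsam2`.
check_runtime() {
  py_ver="$(python -c 'import platform; print(platform.python_version())')"
  torch_ver="$(python -c 'import torch; print(torch.__version__)')"
  report_version "Python" "$py_ver" "$TESTED_PYTHON"
  report_version "PyTorch" "$torch_ver" "$TESTED_TORCH"
}
```

After sourcing the snippet, run `check_runtime`. A warning does not necessarily mean the setup is broken; it only flags a difference from the versions this guide was tested with.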
|
|
| ## Prepare dataset |
|
|
| ### AVSBench (`avs.code`) |
|
|
| 1) Download and prepare AVSBench under the repository root. |
| 2) Ensure the following paths exist: |
| - `AVSBench/` (dataset root, with subset folders `v1s/`, `v1m/`, `v2/`) |
| - `AVSBench/avss_index/metadata.csv` |
|
|
| ### Ref-AVS (`ref-avs.code`) |
|
|
| 1) Download and prepare the Ref-AVS (REFAVS) dataset under the repository root. |
| 2) Ensure the following paths exist: |
| - `REFAVS/` (dataset root) |
| - `REFAVS/metadata.csv` (splits: `train`, `test_s`, `test_u`, `test_n`) |
|
|
|
|
| ### Checkpoints (shared) |
|
|
| Place the following checkpoints under the repository root: |
|
|
| - `ckpts/sam_ckpts/sam2_hiera_large.pt` |
| - `ckpts/vggish-10086976.pth` |
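Before launching training, it can help to confirm that the datasets and checkpoints are where this guide expects them. The following is a minimal sketch, not a script shipped with the repository: `REQUIRED_PATHS` and `check_workspace` are hypothetical names, and the list contains only the paths named in the sections above.

```shell
# Paths named in this guide, relative to the repository root.
REQUIRED_PATHS="
AVSBench/avss_index/metadata.csv
AVSBench/v1s
AVSBench/v1m
AVSBench/v2
REFAVS/metadata.csv
ckpts/sam_ckpts/sam2_hiera_large.pt
ckpts/vggish-10086976.pth
"

# Hypothetical helper: print every required path that is missing under the
# given root (default: current directory).
# Returns 0 when all paths exist and 1 otherwise.
check_workspace() {
  root="${1:-.}"
  status=0
  for rel in $REQUIRED_PATHS; do
    if [ ! -e "$root/$rel" ]; then
      echo "missing: $rel"
      status=1
    fi
  done
  return "$status"
}
```

Run `check_workspace` from inside `AuralSAM2/`, or pass the repository root as the first argument. It only checks that the paths exist; it does not validate file contents.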
|
|
| ## Workspace structure |
|
|
| ```shell |
| AuralSAM2/ |
| ├── avs.code/ |
| │   ├── v1s.code/ |
| │   ├── v1m.code/ |
| │   └── v2.code/ |
| ├── ref-avs.code/ |
| ├── scripts/ |
| │   ├── run_avs_train.sh |
| │   └── run_ref_train.sh |
| ├── AVSBench/ |
| │   ├── avss_index |
| │   │   ├── metadata.csv |
| │   │   ├── metadata_v1m_man.csv |
| │   │   └── metadata_v2_man.csv |
| │   ├── v1m |
| │   │   ├── 01uIJMwnUvA_0 |
| │   │   ├── 0WxgIKuetYI_0 |
| │   │   ... (419 more) |
| │   ├── v1s |
| │   │   ├── --FenyW2i_4_5000_10000 |
| │   │   ├── --ZHUMfueO0_5000_10000 |
| │   │   ... (4927 more) |
| │   └── v2 |
| │       ├── --KCIeTv6PM_14000_24000 |
| │       ├── --iSerV5DbY_68000_78000 |
| │       ... (5995 more) |
| ├── REFAVS/ |
| │   ├── gt_mask |
| │   │   ├── --KCIeTv6PM_14000_24000 |
| │   │   ├── --iSerV5DbY_68000_78000 |
| │   │   ... (~4000 more) |
| │   ├── media |
| │   │   ├── --KCIeTv6PM_14000_24000 |
| │   │   ├── --iSerV5DbY_68000_78000 |
| │   │   ... (~4300 more) |
| │   └── metadata.csv |
| ├── ckpts/ |
| │   ├── sam_ckpts/ |
| │   │   └── sam2_hiera_large.pt |
| │   └── vggish-10086976.pth |
| └── docs/ |
|     ├── installation.md |
|     ├── before_start.md |
|     ├── requirements.txt |
|     └── auralsam2.yml |
| ``` |
|
|
| ## Notes |
|
|
| - See `docs/before_start.md` for training and inference commands. |
| - If wandb is not needed, disable online logging in your config. |
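As an alternative to editing the config, wandb itself reads the `WANDB_MODE` environment variable. The sketch below assumes the training scripts do not override the wandb mode in code; if they do, the config setting mentioned above takes precedence.

```shell
# Assumption: the training scripts initialise wandb without forcing a mode,
# so wandb's standard environment variable takes effect.
export WANDB_MODE=offline    # keep logs on local disk; nothing is uploaded
# export WANDB_MODE=disabled # alternative: turn wandb calls into no-ops
```

Set the variable in the same shell session that launches training so the process inherits it.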
|
|