# Evaluation

This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.

## Setup

Before running evaluation, ensure you have:

1. Installed the SAM-Audio package and its dependencies
2. Authenticated with Hugging Face to access the model checkpoints (see the main [README](../README.md))
## Quick Start

Run evaluation on the default setting (`instr-pro`):

```bash
python main.py
```

You can also use multiple GPUs to speed up evaluation. Note that `torchrun` takes the script path directly, without a `python` prefix:

```bash
torchrun --nproc_per_node=<ngpus> main.py
```
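When a script is launched through `torchrun`, each worker process receives `RANK` and `WORLD_SIZE` environment variables. The sketch below illustrates one way an evaluation script could split samples across workers using a round-robin scheme. It is an assumption for illustration, not the logic used in `main.py`, and `shard_indices` is a hypothetical helper.

```python
import os


def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Return the sample indices handled by one worker (round-robin split)."""
    return list(range(rank, num_samples, world_size))


# torchrun sets RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Hypothetical dataset of 10 samples, used only to show the split.
my_indices = shard_indices(10, rank, world_size)
print(f"worker {rank} of {world_size} evaluates samples {my_indices}")
```

With this kind of split, every sample is assigned to exactly one worker, so per-worker results can be merged afterwards without duplicates.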
Evaluate on a specific setting:

```bash
python main.py --setting sfx
```

Evaluate on multiple settings:

```bash
python main.py --setting sfx speech music
```
## Available Evaluation Settings

Run `python main.py --help` to see all available settings.
## Command Line Options

```bash
python main.py [OPTIONS]
```

### Options:

- `-s, --setting` - Which setting(s) to evaluate (default: `instr-pro`)
  - Choices: See available settings above
  - Can specify multiple settings: `--setting sfx speech music`
- `--cache-path` - Where to cache downloaded datasets (default: `~/.cache/sam_audio`)
- `-p, --checkpoint-path` - Model checkpoint to evaluate (default: `facebook/sam-audio-1b`)
  - Can use local path or Hugging Face model ID
- `-b, --batch-size` - Batch size for evaluation (default: `1`)
- `-w, --num-workers` - Number of data loading workers (default: `4`)
- `-c, --candidates` - Number of reranking candidates (default: `8`)
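To make the option list concrete, here is a minimal `argparse` sketch that mirrors the flags and defaults listed above. It is an illustration only, not the actual parser in `main.py`: the argument types, the `nargs="+"` handling of `--setting`, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Evaluation options (sketch)")
    # One or more settings may be given after a single --setting flag.
    parser.add_argument("-s", "--setting", nargs="+", default=["instr-pro"])
    parser.add_argument("--cache-path", default="~/.cache/sam_audio")
    # Either a local path or a Hugging Face model ID.
    parser.add_argument("-p", "--checkpoint-path", default="facebook/sam-audio-1b")
    parser.add_argument("-b", "--batch-size", type=int, default=1)
    parser.add_argument("-w", "--num-workers", type=int, default=4)
    parser.add_argument("-c", "--candidates", type=int, default=8)
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--setting", "sfx", "speech", "-b", "2"])
print(args.setting, args.batch_size, args.candidates)
```

Options that are not passed, such as `--candidates` here, keep the defaults listed above.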
## Evaluation Metrics

The evaluation framework computes the following metrics:

- **Judge** - SAM Audio Judge quality assessment metric
- **Aesthetic** - Aesthetic quality metric
- **CLAP** - Audio-text alignment metric (CLAP similarity)
- **ImageBind** - Audio-video alignment metric (for visual settings only)
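Alignment metrics such as CLAP similarity are commonly computed as the cosine similarity between two embeddings, for example an audio embedding and a text embedding. The sketch below shows only that final step with made-up vectors. In practice the embeddings would come from the CLAP model, and whether the evaluation code applies any extra scaling is not stated here, so treat this as an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings for illustration only; real ones are model outputs.
audio_embedding = [0.2, 0.7, 0.1]
text_embedding = [0.1, 0.8, 0.3]
print(round(cosine_similarity(audio_embedding, text_embedding), 3))
```

The score lies in the range -1 to 1, with higher values indicating closer alignment between the two embeddings.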
## Output

Results are saved to the `results/` directory as JSON files, one per setting:

```
results/
├── sfx.json
├── speech.json
└── music.json
```

Each JSON file contains the averaged metric scores across all samples in that setting.

Example output:

```json
{
  "JudgeOverall": "4.386",
  "JudgeFaithfulness": "4.708",
  "JudgeRecall": "4.934",
  "JudgePrecision": "4.451",
  "ContentEnjoyment": "5.296",
  "ContentUsefulness": "6.903",
  "ProductionComplexity": "4.301",
  "ProductionQuality": "7.100",
  "CLAPSimilarity": "0.271"
}
```
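Because the scores in the example above are stored as strings, downstream analysis usually needs to convert them to floats first. The following is a minimal sketch for collecting the per-setting files into one dictionary. It assumes the `results/` layout shown above, and `load_scores` is a hypothetical helper, not part of the evaluation code.

```python
import json
from pathlib import Path


def load_scores(results_dir: str) -> dict[str, dict[str, float]]:
    """Map each setting name (file stem) to its metric scores as floats."""
    root = Path(results_dir)
    if not root.is_dir():
        return {}  # nothing has been evaluated yet
    scores = {}
    for path in sorted(root.glob("*.json")):
        with path.open() as f:
            raw = json.load(f)
        # Scores are stored as strings such as "4.386"; convert them to floats.
        scores[path.stem] = {name: float(value) for name, value in raw.items()}
    return scores


for setting, metrics in load_scores("results").items():
    print(setting, metrics)
```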