# Evaluation
This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.
## Setup
Before running evaluation, ensure you have:
1. Installed the SAM-Audio package and its dependencies
2. Authenticated with Hugging Face to access the model checkpoints (see the main [README](../README.md) and the example below)
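One common way to authenticate (shown as an example; it assumes you already have a Hugging Face access token) is via the CLI:

```bash
# Prompts for a Hugging Face access token and stores it locally
huggingface-cli login
```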
## Quick Start
Run evaluation on the default setting (`instr-pro`):
```bash
python main.py
```
You can also use multiple GPUs to speed up evaluation:
```bash
torchrun --nproc_per_node=<ngpus> main.py
```
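For example, to evaluate the `sfx` setting on 8 GPUs (the GPU count here is illustrative):

```bash
torchrun --nproc_per_node=8 main.py --setting sfx
```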
Evaluate on a specific setting:
```bash
python main.py --setting sfx
```
Evaluate on multiple settings:
```bash
python main.py --setting sfx speech music
```
## Available Evaluation Settings
Run `python main.py --help` to see all available settings. The examples in this README use `instr-pro`, `sfx`, `speech`, and `music`.
## Command Line Options
```bash
python main.py [OPTIONS]
```
### Options:
- `-s, --setting` - Which setting(s) to evaluate (default: `instr-pro`)
- Choices: See available settings above
- Can specify multiple settings: `--setting sfx speech music`
- `--cache-path` - Where to cache downloaded datasets (default: `~/.cache/sam_audio`)
- `-p, --checkpoint-path` - Model checkpoint to evaluate (default: `facebook/sam-audio-1b`)
- Can use local path or Hugging Face model ID
- `-b, --batch-size` - Batch size for evaluation (default: `1`)
- `-w, --num-workers` - Number of data loading workers (default: `4`)
- `-c, --candidates` - Number of reranking candidates (default: `8`)
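For example, a run that evaluates two settings with a larger batch size and more reranking candidates might look like this (the flag values are illustrative):

```bash
python main.py \
  --setting sfx speech \
  --checkpoint-path facebook/sam-audio-1b \
  --batch-size 4 \
  --num-workers 8 \
  --candidates 16
```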
## Evaluation Metrics
The evaluation framework computes the following metrics:
- **Judge** - SAM Audio Judge quality assessment metric
- **Aesthetic** - Aesthetic quality metric
- **CLAP** - Audio-text alignment metric (CLAP similarity)
- **ImageBind** - Audio-video alignment metric (for visual settings only)
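As a rough illustration of what the CLAP metric measures, the sketch below scores audio-text alignment as the cosine similarity between CLAP embeddings. It uses the `transformers` CLAP implementation and the `laion/clap-htsat-unfused` checkpoint as assumptions; the evaluation framework may use a different CLAP variant internally.

```python
# Illustrative sketch only: the checkpoint and API are assumptions, not
# necessarily what this evaluation framework uses internally.
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_similarity(audio, text, sampling_rate=48000):
    """Cosine similarity between one audio clip (1-D numpy array) and one text prompt."""
    inputs = processor(
        text=[text], audios=[audio], sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        audio_emb = model.get_audio_features(input_features=inputs["input_features"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    return torch.nn.functional.cosine_similarity(audio_emb, text_emb).item()
```

The `CLAPSimilarity` value in the results file is such a per-sample score averaged over the setting.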
## Output
Results are saved to the `results/` directory as JSON files, one per setting:
```
results/
β”œβ”€β”€ sfx.json
β”œβ”€β”€ speech.json
└── music.json
```
Each JSON file contains the averaged metric scores across all samples in that setting.
Example output:
```json
{
  "JudgeOverall": "4.386",
  "JudgeFaithfulness": "4.708",
  "JudgeRecall": "4.934",
  "JudgePrecision": "4.451",
  "ContentEnjoyment": "5.296",
  "ContentUsefulness": "6.903",
  "ProductionComplexity": "4.301",
  "ProductionQuality": "7.100",
  "CLAPSimilarity": "0.271"
}
```
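Because the metric values are serialized as strings, a small helper like the following (hypothetical, but matching the layout described above) can load and compare scores across settings:

```python
import json
from pathlib import Path

# Load every per-setting results file (results/<setting>.json, as described above)
results = {}
for path in sorted(Path("results").glob("*.json")):
    with path.open() as f:
        scores = json.load(f)
    # Metric values are serialized as strings in the example output, so cast to float
    results[path.stem] = {metric: float(value) for metric, value in scores.items()}

# Print one headline metric per setting
for setting, scores in results.items():
    print(f"{setting}: JudgeOverall={scores.get('JudgeOverall')}")
```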