
Evaluation

This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.

Setup

Before running evaluation, ensure you have:

  1. Installed the SAM-Audio package and its dependencies
  2. Authenticated with Hugging Face to access the model checkpoints (see main README)
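
For example, from the repository root (a minimal sketch; the exact install step is covered in the main README):

pip install -e .          # install SAM-Audio and its dependencies (assumed repo layout)
huggingface-cli login     # authenticate to download the model checkpoints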

Quick Start

Run evaluation on the default setting (instr-pro):

python main.py

You can also use multiple GPUs to speed up evaluation:

torchrun --nproc_per_node=<ngpus> main.py

Evaluate on a specific setting:

python main.py --setting sfx

Evaluate on multiple settings:

python main.py --setting sfx speech music

Available Evaluation Settings

Run python main.py --help to see all available settings. The settings referenced in this README include instr-pro (the default), sfx, speech, and music; the --help output lists the complete set.

Command Line Options

python main.py [OPTIONS]

Options:

  • -s, --setting - Which setting(s) to evaluate (default: instr-pro)

    • Choices: See available settings above
    • Can specify multiple settings: --setting sfx speech music
  • --cache-path - Where to cache downloaded datasets (default: ~/.cache/sam_audio)

  • -p, --checkpoint-path - Model checkpoint to evaluate (default: facebook/sam-audio-1b)

    • Can use local path or Hugging Face model ID
  • -b, --batch-size - Batch size for evaluation (default: 1)

  • -w, --num-workers - Number of data loading workers (default: 4)

  • -c, --candidates - Number of reranking candidates (default: 8)
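
For example, combining several of the options above to evaluate two settings with a larger batch size and more reranking candidates:

python main.py --setting sfx music --batch-size 4 --candidates 16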

Evaluation Metrics

The evaluation framework computes the following metrics:

  • Judge - SAM Audio Judge quality assessment metric
  • Aesthetic - Aesthetic quality metric
  • CLAP - Audio-text alignment metric (CLAP similarity)
  • ImageBind - Audio-video alignment metric (for visual settings only)

Output

Results are saved to the results/ directory as JSON files, one per setting:

results/
├── sfx.json
├── speech.json
└── music.json

Each JSON file contains the averaged metric scores across all samples in that setting.

Example output:

{
    "JudgeOverall": "4.386",
    "JudgeFaithfulness": "4.708",
    "JudgeRecall": "4.934",
    "JudgePrecision": "4.451",
    "ContentEnjoyment": "5.296",
    "ContentUsefulness": "6.903",
    "ProductionComplexity": "4.301",
    "ProductionQuality": "7.100",
    "CLAPSimilarity": "0.271"
}
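
To compare results across settings, here is a minimal sketch in Python (assuming the results/ layout above; note that the scores are serialized as JSON strings):

import json
from pathlib import Path

# Print the averaged metric scores for every evaluated setting.
for path in sorted(Path("results").glob("*.json")):
    scores = json.loads(path.read_text())
    print(path.stem)
    for metric, value in scores.items():
        # Values are stored as strings; convert for numeric formatting.
        print(f"  {metric}: {float(value):.3f}")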