
Evaluation

This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.

Setup

Before running evaluation, ensure you have:

  1. Installed the SAM-Audio package and its dependencies
  2. Authenticated with Hugging Face to access the model checkpoints (see main README)
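
For example, from the repository root (a minimal sketch; the exact install step is covered in the main README):

pip install -e .          # install SAM-Audio and its dependencies (assumed repo layout)
huggingface-cli login     # authenticate to download the model checkpoints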

Quick Start

Run evaluation on the default setting (instr-pro):

python main.py

You can also use multiple GPUs to speed up evaluation:

torchrun --nproc_per_node=<ngpus> main.py

Evaluate on a specific setting:

python main.py --setting sfx

Evaluate on multiple settings:

python main.py --setting sfx speech music

Available Evaluation Settings

Run python main.py --help to see all available settings. The settings referenced in this README include instr-pro (the default), sfx, speech, and music; the --help output lists the complete set.

Command Line Options

python main.py [OPTIONS]

Options:

  • -s, --setting - Which setting(s) to evaluate (default: instr-pro)

    • Choices: See available settings above
    • Can specify multiple settings: --setting sfx speech music
  • --cache-path - Where to cache downloaded datasets (default: ~/.cache/sam_audio)

  • -p, --checkpoint-path - Model checkpoint to evaluate (default: facebook/sam-audio-1b)

    • Can use local path or Hugging Face model ID
  • -b, --batch-size - Batch size for evaluation (default: 1)

  • -w, --num-workers - Number of data loading workers (default: 4)

  • -c, --candidates - Number of reranking candidates (default: 8)
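
For example, combining several of the options above to evaluate two settings with a larger batch size and more reranking candidates:

python main.py --setting sfx music --batch-size 4 --candidates 16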

Evaluation Metrics

The evaluation framework computes the following metrics:

  • Judge - SAM Audio Judge quality assessment metric
  • Aesthetic - Aesthetic quality metric
  • CLAP - Audio-text alignment metric (CLAP similarity)
  • ImageBind - Audio-video alignment metric (for visual settings only)

Output

Results are saved to the results/ directory as JSON files, one per setting:

results/
├── sfx.json
├── speech.json
└── music.json

Each JSON file contains the averaged metric scores across all samples in that setting.

Example output:

{
    "JudgeOverall": "4.386",
    "JudgeFaithfulness": "4.708",
    "JudgeRecall": "4.934",
    "JudgePrecision": "4.451",
    "ContentEnjoyment": "5.296",
    "ContentUsefulness": "6.903",
    "ProductionComplexity": "4.301",
    "ProductionQuality": "7.100",
    "CLAPSimilarity": "0.271"
}
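
To compare results across settings, here is a minimal sketch in Python (assuming the results/ layout above; note that the scores are serialized as JSON strings):

import json
from pathlib import Path

# Print the averaged metric scores for every evaluated setting.
for path in sorted(Path("results").glob("*.json")):
    scores = json.loads(path.read_text())
    print(path.stem)
    for metric, value in scores.items():
        # Values are stored as strings; convert for numeric formatting.
        print(f"  {metric}: {float(value):.3f}")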