---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Small)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling for audio source separation.

## Model Files

| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support, install onnxruntime-gpu instead of onnxruntime:
pip install onnxruntime-gpu
```
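
To confirm which execution providers your installed build exposes (`CUDAExecutionProvider` should appear when `onnxruntime-gpu` is set up correctly):

```python
import onnxruntime as ort

# CUDAExecutionProvider is listed only when onnxruntime-gpu and a
# compatible CUDA runtime are present.
print(ort.get_available_providers())
```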

## Usage Examples

### Audio-Only Separation
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "a person speaking" \
    --output separated.wav
```

### Video-Guided Separation
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "the sound of typing" \
    --output separated.wav
```

### Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "horn" \
    --predict-spans \
    --output separated.wav
```

This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
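
As a rough illustration, span prediction amounts to thresholding per-chunk audio-text similarity scores and merging adjacent hits. The sketch below assumes the chunking described under Model Specifications (~3.3 s windows, 50% overlap, threshold 0.3); the actual PEAFrame post-processing in `onnx_inference.py` may differ:

```python
from typing import List, Tuple

def scores_to_spans(scores: List[float], chunk_s: float = 3.3,
                    hop_s: float = 1.65,  # 50% overlap
                    threshold: float = 0.3) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold chunks into (start, end) spans in seconds."""
    spans: List[Tuple[float, float]] = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        start, end = i * hop_s, i * hop_s + chunk_s
        if spans and start <= spans[-1][1]:  # overlaps the previous span: merge
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans

print(scores_to_spans([0.5, 0.1, 0.1, 0.6]))  # ~[(0.0, 3.3), (4.95, 8.25)]
```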

### Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
```bash
# Focus on specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --anchor + 4.5 7.0 \
    --anchor + 12.0 15.5 \
    --output separated.wav

# Ignore specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "background music" \
    --anchor - 0.0 3.0 \
    --output separated.wav
```
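
For reference, anchor times map onto latent-frame indices via the 48kHz sample rate and 1536-sample hop listed under Model Specifications (48000 / 1536 = 31.25 frames per second); how `onnx_inference.py` consumes anchors internally is an assumption here:

```python
SAMPLE_RATE = 48_000
HOP_LENGTH = 1536
FRAMES_PER_SEC = SAMPLE_RATE / HOP_LENGTH  # 31.25 latent frames per second

def anchor_to_frames(start_s: float, end_s: float) -> tuple:
    """Map a (start, end) anchor in seconds to latent-frame indices."""
    return int(start_s * FRAMES_PER_SEC), int(end_s * FRAMES_PER_SEC)

print(anchor_to_frames(4.5, 7.0))  # (140, 218), i.e. "--anchor + 4.5 7.0"
```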

### CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --rerank \
    --num-candidates 4 \
    --output separated.wav
```

Reranking generates multiple separation candidates with different random seeds, scores each candidate's audio-text similarity with CLAP, and keeps the candidate that best matches the text description (the scoring step is sketched below). This can improve quality at the cost of roughly one full inference pass per candidate (~4x at the default).

Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
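
A minimal sketch of the scoring step, assuming both CLAP encoders yield 512-dim embeddings (per Model Specifications) compared by cosine similarity; how candidates are actually embedded in `onnx_inference.py` is not shown:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best(candidate_embs, text_emb: np.ndarray) -> int:
    """Index of the candidate whose CLAP audio embedding best matches the text."""
    return int(np.argmax([cosine(e, text_emb) for e in candidate_embs]))
```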

### Visual Prompting with SAM3 Mask
```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
    --video input.mp4 \
    --mask object_mask.mp4 \
    --text "" \
    --output isolated.wav \
    --output-video visualization.mp4
```

### Using a Custom Model Directory
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "woman speaking" \
    --model-dir ./my_onnx_models \
    --output separated.wav
```

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16; sketched after this list)
- **PEAFrame**: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4
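
The midpoint solver noted above is a standard second-order ODE integrator; a generic sketch over a velocity field follows. Here `velocity` stands in for one call to `dit_single_step.onnx`, whose real inputs (latents, conditioning, timestep encoding) are omitted:

```python
import numpy as np

def midpoint_solve(velocity, x: np.ndarray, steps: int = 16) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x_mid = x + 0.5 * dt * velocity(x, t)       # half-step estimate
        x = x + dt * velocity(x_mid, t + 0.5 * dt)  # step with the midpoint slope
        t += dt
    return x
```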

## Exporting Models

Export scripts are in the `onnx_export/` directory.

### Export All Models
```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```

### Export Individual Components
```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```

### FP16 Quantization (for large models)

For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:

```bash
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
    --output-dir ./onnx_models_large_fp16 \
    --model-id facebook/sam-audio-large \
    --fp16 \
    --device cuda
```

The inference script automatically detects FP16 models and handles input conversion.
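
A hedged sketch of how such detection can work with ONNX Runtime, by inspecting the declared input types on a session (whether `onnx_inference.py` does exactly this is an assumption):

```python
import numpy as np
import onnxruntime as ort

def match_input_dtype(session: ort.InferenceSession, name: str,
                      array: np.ndarray) -> np.ndarray:
    """Cast `array` to float16 when the model declares that input as float16."""
    for inp in session.get_inputs():
        if inp.name == name and inp.type == "tensor(float16)":
            return array.astype(np.float16)
    return array
```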

## Export Scripts Reference

| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).