---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---
# SAM-Audio ONNX (Small)
ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling framework for audio source separation.
## Model Files
| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |
## Installation
```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
```
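If you drive the exported models directly with `onnxruntime` rather than through `onnx_inference.py`, the usual pattern is to prefer the CUDA execution provider when `onnxruntime-gpu` is installed and fall back to CPU otherwise. A minimal sketch (`pick_providers` is our illustrative helper, not part of this repo):

```python
def pick_providers(available):
    """Order execution providers: CUDA first when available, CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    ordered = [p for p in preferred if p in available]
    return ordered or available
```

Pass the result to the session constructor, e.g. `onnxruntime.InferenceSession(path, providers=pick_providers(onnxruntime.get_available_providers()))`.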
## Usage Examples
### Audio-Only Separation
```bash
python onnx_inference.py \
--audio input.wav \
--text "a person speaking" \
--output separated.wav
```
### Video-Guided Separation
```bash
python onnx_inference.py \
--video input.mp4 \
--text "the sound of typing" \
--output separated.wav
```
### Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
```bash
python onnx_inference.py \
--audio input.wav \
--text "horn" \
--predict-spans \
--output separated.wav
```
This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
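Per the specifications below, PEAFrame scores the audio in roughly 3.3 s chunks with 50% overlap. A sketch of how such window start times could be computed (our illustration only; the actual windowing lives in `onnx_inference.py`):

```python
def chunk_starts(duration_s, chunk_s=3.3, overlap=0.5):
    """Start times for fixed-length windows with fractional overlap.

    A final window is pinned to the end of the clip so no audio is dropped.
    """
    hop = chunk_s * (1.0 - overlap)
    starts, t = [], 0.0
    while t + chunk_s < duration_s:
        starts.append(t)
        t += hop
    starts.append(max(duration_s - chunk_s, 0.0))
    return starts
```

For a 10 s clip this yields windows every 1.65 s plus one final window anchored at the end.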
### Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
```bash
# Focus on specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--anchor + 4.5 7.0 \
--anchor + 12.0 15.5 \
--output separated.wav
# Ignore specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "background music" \
--anchor - 0.0 3.0 \
--output separated.wav
```
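Anchors are given in seconds, but the model operates on latent frames (48 kHz audio with a hop length of 1536 samples, i.e. 31.25 frames per second). A hedged sketch of the conversion (`anchor_to_frames` is our name; the exact rounding in `onnx_inference.py` may differ):

```python
import math

SAMPLE_RATE = 48_000
HOP_LENGTH = 1536  # one latent frame spans 1536 samples = 32 ms

def anchor_to_frames(start_s, end_s):
    """Map an anchor (in seconds) to a latent frame range, widening outward."""
    fps = SAMPLE_RATE / HOP_LENGTH  # 31.25 frames per second
    return math.floor(start_s * fps), math.ceil(end_s * fps)
```

For example, `--anchor + 4.5 7.0` covers latent frames 140 through 219.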
### CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
```bash
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--rerank \
--num-candidates 4 \
--output separated.wav
```
Reranking generates multiple separation candidates with different random seeds, scores each against the text description using CLAP audio-text similarity, and keeps the best-scoring candidate. This can improve quality at the cost of roughly N× inference time for N candidates (about 4× with the default).
Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
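Under the hood, reranking reduces to a nearest-embedding search: embed each candidate with the CLAP audio encoder, embed the prompt with the CLAP text encoder, and pick the pair with the highest cosine similarity. A self-contained sketch with placeholder vectors (the real 512-dim embeddings come from `clap_audio_encoder.onnx` and `clap_text_encoder.onnx`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_candidate(candidate_embs, text_emb):
    """Index of the candidate whose audio embedding best matches the text embedding."""
    return max(range(len(candidate_embs)),
               key=lambda i: cosine(candidate_embs[i], text_emb))
```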
### Visual Prompting with SAM3 Mask
```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
--video input.mp4 \
--mask object_mask.mp4 \
--text "" \
--output isolated.wav \
--output-video visualization.mp4
```
### Using a Custom Model Directory
```bash
python onnx_inference.py \
--video input.mp4 \
--text "woman speaking" \
--model-dir ./my_onnx_models \
--output separated.wav
```
## Model Specifications
- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16)
- **PEAFrame**: Audio-text similarity model for span detection
- Uses ModernBERT tokenizer
- Processes audio in ~3.3s chunks with 50% overlap
- Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
- Audio encoder: HTSAT-tiny
- Text encoder: RoBERTa-base
- Embedding dimension: 512
- Default candidates: 4
## Exporting Models
Export scripts are in the `onnx_export/` directory.
### Export All Models
```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```
### Export Individual Components
```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda
# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small
# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small
# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models
# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify
# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```
### FP16 Quantization (for large models)
For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:
```bash
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
--output-dir ./onnx_models_large_fp16 \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
```
The inference script automatically detects FP16 models and handles input conversion.
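"Detecting FP16 models" amounts to inspecting each graph input's declared type and casting the NumPy feed to match. A hedged sketch (`to_feed_dtype` is our helper; in practice the type string comes from `session.get_inputs()[i].type`):

```python
import numpy as np

def to_feed_dtype(arr, onnx_type):
    """Cast a feed array to the dtype the exported graph declares."""
    table = {"tensor(float16)": np.float16, "tensor(float)": np.float32}
    return arr.astype(table.get(onnx_type, arr.dtype))
```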
## Export Scripts Reference
| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |
## License
SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.
## Acknowledgments
Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).