---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---
# SAM-Audio ONNX (Small)
ONNX exports of [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling framework for audio source separation.
## Model Files
| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |
## Installation
```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
```
## Usage Examples
### Audio-Only Separation
```bash
python onnx_inference.py \
--audio input.wav \
--text "a person speaking" \
--output separated.wav
```
### Video-Guided Separation
```bash
python onnx_inference.py \
--video input.mp4 \
--text "the sound of typing" \
--output separated.wav
```
### Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
```bash
python onnx_inference.py \
--audio input.wav \
--text "horn" \
--predict-spans \
--output separated.wav
```
This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
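Under the hood (per the specifications below), PEAFrame scores ~3.3 s windows with 50% overlap and keeps windows above a similarity threshold of 0.3. A minimal sketch of that windowing and span-merging arithmetic, with hypothetical helper names (the actual logic lives in `onnx_inference.py`):

```python
SAMPLE_RATE = 48_000
CHUNK_SEC = 3.3    # approximate PEAFrame window length
OVERLAP = 0.5      # 50% overlap between consecutive windows
THRESHOLD = 0.3    # default similarity threshold

def chunk_spans(num_samples: int):
    """Yield (start_sec, end_sec) windows covering the audio with 50% overlap."""
    chunk = int(CHUNK_SEC * SAMPLE_RATE)
    hop = int(chunk * (1 - OVERLAP))
    spans = []
    for start in range(0, max(num_samples - chunk, 0) + hop, hop):
        end = min(start + chunk, num_samples)
        spans.append((start / SAMPLE_RATE, end / SAMPLE_RATE))
        if end == num_samples:
            break
    return spans

def merge_active(spans, scores, threshold=THRESHOLD):
    """Keep windows whose audio-text similarity clears the threshold,
    merging overlapping windows into contiguous spans."""
    active = [s for s, sc in zip(spans, scores) if sc >= threshold]
    merged = []
    for start, end in active:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The merged spans then play the same role as positive anchors in the next section.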
### Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
```bash
# Focus on specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--anchor + 4.5 7.0 \
--anchor + 12.0 15.5 \
--output separated.wav
# Ignore specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "background music" \
--anchor - 0.0 3.0 \
--output separated.wav
```
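Anchors are given in seconds, but the model conditions on latent frames (48 kHz audio with a hop length of 1536 samples, per the specifications below, i.e. 31.25 frames per second). A sketch of the time-to-frame arithmetic; the exact conditioning format is internal to `onnx_inference.py`, and `anchors_to_mask` is a hypothetical name:

```python
SAMPLE_RATE = 48_000
HOP_LENGTH = 1536                            # one latent frame = 1536 samples = 32 ms
FRAMES_PER_SEC = SAMPLE_RATE / HOP_LENGTH    # 31.25 frames per second

def anchors_to_mask(anchors, duration_sec):
    """Map (sign, start_sec, end_sec) anchors to a per-frame mask:
    +1 = focus on this span, -1 = ignore it, 0 = unconstrained."""
    num_frames = int(duration_sec * FRAMES_PER_SEC)
    mask = [0] * num_frames
    for sign, start, end in anchors:
        lo = max(0, int(start * FRAMES_PER_SEC))
        hi = min(num_frames, int(end * FRAMES_PER_SEC))
        for i in range(lo, hi):
            mask[i] = 1 if sign == "+" else -1
    return mask
```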
### CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
```bash
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--rerank \
--num-candidates 4 \
--output separated.wav
```
Reranking generates multiple separation candidates with different random seeds, scores each against the text description using CLAP audio-text similarity, and keeps the best-matching candidate. This can improve quality at the cost of roughly N× inference time for N candidates (~4× with the default).
Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
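The selection step itself reduces to cosine similarity in CLAP's shared 512-dimensional embedding space. A hedged sketch assuming precomputed embeddings from `clap_audio_encoder.onnx` and `clap_text_encoder.onnx` (function names are illustrative, not the script's actual API):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best(candidate_embeds, text_embed):
    """Return the index of the separation candidate whose CLAP audio
    embedding is most similar to the CLAP text embedding, plus all scores."""
    scores = [cosine(e, text_embed) for e in candidate_embeds]
    return int(np.argmax(scores)), scores
```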
### Visual Prompting with SAM3 Mask
```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
--video input.mp4 \
--mask object_mask.mp4 \
--text "" \
--output isolated.wav \
--output-video visualization.mp4
```
### Using a Custom Model Directory
```bash
python onnx_inference.py \
--video input.mp4 \
--text "woman speaking" \
--model-dir ./my_onnx_models \
--output separated.wav
```
## Model Specifications
- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16)
- **PEAFrame**: Audio-text similarity model for span detection
- Uses ModernBERT tokenizer
- Processes audio in ~3.3s chunks with 50% overlap
- Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
- Audio encoder: HTSAT-tiny
- Text encoder: RoBERTa-base
- Embedding dimension: 512
- Default candidates: 4
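Because `dit_single_step.onnx` exposes a single denoising step, the driver loop integrates the flow from noise to clean latent with the midpoint method. A minimal sketch with a generic velocity function `v(x, t)`; in the real pipeline each `v` call would be one DiT ONNX session run:

```python
def midpoint_solve(v, x0, num_steps=16):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the midpoint method.
    The midpoint rule evaluates v twice per step, so the default 16 steps
    cost 32 DiT forward passes."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x_mid = x + 0.5 * dt * v(x, t)        # half step to the midpoint
        x = x + dt * v(x_mid, t + 0.5 * dt)   # full step using the midpoint slope
    return x
```

Lowering `num_steps` trades output quality for speed; the midpoint method's second-order accuracy is why 16 steps usually suffice.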
## Exporting Models
Export scripts are in the `onnx_export/` directory.
### Export All Models
```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```
### Export Individual Components
```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda
# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small
# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small
# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models
# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify
# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```
### FP16 Quantization (for large models)
For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:
```bash
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
--output-dir ./onnx_models_large_fp16 \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
```
The inference script automatically detects FP16 models and handles input conversion.
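That detection boils down to reading the dtype each graph input declares (as reported by `session.get_inputs()` in onnxruntime, e.g. `"tensor(float16)"`) and casting the feed to match. A sketch of the casting logic with a hypothetical helper; the actual handling is inside `onnx_inference.py`:

```python
import numpy as np

# Map ONNX type strings (as returned by session.get_inputs()[i].type)
# to NumPy dtypes.
ONNX_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
}

def cast_inputs(feed, input_types):
    """Cast each feed array to the dtype the ONNX graph declares, so an
    FP16-exported DiT transparently accepts FP32 inputs.
    `input_types` maps input name -> ONNX type string."""
    out = {}
    for name, arr in feed.items():
        want = ONNX_TO_NUMPY.get(input_types.get(name, ""))
        out[name] = arr.astype(want) if want is not None and arr.dtype != want else arr
    return out
```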
## Export Scripts Reference
| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |
## License
SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.
## Acknowledgments
Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).