---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Small)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling for audio source separation.

## Model Files

| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support, install onnxruntime-gpu instead of onnxruntime:
pip install onnxruntime-gpu
```
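
To confirm which execution providers your installed build exposes (`CUDAExecutionProvider` should appear when `onnxruntime-gpu` is set up correctly):

```python
import onnxruntime as ort

# CUDAExecutionProvider is listed only when onnxruntime-gpu and a
# compatible CUDA runtime are present.
print(ort.get_available_providers())
```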

## Usage Examples

### Audio-Only Separation
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "a person speaking" \
    --output separated.wav
```

### Video-Guided Separation
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "the sound of typing" \
    --output separated.wav
```

### Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "horn" \
    --predict-spans \
    --output separated.wav
```

This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
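
As a rough illustration, span prediction amounts to thresholding per-chunk audio-text similarity scores and merging adjacent hits. The sketch below assumes the chunking described under Model Specifications (~3.3 s windows, 50% overlap, threshold 0.3); the actual PEAFrame post-processing in `onnx_inference.py` may differ:

```python
from typing import List, Tuple

def scores_to_spans(scores: List[float], chunk_s: float = 3.3,
                    hop_s: float = 1.65,  # 50% overlap
                    threshold: float = 0.3) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold chunks into (start, end) spans in seconds."""
    spans: List[Tuple[float, float]] = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        start, end = i * hop_s, i * hop_s + chunk_s
        if spans and start <= spans[-1][1]:  # overlaps the previous span: merge
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans

print(scores_to_spans([0.5, 0.1, 0.1, 0.6]))  # ~[(0.0, 3.3), (4.95, 8.25)]
```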

### Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
```bash
# Focus on specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --anchor + 4.5 7.0 \
    --anchor + 12.0 15.5 \
    --output separated.wav

# Ignore specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "background music" \
    --anchor - 0.0 3.0 \
    --output separated.wav
```
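
For reference, anchor times map onto latent-frame indices via the 48kHz sample rate and 1536-sample hop listed under Model Specifications (48000 / 1536 = 31.25 frames per second); how `onnx_inference.py` consumes anchors internally is an assumption here:

```python
SAMPLE_RATE = 48_000
HOP_LENGTH = 1536
FRAMES_PER_SEC = SAMPLE_RATE / HOP_LENGTH  # 31.25 latent frames per second

def anchor_to_frames(start_s: float, end_s: float) -> tuple:
    """Map a (start, end) anchor in seconds to latent-frame indices."""
    return int(start_s * FRAMES_PER_SEC), int(end_s * FRAMES_PER_SEC)

print(anchor_to_frames(4.5, 7.0))  # (140, 218), i.e. "--anchor + 4.5 7.0"
```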

### CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --rerank \
    --num-candidates 4 \
    --output separated.wav
```

Reranking generates multiple separation candidates with different random seeds, scores each candidate's audio-text similarity with CLAP, and keeps the candidate that best matches the text description (the scoring step is sketched below). This can improve quality at the cost of roughly one full inference pass per candidate (~4x at the default).

Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
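
A minimal sketch of the scoring step, assuming both CLAP encoders yield 512-dim embeddings (per Model Specifications) compared by cosine similarity; how candidates are actually embedded in `onnx_inference.py` is not shown:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best(candidate_embs, text_emb: np.ndarray) -> int:
    """Index of the candidate whose CLAP audio embedding best matches the text."""
    return int(np.argmax([cosine(e, text_emb) for e in candidate_embs]))
```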

### Visual Prompting with SAM3 Mask
```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
    --video input.mp4 \
    --mask object_mask.mp4 \
    --text "" \
    --output isolated.wav \
    --output-video visualization.mp4
```

### Using a Custom Model Directory
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "woman speaking" \
    --model-dir ./my_onnx_models \
    --output separated.wav
```

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16; sketched after this list)
- **PEAFrame**: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4
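
The midpoint solver noted above is a standard second-order ODE integrator; a generic sketch over a velocity field follows. Here `velocity` stands in for one call to `dit_single_step.onnx`, whose real inputs (latents, conditioning, timestep encoding) are omitted:

```python
import numpy as np

def midpoint_solve(velocity, x: np.ndarray, steps: int = 16) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x_mid = x + 0.5 * dt * velocity(x, t)       # half-step estimate
        x = x + dt * velocity(x_mid, t + 0.5 * dt)  # step with the midpoint slope
        t += dt
    return x
```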

## Exporting Models

Export scripts are in the `onnx_export/` directory.

### Export All Models
```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```

### Export Individual Components
```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```

### FP16 Quantization (for large models)

For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:

```bash
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
    --output-dir ./onnx_models_large_fp16 \
    --model-id facebook/sam-audio-large \
    --fp16 \
    --device cuda
```

The inference script automatically detects FP16 models and handles input conversion.
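
A hedged sketch of how such detection can work with ONNX Runtime, by inspecting the declared input types on a session (whether `onnx_inference.py` does exactly this is an assumption):

```python
import numpy as np
import onnxruntime as ort

def match_input_dtype(session: ort.InferenceSession, name: str,
                      array: np.ndarray) -> np.ndarray:
    """Cast `array` to float16 when the model declares that input as float16."""
    for inp in session.get_inputs():
        if inp.name == name and inp.type == "tensor(float16)":
            return array.astype(np.float16)
    return array
```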

## Export Scripts Reference

| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).