---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Small)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (facebook/sam-audio-small) - Meta's Semantic Audio Modeling for audio source separation.

## Model Files

| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
```
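
The bundled `onnx_inference.py` wires everything together, but the exported graphs can also be driven directly with ONNX Runtime. A minimal sketch (the `onnx_models/` path is an assumption; point it at wherever you downloaded the files):

```python
# Minimal sketch: open two of the exported graphs with a CUDA -> CPU
# provider fallback, then inspect their declared inputs.
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
encoder = ort.InferenceSession("onnx_models/dacvae_encoder.onnx", providers=providers)
decoder = ort.InferenceSession("onnx_models/dacvae_decoder.onnx", providers=providers)

for inp in encoder.get_inputs():
    print(inp.name, inp.shape, inp.type)  # discover names/shapes before feeding data
```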

## Usage Examples

### Audio-Only Separation
```bash
python onnx_inference.py \
  --audio input.wav \
  --text "a person speaking" \
  --output separated.wav
```
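
Under the hood, the text prompt is embedded with the exported T5 encoder. A hedged sketch of that step; paths are illustrative, and the graph's input names are discovered at runtime rather than assumed:

```python
# Hedged sketch: tokenize the prompt and run t5_encoder.onnx.
# Assumes tokenizer/ contains the standard T5 SentencePiece files.
import onnxruntime as ort
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("onnx_models/tokenizer")
t5 = ort.InferenceSession("onnx_models/t5_encoder.onnx",
                          providers=["CPUExecutionProvider"])

enc = tok("a person speaking", return_tensors="np")
feeds = {}
for inp in t5.get_inputs():  # match ids/mask to whatever the graph declares
    feeds[inp.name] = enc["attention_mask"] if "mask" in inp.name else enc["input_ids"]
text_emb = t5.run(None, feeds)[0]
print(text_emb.shape)  # expected (1, seq_len, 768) per the specs below
```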

### Video-Guided Separation
```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "the sound of typing" \
  --output separated.wav
```
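
Video conditioning resizes frames to 336×336 and runs them through the vision encoder. A hedged sketch; the CLIP-style mean/std normalization here is an assumption, so verify the exact preprocessing against `onnx_inference.py`:

```python
# Hedged sketch: decode one frame, preprocess to 336x336, and embed it.
import torch
import onnxruntime as ort
from torchcodec.decoders import VideoDecoder
from torchvision.transforms.v2 import functional as F

frame = VideoDecoder("input.mp4")[0]                 # uint8 tensor, (C, H, W)
frame = F.resize(frame, [336, 336]).float() / 255.0
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(3, 1, 1)  # assumed CLIP stats
std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(3, 1, 1)
frame = (frame - mean) / std

vis = ort.InferenceSession("onnx_models/vision_encoder.onnx",
                           providers=["CPUExecutionProvider"])
emb = vis.run(None, {vis.get_inputs()[0].name: frame.unsqueeze(0).numpy()})[0]
print(emb.shape)  # 1024-dim features per the specs below
```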

### Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
```bash
python onnx_inference.py \
  --audio input.wav \
  --text "horn" \
  --predict-spans \
  --output separated.wav
```

This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
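
Conceptually, PEAFrame scores short overlapping chunks against the text and keeps the regions above a threshold (0.3 by default, per the specs below). A toy sketch of the thresholding step; the real chunking and merging logic lives in `onnx_inference.py`:

```python
# Hedged sketch: merge consecutive above-threshold frames into time spans.
def scores_to_spans(scores, frame_dur, threshold=0.3):
    """Return (start_s, end_s) spans where scores stay >= threshold."""
    spans, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i * frame_dur          # span opens
        elif s < threshold and start is not None:
            spans.append((start, i * frame_dur))  # span closes
            start = None
    if start is not None:
        spans.append((start, len(scores) * frame_dur))
    return spans

print(scores_to_spans([0.1, 0.5, 0.6, 0.2, 0.4], frame_dur=0.5))
# -> [(0.5, 1.5), (2.0, 2.5)]
```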

### Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
```bash
# Focus on specific time ranges
python onnx_inference.py \
  --audio input.wav \
  --text "person speaking" \
  --anchor + 4.5 7.0 \
  --anchor + 12.0 15.5 \
  --output separated.wav

# Ignore specific time ranges
python onnx_inference.py \
  --audio input.wav \
  --text "background music" \
  --anchor - 0.0 3.0 \
  --output separated.wav
```
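
Anchor times map onto the model's latent timeline: at 48 kHz with a hop of 1536 samples (see the specs below), the latent rate is 48000 / 1536 = 31.25 frames per second. A sketch of that conversion; the exact rounding used internally may differ:

```python
# Hedged sketch: convert anchor times (seconds) to latent frame indices.
def seconds_to_frames(t, sr=48000, hop=1536):
    return int(round(t * sr / hop))

print(seconds_to_frames(4.5), seconds_to_frames(7.0))  # 141 219
```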

### CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
```bash
python onnx_inference.py \
  --audio input.wav \
  --text "person speaking" \
  --rerank \
  --num-candidates 4 \
  --output separated.wav
```

Reranking generates multiple separation candidates with different random seeds and uses CLAP to score audio-text similarity, selecting the candidate that best matches the text description. This can improve quality, at the cost of roughly one full separation pass per candidate (~4x inference time at the default of 4 candidates).

Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
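
The reranking objective itself is simple: cosine similarity between the 512-dim CLAP audio and text embeddings (see the specs below), with the highest-scoring candidate winning. A hedged sketch with embedding extraction elided:

```python
# Hedged sketch of the CLAP reranking selection rule.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best(candidate_audio_embs, text_emb):
    """Return (index of best candidate, all similarity scores)."""
    scores = [cosine(e, text_emb) for e in candidate_audio_embs]
    return int(np.argmax(scores)), scores
```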

### Visual Prompting with SAM3 Mask
```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
  --video input.mp4 \
  --mask object_mask.mp4 \
  --text "" \
  --output isolated.wav \
  --output-video visualization.mp4
```

### Using a Custom Model Directory
```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "woman speaking" \
  --model-dir ./my_onnx_models \
  --output separated.wav
```

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16; sketched after this list)
- **PEAFrame**: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4
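
For reference, here is the midpoint update the denoising loop performs, with `dit_step` standing in for a call to `dit_single_step.onnx`; the function signature and conditioning layout are illustrative, not the graph's real interface:

```python
# Hedged sketch: midpoint ODE integration of dx/dt = v(x, t) from t=0 to t=1.
def midpoint_solve(x, cond, dit_step, num_steps=16):
    dt = 1.0 / num_steps
    t = 0.0
    for _ in range(num_steps):
        k1 = dit_step(x, t, cond)                       # velocity at step start
        v_half = dit_step(x + 0.5 * dt * k1,            # velocity at midpoint
                          t + 0.5 * dt, cond)
        x = x + dt * v_half                             # full step with midpoint slope
        t += dt
    return x
```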

## Exporting Models

Export scripts are in the `onnx_export/` directory.

### Export All Models
```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```

### Export Individual Components
```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```

### FP16 Quantization (for large models)

For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:

```bash
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
  --output-dir ./onnx_models_large_fp16 \
  --model-id facebook/sam-audio-large \
  --fp16 \
  --device cuda
```

The inference script automatically detects FP16 models and handles input conversion.
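
One way such detection can work, as a hedged sketch (the path is illustrative): read the dtype the graph declares for its inputs and cast host arrays to match before `session.run()`.

```python
# Hedged sketch: pick the host dtype from the graph's declared input type.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx_models_large_fp16/dit_single_step.onnx",
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
inp = sess.get_inputs()[0]
np_dtype = np.float16 if "float16" in inp.type else np.float32
# x = x.astype(np_dtype)  # cast each feed before sess.run()
```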

## Export Scripts Reference

| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).