---
license: other
base_model: facebook/sam-audio-small
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Small)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling for audio source separation.

## Model Files

| File | Description | Size |
|------|-------------|------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers

# For CUDA support:
pip install onnxruntime-gpu
```

## Usage Examples

### Audio-Only Separation

```bash
python onnx_inference.py \
  --audio input.wav \
  --text "a person speaking" \
  --output separated.wav
```

### Video-Guided Separation

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "the sound of typing" \
  --output separated.wav
```

### Automatic Span Prediction

Use PEAFrame to automatically detect time spans matching your text description:

```bash
python onnx_inference.py \
  --audio input.wav \
  --text "horn" \
  --predict-spans \
  --output separated.wav
```

This is ideal for long audio where the sound you want to isolate appears only intermittently: the model automatically detects when the target sound occurs and focuses on those segments.

### Manual Anchors

Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):

```bash
# Focus on specific time ranges
python onnx_inference.py \
  --audio input.wav \
  --text "person speaking" \
  --anchor + 4.5 7.0 \
  --anchor + 12.0 15.5 \
  --output separated.wav

# Ignore specific time ranges
python onnx_inference.py \
  --audio input.wav \
  --text "background music" \
  --anchor - 0.0 3.0 \
  --output separated.wav
```

### CLAP Reranking

Generate multiple candidates and select the best using CLAP audio-text similarity:

```bash
python onnx_inference.py \
  --audio input.wav \
  --text "person speaking" \
  --rerank \
  --num-candidates 4 \
  --output separated.wav
```

Reranking generates multiple separation candidates with different random seeds, scores each one against the text description using CLAP audio-text similarity, and keeps the best-matching candidate (see the sketch after the option list below). This can improve quality at the cost of roughly 4× the inference time.

Options:

- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
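The selection step is easy to picture with the exported CLAP encoders. The sketch below is a simplified illustration, not the code shipped in `onnx_inference.py`: the file paths, the feed dicts, and the assumption that each encoder's first output is a single `(1, 512)` embedding (consistent with the specifications below) are all placeholders to verify against your own export.

```python
"""Illustrative CLAP reranking sketch; assumptions are noted in comments."""
import numpy as np
import onnxruntime as ort

# Paths are assumptions; point these at your exported model directory.
audio_sess = ort.InferenceSession("onnx_models/clap_audio_encoder.onnx")
text_sess = ort.InferenceSession("onnx_models/clap_text_encoder.onnx")

def embed(session: ort.InferenceSession, feed: dict) -> np.ndarray:
    # Assumes the first output is a (1, 512) embedding; check the real
    # names and shapes with session.get_inputs() / session.get_outputs().
    vec = session.run(None, feed)[0][0]
    return vec / np.linalg.norm(vec)  # unit norm, so dot product == cosine

def pick_best(candidate_feeds: list[dict], text_feed: dict) -> int:
    """Return the index of the candidate most similar to the text prompt.

    candidate_feeds holds one preprocessed input dict per separation
    candidate (audio features built per clap_config.json); text_feed holds
    the RoBERTa-tokenized prompt.
    """
    text_emb = embed(text_sess, text_feed)
    scores = [float(embed(audio_sess, feed) @ text_emb)
              for feed in candidate_feeds]
    return int(np.argmax(scores))
```

Because both embeddings are unit-normalized, the dot product is the cosine similarity, so the winning candidate is simply the one whose audio embedding points closest to the text embedding in the shared 512-dim space.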
### Visual Prompting with SAM3 Mask

```bash
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
  --video input.mp4 \
  --mask object_mask.mp4 \
  --text "" \
  --output isolated.wav \
  --output-video visualization.mp4
```

### Using a Custom Model Directory

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "woman speaking" \
  --model-dir ./my_onnx_models \
  --output separated.wav
```

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **ODE Solver**: Midpoint method (configurable steps, default 16)
- **PEAFrame**: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- **CLAP**: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4

## Exporting Models

Export scripts are in the `onnx_export/` directory.

### Export All Models

```bash
python -m onnx_export.export_all --output_dir ./onnx_models
```

### Export Individual Components

```bash
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
```

### FP16 Quantization (for large models)

For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:

```bash
# Export DiT in FP16 (11.7 GB → 5.9 GB)
python -m onnx_export.export_dit \
  --output-dir ./onnx_models_large_fp16 \
  --model-id facebook/sam-audio-large \
  --fp16 \
  --device cuda
```

The inference script automatically detects FP16 models and handles input conversion.

## Export Scripts Reference

| Script | Description |
|--------|-------------|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).