---
license: other
base_model: facebook/sam-audio-large
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Large)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-large`), Meta's Semantic Audio Modeling system for audio source separation. This repository contains both **FP32** and **FP16** versions of the models.

## Model Variants

| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |

## Model Files (per variant)

| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48 kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48 kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile

# For CUDA support (recommended for the large model):
pip install onnxruntime-gpu
```

## Usage

### Using FP16 Models (Recommended)

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "a person speaking" \
  --model-dir fp16 \
  --output target.wav \
  --output-residual residual.wav
```

### Using FP32 Models

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "keyboard typing" \
  --model-dir fp32 \
  --output target.wav
```

### Audio-Only Mode

```bash
python onnx_inference.py \
  --audio input.wav \
  --text "drums" \
  --model-dir fp16 \
  --output drums.wav
```

## Model Specifications

- **Audio Sample Rate**: 48 kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **DiT Parameters**: ~3 billion
- **ODE Solver**: Midpoint method (default 16 steps)

## Exporting Models

### Export FP16 DiT (Recommended)

```bash
python -m onnx_export.export_dit \
  --output-dir ./my_models \
  --model-id facebook/sam-audio-large \
  --fp16 \
  --device cuda
```

### Export Other Components

```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-large) for the full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).
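## Appendix: Midpoint Solver Sketch

The DiT denoiser is invoked once per velocity evaluation of the ODE solver, which by default is the explicit midpoint method over 16 steps. Below is a minimal, hedged sketch of that integration loop: the velocity function here is an arbitrary Python callable for illustration, whereas in the real pipeline each call would run `dit_single_step.onnx` through an `onnxruntime.InferenceSession` (exact input/output names depend on the export and are not shown here).

```python
import numpy as np

def midpoint_solve(velocity_fn, x0, num_steps=16):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with the
    explicit midpoint method, mirroring the default 16-step schedule.

    velocity_fn: any callable (x, t) -> dx/dt with x a float32 array.
    In this repo's pipeline, it would wrap a single DiT forward pass.
    """
    x = x0.astype(np.float32)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Advance half a step, then take the full step using the
        # velocity evaluated at that midpoint (second-order accurate).
        k = velocity_fn(x, t)
        x_mid = x + 0.5 * dt * k
        x = x + dt * velocity_fn(x_mid, t + 0.5 * dt)
    return x

# Toy check against a known ODE: dx/dt = x, so x(1) = e * x(0).
x1 = midpoint_solve(lambda x, t: x, np.ones(4))
print(float(x1[0]))  # ≈ 2.717 (small second-order error vs e ≈ 2.71828)
```

Note that each additional solver step costs one more full DiT forward pass, which is why the step count (16 by default) directly trades separation quality against inference time.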