SAM-Audio ONNX (Large)
ONNX-converted models for SAM-Audio (facebook/sam-audio-large), Meta's Segment Anything model for audio source separation.
This repository contains both FP32 and FP16 versions of the models.
Model Variants
| Variant | DiT Size | Total Size | Notes |
|---|---|---|---|
| fp32/ | 11.76 GB | ~13.9 GB | Full precision |
| fp16/ | 5.88 GB | ~8.0 GB | Half precision (recommended) |
Model Files (per variant)
| File | Description | FP32 Size | FP16 Size |
|---|---|---|---|
| dacvae_encoder.onnx | Audio encoder (48kHz → latent) | 110 MB | 110 MB |
| dacvae_decoder.onnx | Audio decoder (latent → 48kHz) | 320 MB | 320 MB |
| t5_encoder.onnx | Text encoder (T5-base) | 440 MB | 440 MB |
| dit_single_step.onnx | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| vision_encoder.onnx | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| tokenizer/ | SentencePiece tokenizer files | - | - |
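As a quick sanity check before running inference, the sketch below sums the component sizes from the table and reports which expected entries are absent from a variant directory. The helper names `total_size_gb` and `missing_files` are hypothetical and not part of this repository. It assumes decimal units (110 MB = 0.11 GB) and that each component is stored under exactly the file name listed above.

```python
from pathlib import Path

# Sizes in GB as (FP32, FP16), copied from the table above.
# Assumption: decimal units, so 110 MB is treated as 0.11 GB.
COMPONENT_SIZES_GB = {
    "dacvae_encoder.onnx": (0.11, 0.11),
    "dacvae_decoder.onnx": (0.32, 0.32),
    "t5_encoder.onnx": (0.44, 0.44),
    "dit_single_step.onnx": (11.76, 5.88),
    "vision_encoder.onnx": (1.27, 1.27),
}


def total_size_gb(variant: str) -> float:
    """Sum the listed component sizes for 'fp32' or 'fp16'."""
    column = {"fp32": 0, "fp16": 1}[variant]
    return round(sum(sizes[column] for sizes in COMPONENT_SIZES_GB.values()), 2)


def missing_files(model_dir: str) -> list[str]:
    """Return the expected entries that do not exist inside model_dir."""
    root = Path(model_dir)
    expected = list(COMPONENT_SIZES_GB) + ["tokenizer"]
    return [name for name in expected if not (root / name).exists()]
```

Calling `total_size_gb("fp32")` and `total_size_gb("fp16")` reproduces the approximate totals in the Model Variants table, and `missing_files("fp16")` returns an empty list once the variant is fully downloaded.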
Installation
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile
# For CUDA support (recommended for the large model):
pip install onnxruntime-gpu
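To confirm that the GPU build is actually picked up, a small check like the following can help. It is only a sketch: `pick_providers` is a hypothetical helper, and whether `onnx_inference.py` selects execution providers on its own is not documented here. `onnxruntime.get_available_providers()` is the standard ONNX Runtime call for listing what the installed build supports.

```python
import importlib.util


def pick_providers() -> list[str]:
    """Prefer CUDA when onnxruntime reports it, otherwise fall back to CPU."""
    if importlib.util.find_spec("onnxruntime") is None:
        return []  # onnxruntime is not installed in this environment

    import onnxruntime as ort

    available = ort.get_available_providers()
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    # Keep the preference order, but only return providers that exist here.
    return [name for name in preferred if name in available]
```

If the returned list does not start with `CUDAExecutionProvider` after installing `onnxruntime-gpu`, the CUDA libraries are probably not visible to ONNX Runtime.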
Usage
Using FP16 Models (Recommended)
python onnx_inference.py \
--video input.mp4 \
--text "a person speaking" \
--model-dir fp16 \
--output target.wav \
--output-residual residual.wav
Using FP32 Models
python onnx_inference.py \
--video input.mp4 \
--text "keyboard typing" \
--model-dir fp32 \
--output target.wav
Audio-Only Mode
python onnx_inference.py \
--audio input.wav \
--text "drums" \
--model-dir fp16 \
--output drums.wav
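For batch processing, the commands above can be assembled from Python. This is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, it uses only the flags shown in the examples above, and it assumes that exactly one of `--video` or `--audio` is passed per run.

```python
import sys
from typing import Optional


def build_command(
    text: str,
    model_dir: str = "fp16",
    output: str = "target.wav",
    video: Optional[str] = None,
    audio: Optional[str] = None,
    output_residual: Optional[str] = None,
) -> list[str]:
    """Build an argv list for onnx_inference.py from the flags shown above."""
    # Assumption: the script expects one input source, either video or audio.
    if (video is None) == (audio is None):
        raise ValueError("pass exactly one of video or audio")

    cmd = [sys.executable, "onnx_inference.py"]
    cmd += ["--video", video] if video is not None else ["--audio", audio]
    cmd += ["--text", text, "--model-dir", model_dir, "--output", output]
    if output_residual is not None:
        cmd += ["--output-residual", output_residual]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example in a loop over several text prompts for the same input file.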
Model Specifications
- Audio Sample Rate: 48kHz
- Audio Hop Length: 1536 samples
- Vision Input Size: 336×336 pixels
- Text Encoder: T5-base (768-dim)
- Vision Encoder: PE-Core-L14-336 (1024-dim)
- DiT Parameters: ~3 billion
- ODE Solver: Midpoint method (default 16 steps)
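Two of the figures above can be made concrete with a short sketch. The first helper converts a sample count into a latent frame count using the 1536-sample hop length, assuming the final partial frame is padded. The second shows the generic midpoint rule on an arbitrary velocity function. It illustrates the numerical method only, not the repository's sampling loop: `midpoint_solve`, the `velocity` callable, and the integration interval from 0 to 1 are all assumptions.

```python
import math

SAMPLE_RATE = 48_000  # Hz, from the list above
HOP_LENGTH = 1536     # samples per latent frame, from the list above


def latent_frames(num_samples: int) -> int:
    """Frames needed to cover num_samples, assuming the tail is padded."""
    return math.ceil(num_samples / HOP_LENGTH)


def midpoint_solve(velocity, x0, steps: int = 16, t0: float = 0.0, t1: float = 1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with the midpoint rule.

    Works for floats and for array types that support + and * (e.g. NumPy).
    """
    x, t = x0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        # Half step with the slope at the start of the interval.
        x_mid = x + 0.5 * dt * velocity(x, t)
        # Full step with the slope evaluated at the midpoint.
        x = x + dt * velocity(x_mid, t + 0.5 * dt)
        t += dt
    return x
```

At 48 kHz the hop length works out to 48000 / 1536 = 31.25 latent frames per second of audio. Each midpoint step evaluates the velocity function twice, so if the default of 16 refers to full midpoint steps, the DiT denoiser would run 32 times per separation.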
Exporting Models
Export FP16 DiT (Recommended)
python -m onnx_export.export_dit \
--output-dir ./my_models \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
Export Other Components
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
License
SAM-Audio is released under the CC-BY-NC 4.0 license. See the original repository for the full terms.
Acknowledgments
Original model by Meta AI Research.
Model tree for matbee/sam-audio-large-onnx
Base model
facebook/sam-audio-large