SAM-Audio ONNX (Large)

ONNX-converted models for SAM-Audio (`facebook/sam-audio-large`), Meta's Semantic Audio Modeling for audio source separation.

This repository contains both FP32 and FP16 versions of the models.

Model Variants

| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |

Model Files (per variant)

| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48 kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48 kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |

Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile
# For CUDA support (recommended for the large model):
pip install onnxruntime-gpu
```
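If both `onnxruntime` and `onnxruntime-gpu` are in play, it can help to select an execution provider explicitly rather than relying on defaults. A minimal sketch of that preference logic, assuming the standard ONNX Runtime provider names (the preference order here is a choice, not something the inference script requires):

```python
# Sketch: choose the best available ONNX Runtime execution provider.
# Provider names follow onnxruntime conventions; falling back to CPU
# when CUDA is absent is an assumption about desired behavior.

PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]

def pick_providers(available):
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed, this would feed into session creation:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "fp16/dit_single_step.onnx",
#       providers=pick_providers(ort.get_available_providers()))
print(pick_providers(["CPUExecutionProvider"]))
```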

Usage

Using FP16 Models (Recommended)

```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "a person speaking" \
    --model-dir fp16 \
    --output target.wav \
    --output-residual residual.wav
```

Using FP32 Models

```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "keyboard typing" \
    --model-dir fp32 \
    --output target.wav
```

Audio-Only Mode

```bash
python onnx_inference.py \
    --audio input.wav \
    --text "drums" \
    --model-dir fp16 \
    --output drums.wav
```
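Conceptually, the script wires the component models above into one pipeline. The sketch below shows the stage order implied by the model files; the function names, shapes, and stub implementations are illustrative assumptions, not the real API of `onnx_inference.py`:

```python
# Conceptual pipeline sketch (stage order inferred from the model files;
# all function names here are hypothetical stand-ins, not the real API).

def separate(mixture, prompt, steps=16):
    text_emb = encode_text(prompt)        # t5_encoder.onnx
    latent = encode_audio(mixture)        # dacvae_encoder.onnx
    # dit_single_step.onnx is invoked once per ODE solver evaluation:
    target_latent = denoise(latent, text_emb, steps)
    return decode_audio(target_latent)    # dacvae_decoder.onnx

# Trivial stand-ins so the sketch runs end to end:
encode_text = lambda s: len(s)
encode_audio = lambda x: list(x)
denoise = lambda z, c, n: z
decode_audio = lambda z: z

print(separate([0.0, 0.1], "drums"))
```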

Model Specifications

  • Audio Sample Rate: 48kHz
  • Audio Hop Length: 1536 samples
  • Vision Input Size: 336×336 pixels
  • Text Encoder: T5-base (768-dim)
  • Vision Encoder: PE-Core-L14-336 (1024-dim)
  • DiT Parameters: ~3 billion
  • ODE Solver: Midpoint method (default 16 steps)
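The midpoint (RK2) solver listed above evaluates the velocity field twice per step, so the default 16 steps imply 32 DiT forward passes. A self-contained sketch of the integration scheme, using `dx/dt = x` as a stand-in for the DiT so the result can be checked against `e`:

```python
# Midpoint (RK2) integration from t=0 to t=1 in `steps` uniform steps.
# In SAM-Audio the velocity field would be a DiT forward pass; here a
# simple scalar ODE stands in so the sketch is self-checking.

def midpoint_solve(x0, velocity, steps=16):
    x, t = x0, 0.0
    h = 1.0 / steps
    for _ in range(steps):
        k1 = velocity(t, x)                       # first evaluation
        k2 = velocity(t + h / 2, x + h / 2 * k1)  # second, at the midpoint
        x += h * k2
        t += h
    return x

# dx/dt = x with x(0) = 1 gives x(1) = e ≈ 2.71828
approx = midpoint_solve(1.0, lambda t, x: x, steps=16)
print(approx)
```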

Exporting Models

Export FP16 DiT (Recommended)

```bash
python -m onnx_export.export_dit \
    --output-dir ./my_models \
    --model-id facebook/sam-audio-large \
    --fp16 \
    --device cuda
```

Export Other Components

```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```

License

SAM-Audio is released under the CC-BY-NC 4.0 license. See the original repository for full terms.

Acknowledgments

Original model by Meta AI Research.
