---
license: other
base_model: facebook/sam-audio-large
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---
# SAM-Audio ONNX (Large)
ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (facebook/sam-audio-large), Meta's Semantic Audio Modeling for audio source separation.
This repository contains both **FP32** and **FP16** versions of the models.
## Model Variants
| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |
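The DiT sizes follow directly from parameter count times storage width (4 bytes per weight in FP32, 2 in FP16), which is why halving the precision halves the file. A quick sanity check in Python; the exact parameter count here is an assumption back-derived from the listed 11.76 GB FP32 size:

```python
# Rough sanity check: DiT on-disk size vs. parameter count.
# ~3.16e9 parameters is an assumption inferred from 11.76 GiB / 4 bytes,
# consistent with the "~3 billion" figure below.
GIB = 1024 ** 3

def dit_size_gib(num_params: int, bytes_per_param: int) -> float:
    """On-disk size in GiB for a dense weight file."""
    return num_params * bytes_per_param / GIB

params = 3_156_000_000  # approximate
print(f"fp32: {dit_size_gib(params, 4):.2f} GiB")  # ~11.76
print(f"fp16: {dit_size_gib(params, 2):.2f} GiB")  # ~5.88
```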
## Model Files (per variant)
| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |
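Since a partial download of the multi-gigabyte DiT is easy to end up with, it can help to confirm a variant directory actually contains every component before running inference. A minimal stdlib-only check; the file list mirrors the table above, and the helper name is ours:

```python
from pathlib import Path

# Component files expected in each variant directory (fp16/ or fp32/),
# mirroring the model files table above.
EXPECTED = [
    "dacvae_encoder.onnx",
    "dacvae_decoder.onnx",
    "t5_encoder.onnx",
    "dit_single_step.onnx",
    "vision_encoder.onnx",
    "tokenizer",  # directory with SentencePiece files
]

def missing_components(model_dir: str) -> list[str]:
    """Return the expected files/dirs not present under model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

# e.g. missing_components("fp16") should return [] once the download is complete
```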
## Installation
```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile
# For CUDA support (recommended for large model):
pip install onnxruntime-gpu
```
## Usage
### Using FP16 Models (Recommended)
```bash
python onnx_inference.py \
--video input.mp4 \
--text "a person speaking" \
--model-dir fp16 \
--output target.wav \
--output-residual residual.wav
```
### Using FP32 Models
```bash
python onnx_inference.py \
--video input.mp4 \
--text "keyboard typing" \
--model-dir fp32 \
--output target.wav
```
### Audio-Only Mode
```bash
python onnx_inference.py \
--audio input.wav \
--text "drums" \
--model-dir fp16 \
--output drums.wav
```
## Model Specifications
- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **DiT Parameters**: ~3 billion
- **ODE Solver**: Midpoint method (default 16 steps)
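Two of these numbers are worth unpacking. The 1536-sample hop at 48kHz gives a latent rate of 48000 / 1536 = 31.25 frames per second. And the explicit midpoint method evaluates the velocity field twice per step (once at the interval start, once at the midpoint), so 16 solver steps mean 32 DiT calls. A generic sketch of the scheme on a toy ODE, with `velocity` standing in for the DiT call (treating the DiT as a flow-matching velocity predictor is our reading, not stated above):

```python
def midpoint_solve(velocity, x, t0=0.0, t1=1.0, steps=16):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with the
    explicit midpoint method: two velocity evaluations per step."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = velocity(x, t)                           # slope at interval start
        k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)  # slope at midpoint
        x = x + h * k2
        t = t + h
    return x

# Toy check: dx/dt = x from x(0) = 1 should give x(1) close to e
print(midpoint_solve(lambda x, t: x, 1.0))
```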
## Exporting Models
### Export FP16 DiT (Recommended)
```bash
python -m onnx_export.export_dit \
--output-dir ./my_models \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
```
### Export Other Components
```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```
## License
SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-large) for full terms.
## Acknowledgments
Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).