---
license: cc-by-nc-4.0
base_model: facebook/sam-audio-large
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---
# SAM-Audio ONNX (Large)
ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-large`), Meta's Semantic Audio Modeling framework for audio source separation.
This repository contains both **FP32** and **FP16** versions of the models.
## Model Variants
| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |
## Model Files (per variant)
| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |
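For orientation, each component loads as an independent ONNX Runtime session. The sketch below is an illustrative assumption about the wiring, not code from `onnx_inference.py`; read the tensor names and shapes from each session rather than guessing them.

```python
# Illustrative sketch: open each exported component as its own session,
# preferring CUDA and falling back to CPU.
import onnxruntime as ort

MODEL_DIR = "fp16"  # or "fp32"
PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]

sessions = {
    name: ort.InferenceSession(f"{MODEL_DIR}/{name}.onnx", providers=PROVIDERS)
    for name in ("dacvae_encoder", "dacvae_decoder", "t5_encoder",
                 "dit_single_step", "vision_encoder")
}

# Inspect the exported signatures instead of assuming input names:
for name, sess in sessions.items():
    print(name, [(i.name, i.shape) for i in sess.get_inputs()])
```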
## Installation
```bash
pip install sentencepiece torchaudio torchvision torchcodec soundfile
# CPU-only inference:
pip install onnxruntime
# For CUDA support (recommended for a model this size), install the GPU build
# instead (the CPU and GPU packages conflict if both are installed):
pip install onnxruntime-gpu
```
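After installing, a quick sanity check (not part of the repository's scripts) confirms which execution providers your ONNX Runtime build exposes:

```python
# List the execution providers available to this onnxruntime build.
import onnxruntime as ort

print(ort.get_available_providers())
# A working GPU install reports 'CUDAExecutionProvider' ahead of
# 'CPUExecutionProvider'; the CPU-only build lists only the latter.
```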
## Usage
### Using FP16 Models (Recommended)
```bash
python onnx_inference.py \
--video input.mp4 \
--text "a person speaking" \
--model-dir fp16 \
--output target.wav \
--output-residual residual.wav
```
### Using FP32 Models
```bash
python onnx_inference.py \
--video input.mp4 \
--text "keyboard typing" \
--model-dir fp32 \
--output target.wav
```
### Audio-Only Mode
```bash
python onnx_inference.py \
--audio input.wav \
--text "drums" \
--model-dir fp16 \
--output drums.wav
```
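Inputs should match the model's 48kHz sample rate (see the specifications below). Here is a hedged sketch of that preprocessing, using the `soundfile` and `torchaudio` dependencies from the install step; the hop-alignment padding at the end is an assumption about the encoder, not documented behavior:

```python
# Sketch: load a clip, downmix to mono, and resample to 48 kHz.
import soundfile as sf
import torch
import torchaudio.functional as AF

wav, sr = sf.read("input.wav", dtype="float32")  # (frames,) or (frames, channels)
x = torch.from_numpy(wav)
if x.ndim > 1:
    x = x.mean(dim=-1)                           # downmix to mono
if sr != 48000:
    x = AF.resample(x, orig_freq=sr, new_freq=48000)
# Assumption: pad to a whole number of 1536-sample hops before encoding.
pad = (-x.numel()) % 1536
x = torch.nn.functional.pad(x, (0, pad))
```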
## Model Specifications
- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **DiT Parameters**: ~3 billion
- **ODE Solver**: Midpoint method (default 16 steps)
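For reference, the midpoint (RK2) solver evaluates the velocity field twice per step. Below is a generic sketch of 16-step integration from t=0 to t=1, where `velocity_fn` is a hypothetical wrapper around one call to `dit_single_step.onnx` (conditioning inputs omitted); it is not the repository's solver code.

```python
import numpy as np

def midpoint_solve(velocity_fn, x, num_steps=16):
    """Generic midpoint integration of dx/dt = v(x, t) over t in [0, 1]."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x_mid = x + 0.5 * dt * velocity_fn(x, t)       # half step
        x = x + dt * velocity_fn(x_mid, t + 0.5 * dt)  # slope at the midpoint
    return x

# Toy check with v(x, t) = -x, which should decay x toward zero:
x0 = np.ones(4, dtype=np.float32)
print(midpoint_solve(lambda x, t: -x, x0))
```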
## Exporting Models
### Export FP16 DiT (Recommended)
```bash
python -m onnx_export.export_dit \
--output-dir ./my_models \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
```
### Export Other Components
```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```
## License
SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-large) for the full terms.
## Acknowledgments
Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).