---
license: other
base_model: facebook/sam-audio-large
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Large)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-large`), Meta's Semantic Audio Modeling for audio source separation.

This repository contains both **FP32** and **FP16** versions of the models.

## Model Variants

| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |

## Model Files (per variant)

| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |
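
The encoder's hop length (1536 samples at 48kHz, per the Model Specifications section) determines how many latent frames the DiT processes per clip. A small sketch of that arithmetic; the helper name and the round-up behavior for partial frames are assumptions, since the exact padding depends on the exported encoder:

```python
import math

# Values from the Model Specifications section.
SAMPLE_RATE = 48_000
HOP_LENGTH = 1536

def num_latent_frames(num_samples: int, hop: int = HOP_LENGTH) -> int:
    """Latent frames produced for a waveform of `num_samples` samples.

    Rounding up for a trailing partial frame is an assumption here;
    the real encoder's padding behavior may differ.
    """
    return math.ceil(num_samples / hop)

# One second of 48 kHz audio is 48000 / 1536 = 31.25 hops,
# i.e. 32 frames when rounding up.
```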

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile

# For CUDA support (recommended for the large model):
pip install onnxruntime-gpu
```
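
Before running inference it is worth checking that ONNX Runtime can actually see the GPU. A minimal provider-selection helper; the function name and fallback order are illustrative, not part of this repo's scripts:

```python
def select_providers(available):
    """Pick ONNX Runtime execution providers in preference order.

    `available` is the list returned by onnxruntime.get_available_providers().
    CUDA is preferred for the 5.88 GB FP16 DiT; CPU is the fallback.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]

# Typical use (requires onnxruntime or onnxruntime-gpu installed):
#   import onnxruntime as ort
#   providers = select_providers(ort.get_available_providers())
#   session = ort.InferenceSession("fp16/dit_single_step.onnx", providers=providers)
```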

## Usage

### Using FP16 Models (Recommended)

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "a person speaking" \
  --model-dir fp16 \
  --output target.wav \
  --output-residual residual.wav
```

### Using FP32 Models

```bash
python onnx_inference.py \
  --video input.mp4 \
  --text "keyboard typing" \
  --model-dir fp32 \
  --output target.wav
```

### Audio-Only Mode

```bash
python onnx_inference.py \
  --audio input.wav \
  --text "drums" \
  --model-dir fp16 \
  --output drums.wav
```
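
The script handles decoding internally, but if you preprocess audio yourself, keep the 48kHz sample rate from the Model Specifications in mind. Whether the input must be mono is an assumption here; this hypothetical downmix helper is purely illustrative:

```python
def to_mono(channels):
    """Downmix multi-channel audio to mono by averaging channels.

    `channels` is a list of per-channel sample lists of equal length,
    e.g. [[left samples...], [right samples...]].
    """
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

# Stereo example: left [1, 3] and right [3, 5] average to [2, 4].
```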

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **DiT Parameters**: ~3 billion
- **ODE Solver**: Midpoint method (default 16 steps)
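
The specifications list a midpoint ODE solver with 16 steps: the DiT is evaluated repeatedly to integrate from noise to the separated latent. A generic midpoint integrator, shown on a scalar ODE, captures the numerics (in the real pipeline each `f` evaluation is a run of `dit_single_step.onnx` with its conditioning inputs, not a toy function):

```python
def midpoint_solve(f, y0, t0=0.0, t1=1.0, steps=16):
    """Integrate dy/dt = f(t, y) from t0 to t1 with the midpoint method.

    Each step evaluates f twice: once at (t, y) to estimate the state at
    the interval midpoint, then at that midpoint to take the full step.
    The scheme is second-order accurate.
    """
    h = (t1 - t0) / steps
    y, t = y0, t0
    for _ in range(steps):
        y_mid = y + 0.5 * h * f(t, y)
        y = y + h * f(t + 0.5 * h, y_mid)
        t += h
    return y

# Example: dy/dt = y with y(0) = 1 gives y(1) ≈ e after 16 steps.
```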

## Exporting Models

### Export FP16 DiT (Recommended)

```bash
python -m onnx_export.export_dit \
  --output-dir ./my_models \
  --model-id facebook/sam-audio-large \
  --fp16 \
  --device cuda
```

### Export Other Components

```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-large) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).
|