---
license: other
base_model: facebook/sam-audio-large
tags:
- onnx
- audio
- sam-audio
- source-separation
- audio-visual
---

# SAM-Audio ONNX (Large)

ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-large`), Meta's Semantic Audio Modeling system for audio source separation.

This repository contains both **FP32** and **FP16** versions of the models.

## Model Variants

| Variant | DiT Size | Total Size | Notes |
|---------|----------|------------|-------|
| `fp32/` | 11.76 GB | ~13.9 GB | Full precision |
| `fp16/` | 5.88 GB | ~8.0 GB | Half precision (recommended) |

## Model Files (per variant)

| File | Description | FP32 Size | FP16 Size |
|------|-------------|-----------|-----------|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | 110 MB | 110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | 320 MB | 320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | 440 MB | 440 MB |
| `dit_single_step.onnx` | DiT denoiser (3B params) | 11.76 GB | 5.88 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | 1.27 GB | 1.27 GB |
| `tokenizer/` | SentencePiece tokenizer files | - | - |
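For a rough sense of the latent shapes these components exchange: per the Model Specifications below, the DAC-VAE encoder downsamples the 48 kHz waveform by a 1536-sample hop, i.e. 31.25 latent frames per second. A minimal sketch (the `latent_frames` helper and its floor-division handling of partial frames are illustrative assumptions, not the repository's API):

```python
SAMPLE_RATE = 48_000  # Hz, the model's audio sample rate
HOP_LENGTH = 1_536    # waveform samples per latent frame

def latent_frames(num_samples: int) -> int:
    """Latent sequence length for a waveform of the given sample count."""
    return num_samples // HOP_LENGTH  # assumption: partial frames are dropped

print(SAMPLE_RATE / HOP_LENGTH)         # 31.25 latent frames per second
print(latent_frames(10 * SAMPLE_RATE))  # 312 frames for 10 s of audio
```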

## Installation

```bash
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile
# For CUDA support (recommended for large model):
pip install onnxruntime-gpu
```

## Usage

### Using FP16 Models (Recommended)
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "a person speaking" \
    --model-dir fp16 \
    --output target.wav \
    --output-residual residual.wav
```

### Using FP32 Models
```bash
python onnx_inference.py \
    --video input.mp4 \
    --text "keyboard typing" \
    --model-dir fp32 \
    --output target.wav
```

### Audio-Only Mode
```bash
python onnx_inference.py \
    --audio input.wav \
    --text "drums" \
    --model-dir fp16 \
    --output drums.wav
```

## Model Specifications

- **Audio Sample Rate**: 48kHz
- **Audio Hop Length**: 1536 samples
- **Vision Input Size**: 336×336 pixels
- **Text Encoder**: T5-base (768-dim)
- **Vision Encoder**: PE-Core-L14-336 (1024-dim)
- **DiT Parameters**: ~3 billion
- **ODE Solver**: Midpoint method (default 16 steps)
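The solver in the last item can be sketched concretely. Below is a minimal, generic explicit midpoint integrator (`midpoint_solve` and its signature are illustrative, not the repository's implementation): it is second-order accurate and evaluates the velocity field twice per step, so the default 16 steps cost 32 DiT forward passes.

```python
import math

def midpoint_solve(f, x0, t0=0.0, t1=1.0, steps=16):
    """Integrate dx/dt = f(x, t) from t0 to t1 with the explicit midpoint method."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        x_mid = x + 0.5 * h * f(x, t)      # half Euler step to the interval midpoint
        x = x + h * f(x_mid, t + 0.5 * h)  # full step using the midpoint slope
        t += h
    return x

# Sanity check on dx/dt = x, whose exact solution at t = 1 is e:
approx = midpoint_solve(lambda x, t: x, 1.0)
print(abs(approx - math.e))  # second-order error, roughly 1.5e-3 at 16 steps
```

In the actual pipeline, `f` would be a DiT forward pass over the latent, which is why halving the step count roughly halves inference cost.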

## Exporting Models

### Export FP16 DiT (Recommended)
```bash
python -m onnx_export.export_dit \
    --output-dir ./my_models \
    --model-id facebook/sam-audio-large \
    --fp16 \
    --device cuda
```

### Export Other Components
```bash
python -m onnx_export.export_dacvae --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_t5 --output-dir ./my_models --model-id facebook/sam-audio-large
python -m onnx_export.export_vision --model facebook/sam-audio-large --output ./my_models
```

## License

SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See [original repository](https://huggingface.co/facebook/sam-audio-large) for full terms.

## Acknowledgments

Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).