matbee committed on
Commit
583e52c
·
verified ·
1 Parent(s): 3dc3c73

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ # SAM-Audio ONNX (Large)
+
+ ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (facebook/sam-audio-large), Meta's Semantic Audio Modeling for audio source separation.
+
+ ## Model Files
+
+ | File | Description | Size |
+ |------|-------------|------|
+ | `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
+ | `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
+ | `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
+ | `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
+ | `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
+ | `tokenizer/` | SentencePiece tokenizer files | - |
+
+ ## Installation
+
+ ```bash
+ pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile huggingface_hub
+ # For CUDA support, install the GPU build instead of the CPU package:
+ pip install onnxruntime-gpu
+ ```
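
Since the CPU and CUDA builds expose different execution providers, it can help to choose providers explicitly when creating sessions. The following is a minimal sketch, not part of this repository: `pick_providers` is a hypothetical helper, and the preference order is an assumption.

```python
def pick_providers(available):
    """Return execution providers in preference order, keeping only available ones.

    `available` is a list of provider names, e.g. the result of
    onnxruntime.get_available_providers().
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # assumed order
    chosen = [p for p in preferred if p in available]
    # Fall back to the CPU provider if nothing in the preferred list is present.
    return chosen or ["CPUExecutionProvider"]


# Hard-coded lists for illustration only.
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```

With ONNX Runtime installed, the result can be passed as `providers=` to `onnxruntime.InferenceSession`, using `onnxruntime.get_available_providers()` as the input list.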
+
+ ## Quick Start
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from huggingface_hub import hf_hub_download
+
+ # Download models
+ model_dir = "sam-audio-large-onnx"
+ for f in ["dacvae_encoder.onnx", "dacvae_decoder.onnx", "t5_encoder.onnx",
+           "dit_single_step.onnx", "vision_encoder.onnx"]:
+     hf_hub_download("matbee/sam-audio-large-onnx", f, local_dir=model_dir)
+     if f != "vision_encoder.onnx":  # vision encoder embeds weights
+         hf_hub_download("matbee/sam-audio-large-onnx", f + ".data", local_dir=model_dir)
+ ```
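
Because four of the five models keep their weights in a separate `.data` file, a partial download can leave a graph without its weights. The sketch below checks a local directory for the files the snippet above fetches. `expected_files` and `missing_files` are hypothetical helper names, and the file list is taken from the download loop rather than from the repository's actual contents.

```python
from pathlib import Path

MODELS = [
    "dacvae_encoder.onnx",
    "dacvae_decoder.onnx",
    "t5_encoder.onnx",
    "dit_single_step.onnx",
    "vision_encoder.onnx",
]
# Models whose weights are embedded in the .onnx file (no ".data" sidecar).
EMBEDDED = {"vision_encoder.onnx"}


def expected_files():
    """List every file the download loop fetches, including .data sidecars."""
    files = []
    for name in MODELS:
        files.append(name)
        if name not in EMBEDDED:
            files.append(name + ".data")
    return files


def missing_files(model_dir):
    """Return the expected files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_files() if not (root / name).is_file()]


# An empty list means every expected file is present.
print(missing_files("sam-audio-large-onnx"))
```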
+
+ ## Usage Examples
+
+ ### Audio-Only Separation
+ ```bash
+ python onnx_inference.py \
+     --audio input.wav \
+     --text "a person speaking" \
+     --output separated.wav
+ ```
+
+ ### Video-Guided Separation
+ ```bash
+ python onnx_inference.py \
+     --video input.mp4 \
+     --text "the sound of typing" \
+     --output separated.wav
+ ```
+
+ ### Visual Prompting with SAM3 Mask
+ ```bash
+ # First generate a mask with SAM3 (see generate_sam3_mask.py)
+ python onnx_inference.py \
+     --video input.mp4 \
+     --mask object_mask.mp4 \
+     --text "" \
+     --output isolated.wav \
+     --output-video visualization.mp4
+ ```
+
+ ## Model Details
+
+ - **Audio Sample Rate**: 48kHz
+ - **Audio Hop Length**: 1536 samples
+ - **Vision Input Size**: 336×336 pixels
+ - **Text Encoder**: T5-base (768-dim)
+ - **Vision Encoder**: PE-Core-L14-336 (1024-dim)
+ - **ODE Solver**: Midpoint method (configurable steps)
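
The sample rate and hop length together fix the latent frame rate: 48000 / 1536 = 31.25 frames per second. The sketch below works through that arithmetic. Rounding the final partial hop up is an assumption about how the encoder pads its input, and the function names are illustrative.

```python
import math

SAMPLE_RATE = 48_000  # Hz, from the model details
HOP_LENGTH = 1536     # audio samples per latent frame, from the model details


def latent_frame_rate():
    """Latent frames per second of audio."""
    return SAMPLE_RATE / HOP_LENGTH


def num_latent_frames(num_samples):
    """Frames needed to cover num_samples, assuming the last partial hop is padded."""
    return math.ceil(num_samples / HOP_LENGTH)


print(latent_frame_rate())                  # 31.25
print(num_latent_frames(10 * SAMPLE_RATE))  # 313 frames for 10 seconds
```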
+
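
For readers unfamiliar with the midpoint method named above, here is a generic sketch of the integration loop. It is not the repository's inference code: `velocity` is a hypothetical stand-in for whatever the denoiser predicts, and this README does not state whether `dit_single_step.onnx` returns a velocity or an already-updated latent.

```python
import math


def midpoint_integrate(velocity, x0, num_steps, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with the explicit midpoint rule."""
    dt = (t1 - t0) / num_steps
    x, t = x0, t0
    for _ in range(num_steps):
        k1 = velocity(x, t)                 # slope at the start of the step
        x_mid = x + 0.5 * dt * k1           # half step to the midpoint
        k2 = velocity(x_mid, t + 0.5 * dt)  # slope at the midpoint
        x = x + dt * k2                     # full step using the midpoint slope
        t += dt
    return x


# Toy check on dx/dt = x, whose exact solution at t = 1 is e.
approx = midpoint_integrate(lambda x, t: x, 1.0, num_steps=16)
print(abs(approx - math.e) < 0.01)
```

Each step evaluates `velocity` twice, so under this scheme the number of model calls is twice the configured step count. The same loop works on NumPy arrays, because it uses only addition and scalar multiplication.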
+ ## License
+
+ SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/).
+
+ ## Citation
+
+ ```bibtex
+ @article{samaudio2024,
+   title={SAM-Audio: Semantic Audio Modeling},
+   author={Meta AI},
+   year={2024}
+ }
+ ```
+
+ ## Acknowledgments
+
+ Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).
+ ONNX conversion by [@matbee](https://huggingface.co/matbee).