Mearman committed · Commit 7ff0bec · verified · 1 Parent(s): c5a6ec9

Upload README.md with huggingface_hub

Files changed (1): README.md (+49 -0)
README.md ADDED
---
license: mit
---

# DAC Encoder ONNX (Voice Cloning)

ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).

## Purpose

This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.

## Model Details

- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- **Input**: `(batch, 1, samples)` float32 PCM at 24 kHz, values in [-1, 1]
- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
- **Frame rate**: 75 frames per second (320x downsampling)
- **Codebooks**: 2 quantizers, each with 1024 entries
- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25

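The frame rate, codebook layout, and the 1.5 kbps figure in the weights filename are mutually consistent, which the following sketch checks with plain arithmetic. The constant and function names are illustrative. Rounding the frame count up assumes the encoder pads the input to a whole number of 320-sample hops; this README does not state how partial hops are handled.

```python
import math

SAMPLE_RATE = 24_000    # Hz, from the Input entry
HOP = 320               # samples per frame (320x downsampling)
NUM_CODEBOOKS = 2       # quantizers
CODEBOOK_SIZE = 1024    # entries per quantizer


def frame_rate() -> float:
    """Frames produced per second of audio."""
    return SAMPLE_RATE / HOP


def bitrate_bps() -> float:
    """Bits per second carried by the codes (log2(1024) = 10 bits per code)."""
    return frame_rate() * NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)


def expected_frames(num_samples: int) -> int:
    """Frame count for a clip, ASSUMING partial hops are padded up."""
    return math.ceil(num_samples / HOP)
```

Under these numbers, one second of audio (24000 samples) corresponds to 75 frames, and the code stream carries 1500 bits per second.
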
## Usage

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

# 1 second of placeholder audio at 24 kHz, shaped (batch, 1, samples)
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1

# sess.run returns a list of outputs; the codes are the first entry
codes = sess.run(None, {"audio": audio})
# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames
```

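The random array in the usage snippet only exercises the graph. A real reference clip has to be brought into the documented input layout first. Below is a minimal sketch, assuming the clip is already mono, already sampled at 24 kHz, and held as 16-bit PCM in a NumPy array. `pcm16_to_model_input` is a hypothetical helper, not part of this repository, and resampling is out of scope here.

```python
import numpy as np


def pcm16_to_model_input(pcm: np.ndarray) -> np.ndarray:
    """Convert mono int16 PCM to a float32 array of shape (1, 1, samples)."""
    if pcm.ndim != 1:
        raise ValueError("expected a mono, one-dimensional sample array")
    # Scale the int16 range [-32768, 32767] into [-1, 1)
    audio = pcm.astype(np.float32) / 32768.0
    # Guard against out-of-range values if the input was not true int16
    audio = np.clip(audio, -1.0, 1.0)
    # Add the batch and channel axes the model expects
    return audio[np.newaxis, np.newaxis, :]
```

The returned array can be passed as the `audio` input in place of the random placeholder.
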
## Integration with OuteTTS 1.0

Each code value (0-1023) maps to a special token in the OuteTTS 1.0 vocabulary:
- Codebook 1: `<|c1_0|>` through `<|c1_1023|>`
- Codebook 2: `<|c2_0|>` through `<|c2_1023|>`

These are interleaved per frame to create the speaker conditioning prompt.

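The per-frame interleaving can be sketched as follows. This is an illustration only: `codes_to_tokens` is a hypothetical helper, and the order within a frame (codebook 1 before codebook 2) and the absence of separators are assumptions. The full OuteTTS prompt contains more than these tokens, and its exact layout is not described in this README.

```python
import numpy as np


def codes_to_tokens(codes: np.ndarray) -> list[str]:
    """Interleave a (2, frames) code array into per-frame token strings."""
    if codes.ndim != 2 or codes.shape[0] != 2:
        raise ValueError("expected codes with shape (2, frames)")
    tokens = []
    for c1, c2 in zip(codes[0], codes[1]):
        # Assumed order: codebook 1 token, then codebook 2 token, per frame
        tokens.append(f"<|c1_{int(c1)}|>")
        tokens.append(f"<|c2_{int(c2)}|>")
    return tokens
```

For the encoder output of a single clip, the batch axis would be dropped first, for example `codes_to_tokens(codes[0][0])` with `codes` as returned by `sess.run`.
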
## License

MIT (same as DAC)