| --- |
| license: mit |
| --- |
| |
| # DAC Encoder ONNX (Voice Cloning) |
|
|
| ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0). |
|
|
| ## Purpose |
|
|
| This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete |
| codebook indices that condition the OuteTTS model during synthesis. |
|
|
| ## Model Details |
|
|
| - **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0` |
| - **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params) |
| - **Input**: `(batch, 1, samples)` float32 PCM at 24kHz, values in [-1, 1] |
| - **Output**: `(batch, 2, frames)` int64 codes, values 0-1023 |
| - **Frame rate**: 75 frames per second (320x downsampling) |
| - **Codebooks**: 2 quantizers, each with 1024 entries |
| - **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25 |
|
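The frame rate and the 1.5 kbps in the checkpoint name both follow from the figures listed above. The following is a minimal sketch of that arithmetic. The helper names are illustrative, and the ceiling used for lengths that are not a multiple of 320 is an assumption about padding that this export may not share.

```python
import math

SAMPLE_RATE = 24_000   # Hz, from the input spec above
HOP = 320              # input samples per output frame (320x downsampling)
NUM_CODEBOOKS = 2
CODEBOOK_SIZE = 1024


def frame_rate() -> float:
    """Frames produced per second of audio."""
    return SAMPLE_RATE / HOP


def bitrate_bps() -> float:
    """Bits per second carried by the code stream."""
    bits_per_code = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return NUM_CODEBOOKS * bits_per_code * frame_rate()


def expected_frames(num_samples: int) -> int:
    """Frame count for a clip of num_samples samples.

    Assumption: a trailing partial frame is padded up to a full one.
    For lengths that are exact multiples of HOP this does not matter.
    """
    return math.ceil(num_samples / HOP)


print(frame_rate(), bitrate_bps(), expected_frames(24_000))  # prints 75.0 1500.0 75
```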
|
| ## Usage |
|
|
| ```python |
| import onnxruntime as ort |
| import numpy as np |
| |
| sess = ort.InferenceSession("dac_encoder_24khz.onnx") |
| |
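| # 1 second of placeholder audio at 24kHz (low-amplitude noise, well within [-1, 1]) |
| audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1 |
| |
| # sess.run returns a list of outputs; the codes are the first entry |
| codes = sess.run(None, {"audio": audio})[0] |
| # codes.shape == (1, 2, 75): 2 codebooks, 75 frames |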
| ``` |
|
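Real reference audio usually arrives as 16-bit integer PCM rather than float32. The sketch below shows one way to bring such a buffer into the `(batch, 1, samples)` float32 layout described under Model Details. The function name `pcm16_to_model_input` is hypothetical, and the sketch assumes the audio is already sampled at 24kHz; resampling is not handled here.

```python
import numpy as np


def pcm16_to_model_input(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM of shape (samples,) or (samples, channels)
    into a float32 array of shape (1, 1, samples) with values in [-1, 1).

    Assumption: the audio is already at 24 kHz.
    """
    if pcm.dtype != np.int16:
        raise ValueError("expected int16 PCM")
    x = pcm.astype(np.float32) / 32768.0  # scale integers to [-1, 1)
    if x.ndim == 2:
        x = x.mean(axis=1)                # downmix multi-channel audio to mono
    return x.reshape(1, 1, -1).astype(np.float32)


# Three mono samples: silence, half scale, negative full scale
example = pcm16_to_model_input(np.array([0, 16384, -32768], dtype=np.int16))
print(example.shape, example.dtype)
```

The result can be passed as the `"audio"` input in the usage example above.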
|
| ## Integration with OuteTTS 1.0 |
|
|
| The codes map to special tokens in the OuteTTS 1.0 vocabulary: |
| - Codebook 1: `<|c1_0|>` through `<|c1_1023|>` |
| - Codebook 2: `<|c2_0|>` through `<|c2_1023|>` |
|
|
| These are interleaved per frame to create the speaker conditioning prompt. |
|
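The per-frame interleaving can be sketched as follows. This is an illustration under stated assumptions, not the OuteTTS implementation: the helper name `codes_to_tokens` is hypothetical, placing the codebook 1 token before the codebook 2 token within each frame is assumed, and a complete OuteTTS prompt contains additional structure that is not shown here.

```python
from typing import List, Sequence


def codes_to_tokens(c1: Sequence[int], c2: Sequence[int]) -> List[str]:
    """Interleave two per-frame code sequences into special-token strings.

    Assumption: within each frame the codebook 1 token comes first.
    """
    if len(c1) != len(c2):
        raise ValueError("both codebooks must have the same number of frames")
    tokens: List[str] = []
    for a, b in zip(c1, c2):
        tokens.append(f"<|c1_{a}|>")
        tokens.append(f"<|c2_{b}|>")
    return tokens


# With an encoder output array `arr` of shape (1, 2, frames), the inputs
# would be arr[0, 0].tolist() and arr[0, 1].tolist().
print("".join(codes_to_tokens([12, 7], [900, 3])))
# prints <|c1_12|><|c2_900|><|c1_7|><|c2_3|>
```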
|
| ## License |
|
|
| MIT (same as DAC) |
|
|