---
license: mit
---

# DAC Encoder ONNX (Voice Cloning)

ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).

## Purpose

Used for on-device voice cloning with OuteTTS 1.0. The model encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.

## Model Details

- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- **Input**: `(batch, 1, samples)` float32 PCM at 24 kHz, values in [-1, 1]
- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
- **Frame rate**: 75 frames per second (320x downsampling of 24,000 samples per second)
- **Codebooks**: 2 quantizers, each with 1024 entries
- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25

## Usage

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

# 1 second of placeholder audio at 24 kHz, shaped (batch, 1, samples)
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1

# sess.run returns a list of outputs; the codes are the first entry
codes = sess.run(None, {"audio": audio})
# codes[0].shape == (1, 2, 75): 2 codebooks, 75 frames
```

## Integration with OuteTTS 1.0

The codes map to special tokens in the OuteTTS 1.0 vocabulary:

- Codebook 1: `<|c1_0|>` through `<|c1_1023|>`
- Codebook 2: `<|c2_0|>` through `<|c2_1023|>`

These are interleaved per frame to create the speaker conditioning prompt.

## License

MIT (same as DAC)
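
## Example: Interleaving Codes into Tokens

The per-frame interleaving described under "Integration with OuteTTS 1.0" can be sketched as follows. This is a minimal illustration, not the OuteTTS implementation: the helper name `codes_to_tokens` is hypothetical, and the ordering (codebook 1 token first, then codebook 2, for each frame) is an assumption. Any surrounding prompt markup that OuteTTS expects is not covered here.

```python
def codes_to_tokens(codes):
    """Interleave two codebooks frame by frame into token strings.

    `codes` is a (2, frames) nested sequence or array of integer codes.
    Hypothetical helper; the c1-then-c2 ordering per frame is an assumption.
    """
    if len(codes) != 2:
        raise ValueError("expected codes with shape (2, frames)")
    tokens = []
    for c1, c2 in zip(codes[0], codes[1]):
        c1, c2 = int(c1), int(c2)
        # Each codebook has 1024 entries, so valid indices are 0..1023.
        if not (0 <= c1 <= 1023 and 0 <= c2 <= 1023):
            raise ValueError("code values must be in the range 0-1023")
        tokens.append(f"<|c1_{c1}|>")
        tokens.append(f"<|c2_{c2}|>")
    return tokens


# Three frames of made-up codes, for illustration only
example_codes = [[5, 1023, 0], [17, 42, 7]]
tokens = codes_to_tokens(example_codes)
print("".join(tokens))
```

With the encoder output from the usage example, the first batch item can be passed as `codes_to_tokens(codes[0][0])`, since `codes[0]` has shape `(1, 2, frames)`.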