DAC Encoder ONNX (Voice Cloning)
ONNX export of the encoder + quantizer from ibm-research/DAC.speech.v1.0.
Purpose
Used for on-device voice cloning with OuteTTS 1.0. Encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.
Model Details
- Source: weights_24khz_1.5kbps_v1.0.pth from ibm-research/DAC.speech.v1.0
- Components: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- Input: (batch, 1, samples) float32 PCM at 24kHz, values in [-1, 1]
- Output: (batch, 2, frames) int64 codes, values 0-1023
- Frame rate: 75 frames per second (320x downsampling)
- Codebooks: 2 quantizers, each with 1024 entries
- Export: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
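The figures in the list above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is illustrative only: the constant and function names are made up for this example, and the rounding in `expected_frames` assumes that a partial trailing window still yields one frame, which is not stated in this card.

```python
import math

SAMPLE_RATE = 24_000    # Hz, from "Input"
HOP_LENGTH = 320        # samples per frame, from "320x downsampling"
NUM_CODEBOOKS = 2       # from "Codebooks"
CODEBOOK_SIZE = 1024    # entries per codebook


def frames_per_second() -> float:
    """Frame rate implied by the sample rate and the downsampling factor."""
    return SAMPLE_RATE / HOP_LENGTH


def bitrate_bps() -> float:
    """Bits per second: frames/s * codebooks * bits per code."""
    bits_per_code = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return frames_per_second() * NUM_CODEBOOKS * bits_per_code


def expected_frames(num_samples: int) -> int:
    """Estimated frame count for a clip.

    Assumption (hypothetical): the encoder pads the input up to a
    multiple of HOP_LENGTH, so a partial window counts as one frame.
    """
    return math.ceil(num_samples / HOP_LENGTH)


print(frames_per_second())      # 75.0
print(bitrate_bps())            # 1500.0, i.e. the 1.5 kbps in the weights filename
print(expected_frames(24_000))  # 75
```

For clips whose length is not a multiple of 320 samples, treat `expected_frames` as an estimate and check it against the actual output shape.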
Usage
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("dac_encoder_24khz.onnx")
# 1 second of audio at 24kHz
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1
codes = sess.run(None, {"audio": audio})
# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames
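The snippet above feeds random noise to the encoder. For a real reference clip, the samples first need to be brought into the (batch, 1, samples) float32 layout with values in [-1, 1]. A minimal sketch of that step follows, assuming the clip is already sampled at 24kHz; `prepare_audio` is a hypothetical helper, not part of this repository, and resampling is deliberately left out.

```python
import numpy as np


def prepare_audio(pcm: np.ndarray) -> np.ndarray:
    """Convert PCM samples to the (1, 1, samples) float32 input layout.

    Accepts int16 or float data shaped (samples,) or (samples, channels).
    Assumption: the input is already at 24kHz; no resampling is done here.
    """
    if pcm.dtype == np.int16:
        audio = pcm.astype(np.float32) / 32768.0  # scale int16 to [-1, 1)
    else:
        audio = pcm.astype(np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)                # downmix channels to mono
    audio = np.clip(audio, -1.0, 1.0)             # enforce the documented range
    return audio[np.newaxis, np.newaxis, :]       # add batch and channel axes


# Example: one second of silent stereo int16 audio.
stereo = np.zeros((24000, 2), dtype=np.int16)
print(prepare_audio(stereo).shape)  # (1, 1, 24000)
```

The result can be passed as the "audio" input in the `sess.run` call shown above.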
Integration with OuteTTS 1.0
The codes map to special tokens in the OuteTTS 1.0 vocabulary:
- Codebook 1: <|c1_0|> through <|c1_1024|>
- Codebook 2: <|c2_0|> through <|c2_1024|>
These are interleaved per frame to create the speaker conditioning prompt.
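As a rough illustration of the interleaving, the sketch below turns one batch item of codes into a token string. It assumes that within each frame the codebook 1 token comes before the codebook 2 token; this card does not state the order, so check it against the OuteTTS prompt format. `codes_to_tokens` is a hypothetical helper, and any surrounding prompt structure that OuteTTS expects is not shown.

```python
from typing import Sequence


def codes_to_tokens(codes: Sequence[Sequence[int]]) -> str:
    """Interleave two codebooks frame by frame into a token string.

    `codes` is a single batch item shaped (2, frames), for example
    `sess.run(...)[0][0]` from the usage snippet.
    Assumption: codebook 1 precedes codebook 2 within each frame.
    """
    c1_row, c2_row = codes
    parts = []
    for c1, c2 in zip(c1_row, c2_row):
        # int() also accepts NumPy integer scalars from the ONNX output
        parts.append(f"<|c1_{int(c1)}|><|c2_{int(c2)}|>")
    return "".join(parts)


# Two frames: (c1=5, c2=900) followed by (c1=17, c2=3)
print(codes_to_tokens([[5, 17], [900, 3]]))
# <|c1_5|><|c2_900|><|c1_17|><|c2_3|>
```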
License
MIT (same as DAC)