DAC Encoder ONNX (Voice Cloning)

ONNX export of the encoder + quantizer from ibm-research/DAC.speech.v1.0.

Purpose

Used for on-device voice cloning with OuteTTS 1.0. Encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.

Model Details

Source: weights_24khz_1.5kbps_v1.0.pth from ibm-research/DAC.speech.v1.0
Components: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
Input: (batch, 1, samples) float32 PCM at 24kHz, values in [-1, 1]
Output: (batch, 2, frames) int64 codes, values 0-1023
Frame rate: 75 frames per second (320x downsampling)
Codebooks: 2 quantizers, each with 1024 entries
Export: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
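
The figures above are consistent with each other, which is a handy sanity check when wiring the model up. A small sketch of the arithmetic, using only the numbers listed in this section (the constant names are illustrative, not part of the model):

```python
import math

SAMPLE_RATE = 24000    # Hz, from "Input"
DOWNSAMPLING = 320     # samples per frame, from "Frame rate"
NUM_CODEBOOKS = 2      # from "Codebooks"
CODEBOOK_SIZE = 1024   # entries per codebook

# One frame is produced for every 320 input samples.
frame_rate = SAMPLE_RATE // DOWNSAMPLING

# Each code selects one of 1024 entries, i.e. log2(1024) bits.
bits_per_code = int(math.log2(CODEBOOK_SIZE))

# Bits per second across both codebooks.
bitrate_bps = frame_rate * NUM_CODEBOOKS * bits_per_code

print(frame_rate, bits_per_code, bitrate_bps)
```

The resulting bitrate matches the "1.5kbps" in the source checkpoint's filename.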

Usage

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

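# 1 second of placeholder noise at 24 kHz; replace with real reference audio
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1

# sess.run returns a list of outputs; the codes are the first entry
codes = sess.run(None, {"audio": audio})[0]
# codes.shape = (1, 2, 75): 2 codebooks, 75 frames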
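
Real reference audio usually needs to be brought into the input format listed under Model Details before it can be encoded. The following is a minimal sketch, not part of this repository: `prepare_audio` is a hypothetical helper, it assumes multi-channel input is laid out as (samples, channels), and it resamples with plain linear interpolation, which does no anti-aliasing. A dedicated resampler is likely a better choice for production use.

```python
import numpy as np

TARGET_SR = 24000  # sample rate the encoder expects

def prepare_audio(samples, sample_rate):
    """Hypothetical helper: return float32 audio shaped (1, 1, n) at 24 kHz."""
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:
        # Assumption: layout is (samples, channels); average down to mono.
        x = x.mean(axis=1)
    if sample_rate != TARGET_SR:
        # Crude linear-interpolation resampling (no low-pass filtering).
        n_out = int(round(len(x) * TARGET_SR / sample_rate))
        t_in = np.arange(len(x)) / sample_rate
        t_out = np.arange(n_out) / TARGET_SR
        x = np.interp(t_out, t_in, x)
    # Keep values inside the documented [-1, 1] range.
    x = np.clip(x, -1.0, 1.0).astype(np.float32)
    # Add batch and channel dimensions: (batch, 1, samples).
    return x[np.newaxis, np.newaxis, :]
```

The returned array can be passed as the "audio" input in the call shown above.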

Integration with OuteTTS 1.0

The codes map to special tokens in the OuteTTS 1.0 vocabulary:

Codebook 1: <|c1_0|> through <|c1_1024|>
Codebook 2: <|c2_0|> through <|c2_1024|>

These are interleaved per frame to create the speaker conditioning prompt.
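
As an illustration of the interleaving, the sketch below turns the codes for one utterance into a flat token sequence. `codes_to_tokens` is a hypothetical helper, and placing the codebook 1 token before the codebook 2 token within each frame is an assumption. It covers only the code tokens; any other structure the OuteTTS 1.0 prompt requires is outside its scope.

```python
def codes_to_tokens(codes):
    """Hypothetical helper: codes is a pair of equal-length int sequences
    (codebook 1, codebook 2) for a single utterance."""
    c1, c2 = codes
    if len(c1) != len(c2):
        raise ValueError("both codebooks must have the same number of frames")
    tokens = []
    for a, b in zip(c1, c2):
        # Assumed order within a frame: codebook 1, then codebook 2.
        tokens.append(f"<|c1_{int(a)}|>")
        tokens.append(f"<|c2_{int(b)}|>")
    return tokens

# Three made-up frames, for illustration only.
example_codes = [[5, 17, 1023], [0, 42, 7]]
prompt_fragment = "".join(codes_to_tokens(example_codes))
print(prompt_fragment)
```

With the encoder output, pass the (2, frames) slice for one batch item, converted to nested lists with `.tolist()`.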

License

MIT (same as DAC)
