DAC Encoder ONNX (Voice Cloning)

ONNX export of the encoder + quantizer from ibm-research/DAC.speech.v1.0.

Purpose

This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.

Model Details

  • Source: weights_24khz_1.5kbps_v1.0.pth from ibm-research/DAC.speech.v1.0
  • Components: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
  • Input: (batch, 1, samples) float32 PCM at 24kHz, values in [-1, 1]
  • Output: (batch, 2, frames) int64 codes, values 0-1023
  • Frame rate: 75 frames per second (320x downsampling)
  • Codebooks: 2 quantizers, each with 1024 entries
  • Export: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
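
The figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below derives the frame rate and nominal bitrate from the sample rate, downsampling factor, and codebook sizes listed in this card. The `expected_frames` helper is hypothetical and assumes the encoder pads the input up to a multiple of 320 samples (ceiling division); that padding behavior is not stated in this card, so check it against the actual output shape for lengths that are not a multiple of 320.

```python
SAMPLE_RATE = 24_000    # Hz, from the model details
HOP = 320               # downsampling factor, from the model details
NUM_CODEBOOKS = 2
CODEBOOK_SIZE = 1024

# Frames produced per second of audio.
frame_rate = SAMPLE_RATE // HOP

# Bits needed to store one index in the range 0..CODEBOOK_SIZE-1.
bits_per_code = (CODEBOOK_SIZE - 1).bit_length()

# Nominal bitrate of the code stream in bits per second.
bitrate_bps = frame_rate * NUM_CODEBOOKS * bits_per_code


def expected_frames(num_samples: int) -> int:
    """Estimate the frame count for a clip of num_samples samples.

    ASSUMPTION: the input is padded up to a multiple of HOP, so this
    is a ceiling division. Not confirmed by the model card.
    """
    return -(-num_samples // HOP)


print(frame_rate, bits_per_code, bitrate_bps, expected_frames(SAMPLE_RATE))
```

The resulting bitrate matches the 1.5kbps label in the source checkpoint filename.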

Usage

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

# 1 second of audio at 24kHz
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1

codes = sess.run(None, {"audio": audio})
# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames
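
The usage example feeds random noise; real reference audio has to be brought into the input format first. The following is a minimal sketch, assuming the audio is already mono, already sampled at 24 kHz, and stored as 16-bit PCM. The helper name `pcm16_to_model_input` is hypothetical and not part of this repository. Resampling and channel mixing are out of scope here.

```python
import numpy as np


def pcm16_to_model_input(pcm: np.ndarray) -> np.ndarray:
    """Convert mono int16 PCM of shape (samples,) to float32 (1, 1, samples).

    Hypothetical helper: assumes the audio is already mono at 24 kHz.
    """
    if pcm.ndim != 1:
        raise ValueError("expected a 1-D array of mono samples")
    # Scale the int16 range [-32768, 32767] into [-1, 1).
    audio = pcm.astype(np.float32) / 32768.0
    # Add the batch and channel dimensions the encoder expects.
    return audio[np.newaxis, np.newaxis, :]


example = pcm16_to_model_input(np.array([0, 16384, -32768], dtype=np.int16))
print(example.shape, example.dtype)
```

The returned array can be passed as the "audio" input in the session call above.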

Integration with OuteTTS 1.0

The codes map to special tokens in the OuteTTS 1.0 vocabulary:

  • Codebook 1: <|c1_0|> through <|c1_1023|> (one token per code index 0-1023)
  • Codebook 2: <|c2_0|> through <|c2_1023|> (one token per code index 0-1023)

These are interleaved per frame to create the speaker conditioning prompt.
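
As an illustration of the per-frame interleaving, here is a minimal sketch that turns one clip's codes into a flat token sequence. The function name `codes_to_tokens` is hypothetical, and the ordering (codebook 1 token, then codebook 2 token, for each frame) is an assumption. The full OuteTTS 1.0 speaker prompt may wrap these tokens in additional structure that this card does not describe.

```python
import numpy as np


def codes_to_tokens(codes: np.ndarray) -> list[str]:
    """Interleave codes of shape (2, frames) into c1/c2 token strings.

    Hypothetical helper. ASSUMPTION: each frame contributes its
    codebook 1 token followed by its codebook 2 token.
    """
    if codes.ndim != 2 or codes.shape[0] != 2:
        raise ValueError("expected codes with shape (2, frames)")
    tokens = []
    for c1, c2 in zip(codes[0], codes[1]):
        tokens.append(f"<|c1_{int(c1)}|>")
        tokens.append(f"<|c2_{int(c2)}|>")
    return tokens


# Two frames: codebook 1 holds [1, 2], codebook 2 holds [3, 4].
demo_tokens = codes_to_tokens(np.array([[1, 2], [3, 4]], dtype=np.int64))
print("".join(demo_tokens))
```

With the encoder output from the usage section, the call would be `codes_to_tokens(codes[0][0])`, which selects the first output tensor and the first batch item.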

License

MIT (same as DAC)
