Mearman / dac-encoder-onnx

	---
	license: mit
	---

	# DAC Encoder ONNX (Voice Cloning)

	ONNX export of the encoder + quantizer from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).

	## Purpose

	This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete
	codebook indices that condition the OuteTTS model during synthesis.

	## Model Details

	- Source: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
	- Components: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
	- Input: `(batch, 1, samples)` float32 PCM at 24 kHz, values in [-1, 1]
	- Output: `(batch, 2, frames)` int64 codes, values 0-1023
	- Frame rate: 75 frames per second (320x downsampling)
	- Codebooks: 2 quantizers, each with 1024 entries
	- Export: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
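The frame rate and the 1.5 kbps figure in the checkpoint name both follow from the numbers above. A small sketch of the arithmetic, assuming the encoder rounds partial frames up (the padding behaviour is an assumption, not something this card states) and using `frames_for` as a hypothetical helper name:

```python
import math

SAMPLE_RATE = 24_000   # Hz, from the input spec
HOP = 320              # samples per frame (320x downsampling)
NUM_CODEBOOKS = 2
CODEBOOK_SIZE = 1024

def frames_for(num_samples: int) -> int:
    """Estimate the number of code frames for a clip (assumes round-up padding)."""
    return math.ceil(num_samples / HOP)

frame_rate = SAMPLE_RATE // HOP                       # frames per second
bits_per_code = (CODEBOOK_SIZE - 1).bit_length()      # bits needed per index
bitrate_bps = frame_rate * NUM_CODEBOOKS * bits_per_code

print(frame_rate, bits_per_code, bitrate_bps, frames_for(24_000))
```

With these values, one second of audio yields 75 frames, and 75 frames × 2 codebooks × 10 bits gives 1500 bit/s, matching the 1.5 kbps checkpoint.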

	## Usage

	```python
	import onnxruntime as ort
	import numpy as np

	sess = ort.InferenceSession("dac_encoder_24khz.onnx")

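	# 1 second of placeholder audio at 24 kHz, clipped to the expected [-1, 1] range
	audio = np.clip(np.random.randn(1, 1, 24000) * 0.1, -1.0, 1.0).astype(np.float32)

	# sess.run returns a list of outputs; the codes are the first entry
	codes = sess.run(None, {"audio": audio})
	# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames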
	```
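Real reference audio usually arrives as 16-bit PCM rather than float noise. A minimal sketch of shaping it into the input layout described above, assuming the clip is already sampled at 24 kHz (no resampling is done here); `to_model_input` is a hypothetical helper name, not part of this repository:

```python
import numpy as np

def to_model_input(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM of shape (samples,) or (samples, channels)
    to float32 of shape (1, 1, samples) with values in [-1, 1]."""
    audio = pcm.astype(np.float32) / 32768.0   # int16 full scale -> [-1, 1)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)             # downmix channels to mono
    return audio.reshape(1, 1, -1)             # add batch and channel axes

# Example: a silent stereo clip of 24000 samples (assumed to be 24 kHz already)
pcm = np.zeros((24000, 2), dtype=np.int16)
audio = to_model_input(pcm)
```

The resulting array can be passed as the `audio` input in the same way as the random placeholder in the usage example.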

	## Integration with OuteTTS 1.0

	The codes map to special tokens in the OuteTTS 1.0 vocabulary:
	- Codebook 1: `<|c1_0|>` through `<|c1_1024|>`
	- Codebook 2: `<|c2_0|>` through `<|c2_1024|>`

	These are interleaved per frame to create the speaker conditioning prompt. Because the encoder emits indices 0-1023, the `_1024` tokens are never produced by this model.
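A minimal sketch of that interleaving, assuming the codebook 1 token precedes the codebook 2 token within each frame and that tokens are concatenated without separators. The exact OuteTTS prompt layout is not specified in this card, and `codes_to_tokens` is a hypothetical helper:

```python
def codes_to_tokens(codes):
    """Interleave two codebook sequences into one token string.

    codes: a pair of equal-length integer sequences (codebook 1, codebook 2)
    for a single clip, e.g. a (2, frames) array.
    """
    c1, c2 = codes
    if len(c1) != len(c2):
        raise ValueError("both codebooks must have the same number of frames")
    # Assumed ordering: codebook 1 token, then codebook 2 token, per frame
    return "".join(f"<|c1_{int(a)}|><|c2_{int(b)}|>" for a, b in zip(c1, c2))

prompt = codes_to_tokens([[5, 17], [900, 3]])
print(prompt)
```

For the ONNX output in the usage example, `codes[0][0]` has shape `(2, frames)` and can be passed to this helper directly.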

	## License

	MIT (same as DAC)