Mearman/dac-encoder-onnx

Mearman committed 26 days ago

Commit

7ff0bec

1 Parent(s): c5a6ec9

Upload README.md with huggingface_hub

Files changed (1)

README.md +49 -0

README.md ADDED

	@@ -0,0 +1,49 @@

+---
+license: mit
+---
+# DAC Encoder ONNX (Voice Cloning)
+ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).
+## Purpose
+Used for on-device voice cloning with OuteTTS 1.0. Encodes reference audio into discrete codebook indices
+that condition the OuteTTS model during synthesis.
+## Model Details
+- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
+- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
+- **Input**: `(batch, 1, samples)` float32 PCM at 24kHz, values in [-1, 1]
+- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
+- **Frame rate**: 75 frames per second (320x downsampling)
+- **Codebooks**: 2 quantizers, each with 1024 entries
+- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
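The figures in the list above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below derives the frame rate and the nominal bitrate (matching the `1.5kbps` in the weights filename) from the sample rate, downsampling factor, and codebook layout. The `expected_frames` helper is hypothetical and assumes that a trailing partial hop is padded up to a full frame; the README does not say how the exported graph treats inputs whose length is not a multiple of 320.

```python
import math

# Values taken from the model card.
SAMPLE_RATE = 24_000    # Hz
HOP_LENGTH = 320        # downsampling factor
NUM_CODEBOOKS = 2
CODEBOOK_SIZE = 1024


def frame_rate() -> float:
    """Frames per second produced by the encoder."""
    return SAMPLE_RATE / HOP_LENGTH


def bitrate_bps() -> float:
    """Nominal bitrate: codebooks x bits per code x frames per second."""
    bits_per_code = math.log2(CODEBOOK_SIZE)
    return NUM_CODEBOOKS * bits_per_code * frame_rate()


def expected_frames(num_samples: int) -> int:
    """Hypothetical helper: assumes a partial final hop is padded to a full frame."""
    return math.ceil(num_samples / HOP_LENGTH)


print(frame_rate())             # 75.0
print(bitrate_bps())            # 1500.0
print(expected_frames(24_000))  # 75
```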
+## Usage
+```python
+import onnxruntime as ort
+import numpy as np
+sess = ort.InferenceSession("dac_encoder_24khz.onnx")
+# 1 second of audio at 24kHz
+audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1
+# sess.run returns a list of outputs; the codes are the first entry
+codes = sess.run(None, {"audio": audio})[0]
+# codes.shape = (1, 2, 75): 2 codebooks, 75 frames
+```
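Real reference audio usually arrives as 16-bit PCM rather than as random floats. The following is a minimal sketch of converting such samples into the `(batch, 1, samples)` float32 layout described under Model Details. The function name `pcm16_to_model_input` is hypothetical, and the sketch only rejects audio that is not already at 24 kHz instead of resampling it; resampling would need an additional library.

```python
import numpy as np

TARGET_SAMPLE_RATE = 24_000  # Hz, from the model card


def pcm16_to_model_input(pcm: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert int16 PCM to a (1, 1, samples) float32 array in [-1, 1].

    Hypothetical helper: accepts mono (samples,) or interleaved
    (samples, channels) input and does not resample.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    pcm = np.asarray(pcm)
    if pcm.dtype != np.int16:
        raise TypeError(f"expected int16 PCM, got {pcm.dtype}")
    if pcm.ndim == 1:
        mono = pcm.astype(np.float32)
    elif pcm.ndim == 2:
        # Downmix (samples, channels) to mono by averaging the channels.
        mono = pcm.astype(np.float32).mean(axis=1)
    else:
        raise ValueError("expected a 1-D or 2-D PCM array")
    # int16 spans -32768..32767, so dividing by 32768 keeps values in [-1, 1).
    audio = mono / 32768.0
    return audio.reshape(1, 1, -1).astype(np.float32)
```

The returned array can be passed as the `audio` input in the session call shown above.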
+## Integration with OuteTTS 1.0
+The codes map to special tokens in the OuteTTS 1.0 vocabulary:
+- Codebook 1: `<|c1_0|>` through `<|c1_1023|>`
+- Codebook 2: `<|c2_0|>` through `<|c2_1023|>`
+These are interleaved per frame to create the speaker conditioning prompt.
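The per-frame interleaving can be sketched as follows. Only the token name pattern comes from this README; the function name `codes_to_tokens`, the order (codebook 1 before codebook 2 within each frame), and the absence of separators are assumptions, and the real OuteTTS 1.0 prompt may wrap these tokens in additional structure.

```python
from typing import Sequence

CODEBOOK_SIZE = 1024  # code values are 0-1023 per the Output spec


def codes_to_tokens(c1: Sequence[int], c2: Sequence[int]) -> list[str]:
    """Interleave two codebook streams frame by frame.

    Hypothetical helper: emits the codebook 1 token, then the
    codebook 2 token, for each frame in order.
    """
    if len(c1) != len(c2):
        raise ValueError("both codebooks must have the same number of frames")
    tokens: list[str] = []
    for a, b in zip(c1, c2):
        for value in (a, b):
            if not 0 <= value < CODEBOOK_SIZE:
                raise ValueError(f"code {value} is outside 0-{CODEBOOK_SIZE - 1}")
        tokens.append(f"<|c1_{a}|>")
        tokens.append(f"<|c2_{b}|>")
    return tokens


print("".join(codes_to_tokens([5, 17], [900, 3])))
# <|c1_5|><|c2_900|><|c1_17|><|c2_3|>
```

For a `(1, 2, frames)` array `arr` of encoder codes, the two streams would be `arr[0, 0].tolist()` and `arr[0, 1].tolist()`.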
+## License
+MIT (same as DAC)