---
license: mit
---
# DAC Encoder ONNX (Voice Cloning)
ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).
## Purpose
This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete codebook indices
that condition the OuteTTS model during synthesis.
## Model Details
- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- **Input**: `(batch, 1, samples)` float32 PCM at 24kHz, values in [-1, 1]
- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
- **Frame rate**: 75 frames per second (320x downsampling)
- **Codebooks**: 2 quantizers, each with 1024 entries
- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
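
The figures above are consistent with each other, and the arithmetic can be checked in a few lines. This is a sketch, not part of the exported model: `expected_frames` is a hypothetical helper, and rounding up for inputs that are not a multiple of 320 samples is an assumption, since the exported graph may handle trailing samples differently.

```python
import math

SAMPLE_RATE = 24_000    # Hz, from the model details above
HOP = 320               # downsampling factor
N_CODEBOOKS = 2
CODEBOOK_SIZE = 1024

# 24000 samples/s divided by 320 samples/frame gives the frame rate.
frame_rate = SAMPLE_RATE / HOP

# Each code selects one of 1024 entries, which takes log2(1024) bits.
bits_per_code = math.log2(CODEBOOK_SIZE)

# Frames per second * codebooks * bits per code gives the bitrate.
bitrate_bps = frame_rate * N_CODEBOOKS * bits_per_code


def expected_frames(num_samples, hop=HOP):
    """Estimate the number of output frames (hypothetical helper).

    Assumption: inputs are effectively padded up to a multiple of `hop`.
    """
    return math.ceil(num_samples / hop)


print(frame_rate, bits_per_code, bitrate_bps, expected_frames(24_000))
```

The resulting bitrate matches the `1.5kbps` in the source checkpoint name.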
## Usage
```python
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("dac_encoder_24khz.onnx")
# 1 second of audio at 24kHz
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1
outputs = sess.run(None, {"audio": audio})  # list with one array per model output
codes = outputs[0]
# codes.shape == (1, 2, 75): 2 codebooks, 75 frames
```
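
Real reference audio usually arrives as 16-bit PCM rather than random noise. The sketch below shows one way to shape it into the `(batch, 1, samples)` float32 layout described under Model Details. `pcm16_to_model_input` is a hypothetical helper, and it assumes the signal is already mono and already sampled at 24 kHz; resampling and channel mixing are out of scope here.

```python
import numpy as np


def pcm16_to_model_input(pcm):
    """Convert mono int16 PCM samples to a (1, 1, samples) float32 array.

    Assumption: the samples are already at 24 kHz.
    """
    pcm = np.asarray(pcm, dtype=np.int16)
    if pcm.ndim != 1:
        raise ValueError("expected a 1-D mono signal")
    # Scale the int16 range [-32768, 32767] into [-1, 1).
    audio = pcm.astype(np.float32) / 32768.0
    # Add the batch and channel axes.
    return audio[np.newaxis, np.newaxis, :]


example = pcm16_to_model_input([0, 16384, -32768])
print(example.shape, example.dtype)
```

The returned array can be passed as the `audio` input in the snippet above.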
## Integration with OuteTTS 1.0
The codes map to special tokens in the OuteTTS 1.0 vocabulary:
- Codebook 1: `<|c1_0|>` through `<|c1_1024|>`
- Codebook 2: `<|c2_0|>` through `<|c2_1024|>`
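These are interleaved per frame to create the speaker conditioning prompt. Because each codebook has 1024 entries, the encoder only emits indices 0 to 1023, so the `_1024` tokens are never produced by this model.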
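
As a rough illustration of the per-frame interleaving, the sketch below turns two code sequences into token strings. `codes_to_tokens` is a hypothetical helper, and placing the codebook 1 token before the codebook 2 token within each frame is an assumption. The full OuteTTS 1.0 prompt format contains more than these tokens, so treat this as a sketch of the mapping only.

```python
def codes_to_tokens(c1_codes, c2_codes):
    """Interleave two codebook index sequences frame by frame.

    Assumption: within a frame, the codebook 1 token comes first.
    """
    if len(c1_codes) != len(c2_codes):
        raise ValueError("both codebooks must have the same number of frames")
    tokens = []
    for c1, c2 in zip(c1_codes, c2_codes):
        tokens.append(f"<|c1_{int(c1)}|>")
        tokens.append(f"<|c2_{int(c2)}|>")
    return tokens


# Toy codes for three frames; real values come from the encoder output.
print("".join(codes_to_tokens([5, 17, 1023], [0, 42, 7])))
```

With the encoder output from the Usage section, the two rows of one batch item could be passed as `codes_to_tokens(codes[0, 0], codes[0, 1])`.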
## License
MIT (same as DAC)