---
license: mit
---

# DAC Encoder ONNX (Voice Cloning)

ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).

## Purpose

Used for on-device voice cloning with OuteTTS 1.0. Encodes reference audio into discrete codebook indices
that condition the OuteTTS model during synthesis.

## Model Details

- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- **Input**: `(batch, 1, samples)` float32 PCM at 24kHz, values in [-1, 1]
- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
- **Frame rate**: 75 frames per second (320x downsampling)
- **Codebooks**: 2 quantizers, each with 1024 entries
- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25
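
The figures above are consistent with each other, which the following sketch checks with plain arithmetic. The constant names and the `expected_frames` helper are illustrative, not part of the model. Rounding a trailing partial frame up is an assumption about how the export handles input lengths that are not a multiple of 320.

```python
import math

SAMPLE_RATE = 24_000   # Hz, from the input spec above
HOP = 320              # samples per frame (320x downsampling)
NUM_CODEBOOKS = 2
CODEBOOK_SIZE = 1024


def expected_frames(num_samples: int) -> int:
    """Hypothetical helper: frames for a given sample count.

    Assumes a trailing partial frame is rounded up; the exported graph
    may behave differently for lengths that are not a multiple of HOP.
    """
    return math.ceil(num_samples / HOP)


frame_rate = SAMPLE_RATE / HOP                      # frames per second
bits_per_code = math.log2(CODEBOOK_SIZE)            # bits needed for one index
bitrate_bps = frame_rate * NUM_CODEBOOKS * bits_per_code

print(frame_rate, bitrate_bps, expected_frames(SAMPLE_RATE))
```

The resulting bitrate matches the `1.5kbps` in the source checkpoint's filename.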

## Usage

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

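# 1 second of placeholder audio (low-amplitude noise) at 24 kHz, clipped to [-1, 1]
audio = np.clip(np.random.randn(1, 1, 24000) * 0.1, -1.0, 1.0).astype(np.float32)

# "audio" is the model's input name; run() returns a list with one array per output
codes = sess.run(None, {"audio": audio})
# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames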
```
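
For real reference audio, the input has to be brought into the `(batch, 1, samples)` float32 layout first. The sketch below is one way to do that, assuming the recording is already mono 16-bit PCM at 24 kHz (resampling is not shown). `prepare_audio` is a hypothetical helper, and zero-padding to a multiple of 320 samples is an assumption that may not be required by the exported graph.

```python
import numpy as np

HOP = 320  # samples per frame, from the 320x downsampling in Model Details


def prepare_audio(pcm_int16: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mono int16 PCM at 24 kHz -> (1, 1, samples) float32."""
    # Scale 16-bit integers into the [-1, 1] range the model expects
    audio = pcm_int16.astype(np.float32) / 32768.0
    audio = np.clip(audio, -1.0, 1.0)
    # Assumption: zero-pad so the length is a whole number of frames
    pad = (-len(audio)) % HOP
    audio = np.pad(audio, (0, pad))
    # Add batch and channel axes
    return audio[np.newaxis, np.newaxis, :]
```

The returned array can be passed as the `"audio"` input in the session call shown above.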

## Integration with OuteTTS 1.0

The codes map to special tokens in the OuteTTS 1.0 vocabulary:
- Codebook 1: `<|c1_0|>` through `<|c1_1023|>`
- Codebook 2: `<|c2_0|>` through `<|c2_1023|>`

These are interleaved per frame to create the speaker conditioning prompt.
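
A minimal sketch of the per-frame interleaving, for a single batch item. `codes_to_tokens` is a hypothetical helper, and the ordering (codebook 1 token, then codebook 2 token, for each frame) is an assumption. The full OuteTTS 1.0 prompt contains additional structure that is not reproduced here.

```python
from typing import Sequence


def codes_to_tokens(codes: Sequence[Sequence[int]]) -> str:
    """Hypothetical helper: (2, frames) codes -> interleaved token string.

    Assumes the c1 token precedes the c2 token within each frame.
    """
    c1_row, c2_row = codes
    if len(c1_row) != len(c2_row):
        raise ValueError("both codebooks must have the same number of frames")
    parts = []
    for c1, c2 in zip(c1_row, c2_row):
        # Codes are codebook indices, so they must lie in 0-1023
        if not (0 <= c1 < 1024 and 0 <= c2 < 1024):
            raise ValueError(f"code out of range: {c1}, {c2}")
        parts.append(f"<|c1_{int(c1)}|><|c2_{int(c2)}|>")
    return "".join(parts)


print(codes_to_tokens([[1, 2], [3, 4]]))
```

With the encoder output from the Usage section, the first batch item would be passed as `codes[0][0]`.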

## License

MIT (same as DAC)