Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,49 @@
---
license: mit
---

# DAC Encoder ONNX (Voice Cloning)

ONNX export of the **encoder + quantizer** from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0).

## Purpose

This model is used for on-device voice cloning with OuteTTS 1.0. It encodes reference audio into discrete codebook indices that condition the OuteTTS model during synthesis.

## Model Details

- **Source**: `weights_24khz_1.5kbps_v1.0.pth` from `ibm-research/DAC.speech.v1.0`
- **Components**: Encoder (21.5M params) + ResidualVectorQuantize (53K params)
- **Input**: `(batch, 1, samples)` float32 PCM at 24 kHz, values in [-1, 1]
- **Output**: `(batch, 2, frames)` int64 codes, values 0-1023
- **Frame rate**: 75 frames per second (320x downsampling)
- **Codebooks**: 2 quantizers, each with 1024 entries
- **Export**: PyTorch 2.10 → ONNX opset 18, verified with ONNX Runtime 1.25

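The figures in the list above are mutually consistent, and a short sketch can check the arithmetic: 24 kHz divided by the 320x downsampling gives 75 frames per second, and two 1024-entry codebooks at that rate give the 1.5 kbps in the checkpoint name. The constant names and the `frames_for` helper are illustrative and not part of the export. The helper only covers inputs whose length is a multiple of 320, since the README does not state how other lengths are handled.

```python
import math

SAMPLE_RATE = 24_000   # Hz, from "Input"
HOP_LENGTH = 320       # samples per frame, from "Frame rate"
NUM_CODEBOOKS = 2      # from "Codebooks"
CODEBOOK_SIZE = 1024   # entries per codebook

frame_rate = SAMPLE_RATE / HOP_LENGTH                      # frames per second
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)  # 10 bits per code
bitrate_kbps = frame_rate * bits_per_frame / 1000


def frames_for(num_samples: int) -> int:
    """Frame count for an input whose length is a multiple of HOP_LENGTH."""
    if num_samples % HOP_LENGTH != 0:
        raise ValueError("pad the input to a multiple of HOP_LENGTH first")
    return num_samples // HOP_LENGTH


print(frame_rate, bitrate_kbps, frames_for(24_000))  # 75.0 1.5 75
```
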
## Usage

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("dac_encoder_24khz.onnx")

# 1 second of audio at 24kHz
audio = np.random.randn(1, 1, 24000).astype(np.float32) * 0.1

codes = sess.run(None, {"audio": audio})
# codes[0].shape = (1, 2, 75): 2 codebooks, 75 frames
```

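The usage example feeds random noise, so a sketch of shaping real reference audio into the expected input may help. It assumes the recording is already mono 16-bit PCM at 24 kHz (no resampling is done here), and `prepare_audio` is a hypothetical helper name. Zero-padding to a multiple of 320 samples is an assumption made to keep the frame count predictable, since the README does not say how the export handles other lengths.

```python
import numpy as np

HOP_LENGTH = 320  # samples per frame (320x downsampling)


def prepare_audio(pcm_int16: np.ndarray) -> np.ndarray:
    """Convert mono int16 PCM (assumed 24 kHz) to a (1, 1, samples) float32 array."""
    # Scale int16 to float32 in [-1, 1).
    audio = pcm_int16.astype(np.float32) / 32768.0
    # Assumption: right-pad with zeros up to a multiple of the hop length.
    pad = (-len(audio)) % HOP_LENGTH
    audio = np.pad(audio, (0, pad))
    # Add the batch and channel dimensions.
    return audio[np.newaxis, np.newaxis, :]


# Example with a synthetic 0.5 s, 220 Hz tone standing in for a recording.
t = np.arange(12_000) / 24_000
pcm = (0.3 * 32767 * np.sin(2 * np.pi * 220 * t)).astype(np.int16)
audio = prepare_audio(pcm)
print(audio.shape, audio.dtype)
```

The resulting array can be passed as the `"audio"` input in the session call shown in the usage example.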
## Integration with OuteTTS 1.0

The codes map to special tokens in the OuteTTS 1.0 vocabulary:
- Codebook 1: `<|c1_0|>` through `<|c1_1024|>`
- Codebook 2: `<|c2_0|>` through `<|c2_1024|>`

These are interleaved per frame to create the speaker conditioning prompt.

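A minimal sketch of the per-frame interleaving follows, using the token spellings listed above. The order within a frame (codebook 1 first, then codebook 2) and the `codes_to_tokens` helper are assumptions for illustration. This covers only the code-to-token mapping, not the full OuteTTS prompt layout.

```python
import numpy as np


def codes_to_tokens(codes: np.ndarray) -> list[str]:
    """Map a (2, frames) array of codes to interleaved special-token strings."""
    c1, c2 = codes  # one row per codebook
    tokens = []
    for a, b in zip(c1.tolist(), c2.tolist()):
        # Assumed order: codebook 1 token, then codebook 2 token, per frame.
        tokens.append(f"<|c1_{a}|>")
        tokens.append(f"<|c2_{b}|>")
    return tokens


# Two frames of made-up codes; with the ONNX output this would be codes[0][0].
example = np.array([[17, 512], [3, 1023]], dtype=np.int64)
print("".join(codes_to_tokens(example)))
# <|c1_17|><|c2_3|><|c1_512|><|c2_1023|>
```
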
## License

MIT (same as DAC)