---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS – ExecuTorch Models

Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).

**📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub

---

## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to a 24 kHz waveform, with no PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |

---

## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model"
)
```

---

## Pipeline Overview

```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
                          ↓
              T3 Prefill (LlamaModel, conditioned)
                          ↓
              T3 Decode (autoregressive, ~100 tokens)
                          ↓
              S3Gen Encoder (Conformer)
                          ↓
              CFM Step × 2 (flow matching)
                          ↓
              HiFiGAN (vocoder, chunked)
                          ↓
              24 kHz PCM waveform 🎵
```
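The prefill/decode split above can be illustrated with a toy autoregressive loop. This is a pure-NumPy sketch with a stand-in stage function, not the real pipeline (which runs each stage as a separate `.pte` module); it only shows the shape of the control flow: one prefill pass fills a fixed-size cache, then ~100 single-token decode steps write into it.

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_LEN, D = 128, 16  # toy static cache length and hidden size

def toy_stage(x):
    # Stand-in for executing an exported .pte module on its inputs.
    return np.tanh(x)

# "Prefill": run the conditioned prompt once, filling the static KV cache.
prompt = rng.normal(size=(10, D))
kv_cache = np.zeros((MAX_LEN, D))
kv_cache[: len(prompt)] = toy_stage(prompt)
pos = len(prompt)

# "Decode": one token at a time, writing into the fixed-shape cache via a
# masked write (the exported model uses torch.where for the same effect).
tokens = []
x = prompt[-1]
for step in range(100):
    x = toy_stage(x)
    mask = (np.arange(MAX_LEN) == pos)[:, None]
    kv_cache = np.where(mask, x[None, :], kv_cache)
    pos += 1
    tokens.append(int(np.argmax(x)))

print(len(tokens), pos)  # 100 decode steps appended after the 10-frame prompt
```

The cache never changes shape during decoding, which is what makes the graph exportable with static shapes.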

---

## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
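
The `torch.stft` replacement mentioned above reduces a real-valued DFT to two matrix multiplies against precomputed cosine/sine bases. A minimal NumPy sketch of that idea (illustrative only, not the exported code; windowing and frame overlap are omitted), checked against `np.fft.rfft`:

```python
import numpy as np

def real_dft_matrices(n_fft):
    # Precomputed cosine/sine bases: row k is one real-FFT bin (0..n_fft/2).
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(n_fft)[None, :]
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), -np.sin(angle)   # real-part and imag-part bases

def real_dft(frame, cos_mat, sin_mat):
    # Plain matmuls: XNNPACK-friendly, no complex ops or stft needed.
    return cos_mat @ frame, sin_mat @ frame

n_fft = 16
cos_mat, sin_mat = real_dft_matrices(n_fft)
frame = np.random.default_rng(1).normal(size=n_fft)
re, im = real_dft(frame, cos_mat, sin_mat)

# Sanity check against NumPy's real FFT.
ref = np.fft.rfft(frame)
assert np.allclose(re, ref.real) and np.allclose(im, ref.imag)
```

Because the bases are constant matrices, they export cleanly as weights and the whole transform lowers to matmul kernels the backend already supports.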

---

## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

---

## Android Integration

These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a **10–50×** speedup over the CPU timings below.

---

## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1s |
| T3 prefill | ~22s |
| T3 decode (~100 tokens) | ~800s total (~8s/token) |
| S3Gen encoder | ~2s |
| CFM (2 steps) | ~40s |
| HiFiGAN | ~10s/chunk |
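
Because the HiFiGAN export has a fixed `T_MEL=300` input, longer mel sequences must be fed in fixed-size chunks, which is why its timing above is per chunk. A hedged sketch of the splitting/padding step (illustrative only; the actual pipeline may also overlap chunks to hide boundary artifacts):

```python
import numpy as np

T_MEL = 300  # fixed frame count the exported hifigan.pte expects

def chunk_mel(mel, chunk=T_MEL):
    """Split a (n_mels, T) mel into fixed-size chunks, zero-padding the tail."""
    n_mels, t = mel.shape
    n_chunks = -(-t // chunk)                    # ceil division
    padded = np.zeros((n_mels, n_chunks * chunk), dtype=mel.dtype)
    padded[:, :t] = mel
    # Return the original length so trailing padded audio can be trimmed.
    return np.split(padded, n_chunks, axis=1), t

mel = np.random.default_rng(2).normal(size=(80, 730)).astype(np.float32)
chunks, orig_t = chunk_mel(mel)
print(len(chunks), chunks[0].shape)  # 3 fixed-size chunks of (80, 300)
```

Each chunk can then be run through the vocoder independently, and the decoded audio concatenated and trimmed to the original length.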

---

## License

Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.