---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---
# Chatterbox Multilingual TTS – ExecuTorch Models
Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).
**Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub
---
## What's Here
9 ExecuTorch `.pte` files covering the complete TTS pipeline – from text input to 24 kHz waveform – with zero PyTorch runtime required:
| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |
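
After downloading, it is easy to sanity-check that all nine `.pte` files listed above actually landed on disk. A minimal sketch (file names are taken from the table; the `et_models` directory matches the download snippet in this card):

```python
from pathlib import Path

# All nine .pte files from the table above.
EXPECTED_PTE = [
    "voice_encoder.pte", "xvector_encoder.pte", "t3_cond_speech_emb.pte",
    "t3_cond_enc.pte", "t3_prefill.pte", "t3_decode.pte",
    "s3gen_encoder.pte", "cfm_step.pte", "hifigan.pte",
]

def missing_models(model_dir: str) -> list[str]:
    """Return the expected .pte files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_PTE if not (root / name).is_file()]

gaps = missing_models("et_models")
```

An empty `gaps` list means the download is complete; otherwise it names exactly which files to re-fetch.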
---
## Quick Download
```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```
---
## Pipeline Overview
```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24 kHz PCM waveform
```
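
The autoregressive T3 decode stage keeps a fixed-size KV cache and writes each new step with a masked select instead of an in-place slice assignment, which keeps all shapes static for export. A minimal numpy sketch of that masked-write pattern (shapes and names are illustrative, not the actual model's; the real export does the same thing with `torch.where` on FP16 tensors):

```python
import numpy as np

def masked_cache_write(cache: np.ndarray, new_kv: np.ndarray, pos: int) -> np.ndarray:
    """Write new_kv into cache at time index pos via a where-mask.

    cache:  (T_max, D) static buffer
    new_kv: (D,) key/value vector for the current decode step
    """
    t_max, _ = cache.shape
    # Boolean mask selecting only the row being written; broadcasts over D.
    mask = (np.arange(t_max) == pos)[:, None]       # (T_max, 1)
    return np.where(mask, new_kv[None, :], cache)   # output keeps the static shape

# Usage: fill a 4-step cache one decode step at a time.
cache = np.zeros((4, 2))
for t in range(3):
    cache = masked_cache_write(cache, np.full(2, float(t + 1)), t)
```

Because the output shape never changes, the exported graph needs no dynamic allocation or in-place mutation, which is what makes the pattern `torch.export`-friendly.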
---
## Key Technical Notes
- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with static KV cache (`torch.where` writes) – bypasses HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) – replaces `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels) – ~half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
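
The manual DFT trick above can be illustrated outside of torch: an N-point real DFT is just two matrix multiplies against fixed cosine and sine basis matrices, which any matmul-capable backend supports. A numpy sketch verified against `np.fft.rfft` (the actual export builds the same matrices as torch tensors):

```python
import numpy as np

def real_dft_matrices(n_fft: int):
    """Cosine/sine basis matrices for an n_fft-point real DFT."""
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bins (rows)
    n = np.arange(n_fft)[None, :]            # time samples (columns)
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), np.sin(angle)

def manual_rfft(x: np.ndarray, cos_m: np.ndarray, sin_m: np.ndarray):
    """Real and imaginary DFT parts via plain matmul (XNNPACK-friendly)."""
    # X[k] = sum_n x[n] * (cos(2*pi*k*n/N) - i*sin(2*pi*k*n/N))
    return cos_m @ x, -(sin_m @ x)

n_fft = 16
x = np.random.default_rng(0).standard_normal(n_fft)
cos_m, sin_m = real_dft_matrices(n_fft)
re, im = manual_rfft(x, cos_m, sin_m)
ref = np.fft.rfft(x)
```

The basis matrices are precomputed constants, so the only runtime ops left in the graph are matmuls – exactly what a delegate like XNNPACK accelerates.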
---
## Usage
See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)
```bash
# Clone code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch
# Download models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"
# Run full PTE inference
python test_true_full_pte.py
```
---
## Android Integration
These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:
```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```
With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.
---
## Performance (Jetson AGX Orin, CPU only)
| Stage | Time |
|-------|------|
| Voice encoding | ~1s |
| T3 prefill | ~22s |
| T3 decode (~100 tokens) | ~800s total (~8s/token) |
| S3Gen encoder | ~2s |
| CFM (2 steps) | ~40s |
| HiFiGAN | ~10s/chunk |
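
The per-chunk HiFiGAN timing above follows from the fixed `T_MEL=300` input shape: longer mel sequences are padded to a multiple of 300 frames, vocoded chunk by chunk, and the audio concatenated. A sketch of that loop, where `run_vocoder` is a hypothetical stand-in for the real `hifigan.pte` forward call and `HOP` (samples per mel frame at 24 kHz) is an assumed value for illustration:

```python
import numpy as np

T_MEL = 300   # fixed HiFiGAN input length in mel frames (from this card)
HOP = 480     # samples per mel frame at 24 kHz – an illustrative assumption

def run_vocoder(mel_chunk: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the hifigan.pte forward call."""
    return np.zeros(mel_chunk.shape[1] * HOP, dtype=np.float32)

def vocode_chunked(mel: np.ndarray) -> np.ndarray:
    """mel: (n_mels, T). Pad T to a multiple of T_MEL, vocode per chunk."""
    _, t = mel.shape
    pad = (-t) % T_MEL
    mel = np.pad(mel, ((0, 0), (0, pad)))
    chunks = [mel[:, i:i + T_MEL] for i in range(0, mel.shape[1], T_MEL)]
    audio = np.concatenate([run_vocoder(c) for c in chunks])
    return audio[: t * HOP]   # trim the padded tail

wav = vocode_chunked(np.zeros((80, 750), dtype=np.float32))
```

Trimming the tail keeps the output length proportional to the unpadded mel length, so chunking is transparent to the caller.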
---
## License
Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.