---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---
| |
# Chatterbox Multilingual TTS – ExecuTorch Models
|
|
| Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/). |
|
|
**📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub
|
|
| --- |
|
|
| ## What's Here |
|
|
9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to 24 kHz waveform, with no PyTorch runtime required:
|
|
| | File | Size | Backend | Precision | Stage | |
| |------|------|---------|-----------|-------| |
| | `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding | |
| | `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning | |
| | `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding | |
| | `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder | |
| | `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill | |
| | `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode | |
| | `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder | |
| | `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step | |
| | `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder | |
| | **Total** | **~2.6 GB** | | | | |
|
|
| --- |
|
|
| ## Quick Download |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| |
| snapshot_download( |
| "acul3/chatterbox-executorch", |
| local_dir="et_models", |
| repo_type="model" |
| ) |
| ``` |
|
|
| --- |
|
|
| ## Pipeline Overview |
|
|
| ``` |
| Text β MTLTokenizer β text tokens |
| Reference Audio β VoiceEncoder + CAMPPlus β speaker conditioning |
| β |
| T3 Prefill (LlamaModel, conditioned) |
| β |
| T3 Decode (autoregressive, ~100 tokens) |
| β |
| S3Gen Encoder (Conformer) |
| β |
| CFM Step Γ 2 (flow matching) |
| β |
| HiFiGAN (vocoder, chunked) |
| β |
| 24kHz PCM waveform π΅ |
| ``` |
|
|
| --- |
|
|
| ## Key Technical Notes |
|
|
- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK does not support
- **T3 models** are FP16 (XNNPACK half-precision kernels), roughly half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
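The manual real-valued DFT mentioned above amounts to two matrix multiplies against precomputed cosine and sine bases. A NumPy sketch of the idea, checked against `np.fft.rfft` (this is an illustration of the technique, not the exported model's code):

```python
import numpy as np

def rdft_matrices(n_fft):
    """Cosine/sine basis matrices whose matmuls reproduce a real DFT,
    the pattern used to replace torch.stft for XNNPACK export."""
    k = np.arange(n_fft // 2 + 1)[:, None]  # frequency bins
    n = np.arange(n_fft)[None, :]           # time samples
    ang = 2 * np.pi * k * n / n_fft
    return np.cos(ang), -np.sin(ang)        # real and imaginary bases

n_fft = 16
x = np.random.default_rng(0).standard_normal(n_fft)
cos_m, sin_m = rdft_matrices(n_fft)
real, imag = cos_m @ x, sin_m @ x           # two plain matmuls
ref = np.fft.rfft(x)
assert np.allclose(real, ref.real) and np.allclose(imag, ref.imag)
```

Expressed this way, the transform is just `matmul`, which every ExecuTorch backend supports, at the cost of O(N²) work instead of the FFT's O(N log N).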
|
|
| --- |
|
|
| ## Usage |
|
|
| See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) |
|
|
| ```bash |
| # Clone code |
| git clone https://github.com/acul3/chatterbox-executorch.git |
| cd chatterbox-executorch |
| |
| # Download models (this repo) |
| python -c " |
| from huggingface_hub import snapshot_download |
| snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model') |
| " |
| |
| # Run full PTE inference |
| python test_true_full_pte.py |
| ``` |
|
|
| --- |
|
|
| ## Android Integration |
|
|
| These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with: |
|
|
| ```kotlin |
| val module = Module.load(context.filesDir.path + "/t3_prefill.pte") |
| ``` |
|
|
With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.
|
|
| ## Performance (Jetson AGX Orin, CPU only) |
|
|
| | Stage | Time | |
| |-------|------| |
| | Voice encoding | ~1s | |
| | T3 prefill | ~22s | |
| | T3 decode (~100 tokens) | ~800s total (~8s/token) | |
| | S3Gen encoder | ~2s | |
| | CFM (2 steps) | ~40s | |
| | HiFiGAN | ~10s/chunk | |
|
|
| --- |
|
|
| ## License |
|
|
| Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms. |
|
|