---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS — ExecuTorch Models

Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).

**📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub

---

## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline — from text input to 24kHz waveform — with zero PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |

---

## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model"
)
```

---

## Pipeline Overview

```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24kHz PCM waveform 🎵
```

---

## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes) — this bypasses HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) — this replaces `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels) — ~half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)

---

## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

---

## Android Integration

These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.

## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1s |
| T3 prefill | ~22s |
| T3 decode (~100 tokens) | ~800s total (~8s/token) |
| S3Gen encoder | ~2s |
| CFM (2 steps) | ~40s |
| HiFiGAN | ~10s/chunk |

---

## License

Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed.
Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.
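
---

## Appendix: Sketches of the Export Tricks

The `torch.stft` replacement described in Key Technical Notes — a real-valued DFT expressed as cosine/sine matrix multiplies — can be sketched in NumPy. This is a minimal illustration of the pattern, not the repo's actual implementation; function names here are hypothetical:

```python
import numpy as np

def real_dft_matrices(n_fft: int):
    """Precompute cosine/sine basis matrices for a real-valued DFT.

    Both have shape (n_fft // 2 + 1, n_fft), matching the rfft bin layout.
    """
    k = np.arange(n_fft // 2 + 1)[:, None]  # frequency bins
    n = np.arange(n_fft)[None, :]           # time samples
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), np.sin(angle)

def real_dft(frame: np.ndarray, cos_mat: np.ndarray, sin_mat: np.ndarray):
    """DFT of a real frame as two matmuls — no complex ops, so it exports
    to backends (like XNNPACK) that lack torch.stft support."""
    real = cos_mat @ frame
    imag = -(sin_mat @ frame)  # sign matches the np.fft.rfft convention
    return real, imag

# Sanity check against NumPy's reference rfft
n_fft = 16
x = np.random.default_rng(0).standard_normal(n_fft)
cos_mat, sin_mat = real_dft_matrices(n_fft)
re, im = real_dft(x, cos_mat, sin_mat)
ref = np.fft.rfft(x)
assert np.allclose(re, ref.real) and np.allclose(im, ref.imag)
```

The same basis matrices, transposed, give the inverse transform that stands in for `torch.istft` in the vocoder path.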
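The static KV cache mentioned for T3 Decode replaces in-place tensor indexing with masked `where` writes, keeping every shape fixed for `torch.export`. Below is a NumPy analogue of that pattern (using `np.where` in place of `torch.where`); the function and shapes are illustrative assumptions, not the repo's exact code:

```python
import numpy as np

def kv_cache_write(cache: np.ndarray, new_kv: np.ndarray, pos: int) -> np.ndarray:
    """Write one decode step's K/V row into a fixed-size cache.

    cache:  (max_seq, dim) preallocated buffer
    new_kv: (dim,) this step's key or value vector
    pos:    current decode position

    A positional mask + where() produces a same-shaped output, avoiding
    the dynamically growing buffers of HF's DynamicCache.
    """
    idx = np.arange(cache.shape[0])[:, None]      # (max_seq, 1)
    mask = idx == pos                             # True only on row `pos`
    return np.where(mask, new_kv[None, :], cache) # broadcast masked write

# Tiny demo: fill a 4-slot cache one token at a time
max_seq, dim = 4, 3
cache = np.zeros((max_seq, dim))
for t in range(3):
    cache = kv_cache_write(cache, np.full(dim, float(t + 1)), t)

assert np.allclose(cache[0], 1.0)
assert np.allclose(cache[2], 3.0)
assert np.allclose(cache[3], 0.0)  # unwritten slot stays zero
```

Because the cache tensor never changes shape, the exported graph can preallocate it once per `.pte` method and reuse it across all ~100 decode steps.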