---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---
# Chatterbox Multilingual TTS — ExecuTorch Models
Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).
**📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub
---
## What's Here
9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to 24kHz waveform, with no PyTorch runtime required:
| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |
---
## Quick Download
```python
from huggingface_hub import snapshot_download
snapshot_download(
"acul3/chatterbox-executorch",
local_dir="et_models",
repo_type="model"
)
```
---
## Pipeline Overview
```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24kHz PCM waveform 🎵
```
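The T3 decode stage above runs autoregressively, one token at a time, against a fixed-size KV cache. A minimal numpy sketch of that pattern (all names here are hypothetical; the exported model performs the cache write with `torch.where` so tensor shapes stay static for `torch.export`):

```python
import numpy as np

MAX_LEN, D = 8, 4  # toy cache length and head dimension

def write_kv(cache, new_kv, pos):
    """Write new_kv into slot `pos` with a masked update instead of
    in-place indexing -- the same trick the exported T3 decode uses
    (a torch.where write) to keep shapes static for torch.export."""
    mask = (np.arange(MAX_LEN) == pos)[:, None]    # (MAX_LEN, 1)
    return np.where(mask, new_kv[None, :], cache)  # (MAX_LEN, D)

cache = np.zeros((MAX_LEN, D))
for pos in range(3):                      # decode 3 toy steps
    new_kv = np.full(D, float(pos + 1))   # stand-in for a k/v projection
    cache = write_kv(cache, new_kv, pos)

print(cache[:3, 0])  # slots 0..2 filled in order, rest untouched
```

The masked update returns a full-size tensor each step instead of mutating a slice, which is what lets the exported graph keep every intermediate shape constant.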
---
## Key Technical Notes
- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
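The DFT-by-matmul trick from the HiFiGAN note can be sketched in a few lines of numpy (a simplified stand-in, not the exported graph): an N-point real-input DFT is just two matrix multiplies against precomputed cosine and sine bases, which lowers to ops XNNPACK supports.

```python
import numpy as np

N = 64                                    # toy frame length
n = np.arange(N)
k = np.arange(N // 2 + 1)[:, None]        # real-input DFT keeps N/2+1 bins
cos_basis = np.cos(2 * np.pi * k * n / N)  # (N/2+1, N)
sin_basis = np.sin(2 * np.pi * k * n / N)

x = np.random.randn(N)
re = cos_basis @ x                        # real part of the spectrum
im = -sin_basis @ x                       # imaginary part

# Agrees with numpy's FFT, but expressed as plain matmuls
ref = np.fft.rfft(x)
print(np.allclose(re, ref.real), np.allclose(im, ref.imag))
```

The bases are constants, so they export as weights; the per-frame cost is O(N²) instead of O(N log N), a trade the pipeline accepts for backend compatibility.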
---
## Usage
See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)
```bash
# Clone code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch
# Download models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"
# Run full PTE inference
python test_true_full_pte.py
```
---
## Android Integration
These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:
```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```
With QNN/NPU delegation on a Snapdragon device, expect a **10–50×** speedup over the CPU timings below.
## Performance (Jetson AGX Orin, CPU only)
| Stage | Time |
|-------|------|
| Voice encoding | ~1s |
| T3 prefill | ~22s |
| T3 decode (~100 tokens) | ~800s total (~8s/token) |
| S3Gen encoder | ~2s |
| CFM (2 steps) | ~40s |
| HiFiGAN | ~10s/chunk |
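Because the exported HiFiGAN graph takes a fixed `T_MEL=300` input, longer mel sequences have to be vocoded chunk by chunk (hence the per-chunk timing above). A hypothetical numpy sketch of the split-and-pad step (the actual pipeline in the GitHub repo may additionally overlap chunks to hide seams):

```python
import numpy as np

T_MEL = 300  # fixed frame count the exported hifigan.pte expects

def chunk_mel(mel, t_chunk=T_MEL):
    """Split a (n_mels, T) mel spectrogram into fixed-size chunks,
    zero-padding the tail so every chunk matches the static input shape."""
    n_mels, total = mel.shape
    n_chunks = -(-total // t_chunk)  # ceiling division
    padded = np.zeros((n_mels, n_chunks * t_chunk), dtype=mel.dtype)
    padded[:, :total] = mel
    return [padded[:, i * t_chunk:(i + 1) * t_chunk] for i in range(n_chunks)]

mel = np.random.randn(80, 742)   # toy mel: 80 bins, 742 frames
chunks = chunk_mel(mel)
print(len(chunks), chunks[0].shape)  # 3 chunks, each (80, 300)
```

Each chunk is then run through the vocoder independently and the resulting waveforms concatenated (trimming audio produced from the zero-padded tail).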
---
## License
Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox); the export pipeline code is MIT-licensed. Refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for the terms governing use of the weights.