---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---
# Chatterbox Multilingual TTS – ExecuTorch Models
Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).
**Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub
---
## What's Here
9 ExecuTorch `.pte` files covering the complete TTS pipeline – from text input to 24 kHz waveform – with zero PyTorch runtime required:
| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |
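
After downloading, it is easy to sanity-check that all nine `.pte` files listed above actually landed on disk. A minimal sketch (file names are taken from the table; the `et_models` directory matches the download snippet in this card):

```python
from pathlib import Path

# All nine .pte files from the table above.
EXPECTED_PTE = [
    "voice_encoder.pte", "xvector_encoder.pte", "t3_cond_speech_emb.pte",
    "t3_cond_enc.pte", "t3_prefill.pte", "t3_decode.pte",
    "s3gen_encoder.pte", "cfm_step.pte", "hifigan.pte",
]

def missing_models(model_dir: str) -> list[str]:
    """Return the expected .pte files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_PTE if not (root / name).is_file()]

gaps = missing_models("et_models")
```

An empty `gaps` list means the download is complete; otherwise it names exactly which files to re-fetch.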
---
## Quick Download
```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```
---
## Pipeline Overview
```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24 kHz PCM waveform
```
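
The autoregressive T3 decode stage keeps a fixed-size KV cache and writes each new step with a masked select instead of an in-place slice assignment, which keeps all shapes static for export. A minimal numpy sketch of that masked-write pattern (shapes and names are illustrative, not the actual model's; the real export does the same thing with `torch.where` on FP16 tensors):

```python
import numpy as np

def masked_cache_write(cache: np.ndarray, new_kv: np.ndarray, pos: int) -> np.ndarray:
    """Write new_kv into cache at time index pos via a where-mask.

    cache:  (T_max, D) static buffer
    new_kv: (D,) key/value vector for the current decode step
    """
    t_max, _ = cache.shape
    # Boolean mask selecting only the row being written; broadcasts over D.
    mask = (np.arange(t_max) == pos)[:, None]       # (T_max, 1)
    return np.where(mask, new_kv[None, :], cache)   # output keeps the static shape

# Usage: fill a 4-step cache one decode step at a time.
cache = np.zeros((4, 2))
for t in range(3):
    cache = masked_cache_write(cache, np.full(2, float(t + 1)), t)
```

Because the output shape never changes, the exported graph needs no dynamic allocation or in-place mutation, which is what makes the pattern `torch.export`-friendly.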
---
## Key Technical Notes
- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with static KV cache (`torch.where` writes) – bypasses HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) – replaces `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels) – ~half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
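
The manual DFT trick above can be illustrated outside of torch: an N-point real DFT is just two matrix multiplies against fixed cosine and sine basis matrices, which any matmul-capable backend supports. A numpy sketch verified against `np.fft.rfft` (the actual export builds the same matrices as torch tensors):

```python
import numpy as np

def real_dft_matrices(n_fft: int):
    """Cosine/sine basis matrices for an n_fft-point real DFT."""
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bins (rows)
    n = np.arange(n_fft)[None, :]            # time samples (columns)
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), np.sin(angle)

def manual_rfft(x: np.ndarray, cos_m: np.ndarray, sin_m: np.ndarray):
    """Real and imaginary DFT parts via plain matmul (XNNPACK-friendly)."""
    # X[k] = sum_n x[n] * (cos(2*pi*k*n/N) - i*sin(2*pi*k*n/N))
    return cos_m @ x, -(sin_m @ x)

n_fft = 16
x = np.random.default_rng(0).standard_normal(n_fft)
cos_m, sin_m = real_dft_matrices(n_fft)
re, im = manual_rfft(x, cos_m, sin_m)
ref = np.fft.rfft(x)
```

The basis matrices are precomputed constants, so the only runtime ops left in the graph are matmuls – exactly what a delegate like XNNPACK accelerates.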
---
## Usage
See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)
```bash
# Clone code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch
# Download models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"
# Run full PTE inference
python test_true_full_pte.py
```
---
## Android Integration
These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:
```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```
With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.
---
## Performance (Jetson AGX Orin, CPU only)
| Stage | Time |
|-------|------|
| Voice encoding | ~1s |
| T3 prefill | ~22s |
| T3 decode (~100 tokens) | ~800s total (~8s/token) |
| S3Gen encoder | ~2s |
| CFM (2 steps) | ~40s |
| HiFiGAN | ~10s/chunk |
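
The per-chunk HiFiGAN timing above follows from the fixed `T_MEL=300` input shape: longer mel sequences are padded to a multiple of 300 frames, vocoded chunk by chunk, and the audio concatenated. A sketch of that loop, where `run_vocoder` is a hypothetical stand-in for the real `hifigan.pte` forward call and `HOP` (samples per mel frame at 24 kHz) is an assumed value for illustration:

```python
import numpy as np

T_MEL = 300   # fixed HiFiGAN input length in mel frames (from this card)
HOP = 480     # samples per mel frame at 24 kHz – an illustrative assumption

def run_vocoder(mel_chunk: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the hifigan.pte forward call."""
    return np.zeros(mel_chunk.shape[1] * HOP, dtype=np.float32)

def vocode_chunked(mel: np.ndarray) -> np.ndarray:
    """mel: (n_mels, T). Pad T to a multiple of T_MEL, vocode per chunk."""
    _, t = mel.shape
    pad = (-t) % T_MEL
    mel = np.pad(mel, ((0, 0), (0, pad)))
    chunks = [mel[:, i:i + T_MEL] for i in range(0, mel.shape[1], T_MEL)]
    audio = np.concatenate([run_vocoder(c) for c in chunks])
    return audio[: t * HOP]   # trim the padded tail

wav = vocode_chunked(np.zeros((80, 750), dtype=np.float32))
```

Trimming the tail keeps the output length proportional to the unpadded mel length, so chunking is transparent to the caller.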
---
## License
Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.