---
language:
  - multilingual
  - en
  - zh
  - fr
  - de
  - es
  - ja
  - ko
  - pt
  - it
  - ru
  - ar
  - hi
  - tr
  - pl
  - nl
  - sv
  - da
  - fi
  - 'no'
  - cs
  - ro
  - hu
tags:
  - text-to-speech
  - executorch
  - on-device
  - android
  - voice-cloning
  - chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS – ExecuTorch Models

Pre-exported `.pte` model files for running Resemble AI's Chatterbox Multilingual TTS fully on-device using ExecuTorch.

📦 **Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub


## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to a 24 kHz waveform, with zero PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | FP16 | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | FP16 | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow-matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |
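After downloading, a small sanity check can confirm that all nine files listed above actually landed on disk. This helper is illustrative, not part of the repo; `et_models` matches the `local_dir` used in the download snippet:

```python
from pathlib import Path

# The nine .pte files this model card ships (names from the table above).
EXPECTED = [
    "voice_encoder.pte", "xvector_encoder.pte", "t3_cond_speech_emb.pte",
    "t3_cond_enc.pte", "t3_prefill.pte", "t3_decode.pte",
    "s3gen_encoder.pte", "cfm_step.pte", "hifigan.pte",
]

def missing_models(model_dir="et_models"):
    """Return the names of any expected .pte files not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]
```

An empty return value means the snapshot is complete; otherwise the list tells you which files to re-download.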

## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```

## Pipeline Overview

```text
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
                          ↓
              T3 Prefill (LlamaModel, conditioned)
                          ↓
              T3 Decode (autoregressive, ~100 tokens)
                          ↓
              S3Gen Encoder (Conformer)
                          ↓
              CFM Step × 2 (flow matching)
                          ↓
              HiFiGAN (vocoder, chunked)
                          ↓
              24 kHz PCM waveform 🎵
```
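The final HiFiGAN stage runs chunked because the exported graph has a fixed input length (`T_MEL=300` mel frames, per the notes below). A minimal sketch of the chunk-pad-concatenate loop, assuming a hypothetical `run_hifigan` callable for one fixed-size chunk and a hop of 256 audio samples per mel frame (both assumptions, not values from the repo):

```python
import numpy as np

T_MEL = 300  # fixed frames per HiFiGAN call (from the notes below)
HOP = 256    # assumed audio samples produced per mel frame

def vocode_chunked(mel, run_hifigan):
    """Vocode a (n_mels, T) mel spectrogram in fixed-size chunks.

    run_hifigan maps one (n_mels, T_MEL) chunk to (T_MEL * HOP,) samples.
    """
    _, total_frames = mel.shape
    pieces = []
    for start in range(0, total_frames, T_MEL):
        chunk = mel[:, start:start + T_MEL]
        pad = T_MEL - chunk.shape[1]
        if pad:  # zero-pad the last chunk up to the fixed export shape
            chunk = np.pad(chunk, ((0, 0), (0, pad)))
        audio = run_hifigan(chunk)
        valid = (T_MEL - pad) * HOP  # drop samples produced from padding
        pieces.append(audio[:valid])
    return np.concatenate(pieces)
```

Overlap-and-crossfade at chunk boundaries (to hide seams) is omitted here for brevity.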

## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply), replacing `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models are FP16** (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
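The real-valued DFT trick in the HiFiGAN note can be sketched in NumPy: instead of an FFT op, each frame's spectrum comes from two plain matrix multiplies against fixed cosine/sine matrices, which lower to ordinary matmuls in the exported graph. The frame length here is illustrative, not the repo's actual value:

```python
import numpy as np

N = 512                    # illustrative frame length
n = np.arange(N)
k = np.arange(N // 2 + 1)  # a real-input DFT keeps N//2 + 1 bins

# Fixed basis matrices, shape (N//2 + 1, N); in an export these are constants.
cos_mat = np.cos(2 * np.pi * np.outer(k, n) / N)
sin_mat = np.sin(2 * np.pi * np.outer(k, n) / N)

frame = np.random.default_rng(0).standard_normal(N)
real = cos_mat @ frame   # real part of each frequency bin
imag = -sin_mat @ frame  # imaginary part (e^{-i·} gives the minus sign)

# Matches the FFT-based reference to floating-point precision.
ref = np.fft.rfft(frame)
assert np.allclose(real, ref.real, atol=1e-8)
assert np.allclose(imag, ref.imag, atol=1e-8)
```

The inverse transform used in place of `torch.istft` works the same way with the transposed bases.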

## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone the code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download the models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

## Android Integration

These models are designed for Android deployment via the ExecuTorch Android SDK. Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a 10–50× speedup over the CPU timings below.

## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1 s |
| T3 prefill | ~22 s |
| T3 decode (~100 tokens) | ~800 s total (~8 s/token) |
| S3Gen encoder | ~2 s |
| CFM (2 steps) | ~40 s |
| HiFiGAN | ~10 s/chunk |
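Summing the stage timings above gives a rough end-to-end CPU estimate. T3 decode dominates; the 3-chunk HiFiGAN count here is an assumption for illustration:

```python
# Back-of-the-envelope total from the per-stage CPU timings above.
timings_s = {
    "voice_encoding": 1,
    "t3_prefill": 22,
    "t3_decode": 100 * 8,  # ~100 tokens at ~8 s/token
    "s3gen_encoder": 2,
    "cfm": 40,             # 2 flow-matching steps
    "hifigan": 3 * 10,     # assumed 3 chunks at ~10 s each
}
total_s = sum(timings_s.values())
print(f"~{total_s} s (~{total_s / 60:.0f} min) per utterance on CPU")  # ~895 s (~15 min)
```

This is why the QNN/NPU delegation mentioned above matters for interactive use on Android.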

## License

The model weights are derived from Resemble AI's Chatterbox; refer to the original Chatterbox license for the weights' usage terms. The export pipeline code is MIT-licensed.