---
language:
  - multilingual
  - en
  - zh
  - fr
  - de
  - es
  - ja
  - ko
  - pt
  - it
  - ru
  - ar
  - hi
  - tr
  - pl
  - nl
  - sv
  - da
  - fi
  - 'no'
  - cs
  - ro
  - hu
tags:
  - text-to-speech
  - executorch
  - on-device
  - android
  - voice-cloning
  - chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS – ExecuTorch Models

Pre-exported `.pte` model files for running Resemble AI's Chatterbox Multilingual TTS fully on-device using ExecuTorch.

📦 **Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub


## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to a 24 kHz waveform, with zero PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | FP16 | T3 Transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | FP16 | T3 Transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow-matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |
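After downloading, a small sanity check can confirm that all nine files listed above actually landed on disk. This helper is illustrative, not part of the repo; `et_models` matches the `local_dir` used in the download snippet:

```python
from pathlib import Path

# The nine .pte files this model card ships (names from the table above).
EXPECTED = [
    "voice_encoder.pte", "xvector_encoder.pte", "t3_cond_speech_emb.pte",
    "t3_cond_enc.pte", "t3_prefill.pte", "t3_decode.pte",
    "s3gen_encoder.pte", "cfm_step.pte", "hifigan.pte",
]

def missing_models(model_dir="et_models"):
    """Return the names of any expected .pte files not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]
```

An empty return value means the snapshot is complete; otherwise the list tells you which files to re-download.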

## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```

## Pipeline Overview

```text
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
                          ↓
              T3 Prefill (LlamaModel, conditioned)
                          ↓
              T3 Decode (autoregressive, ~100 tokens)
                          ↓
              S3Gen Encoder (Conformer)
                          ↓
              CFM Step × 2 (flow matching)
                          ↓
              HiFiGAN (vocoder, chunked)
                          ↓
              24 kHz PCM waveform 🎵
```
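The final HiFiGAN stage runs chunked because the exported graph has a fixed input length (`T_MEL=300` mel frames, per the notes below). A minimal sketch of the chunk-pad-concatenate loop, assuming a hypothetical `run_hifigan` callable for one fixed-size chunk and a hop of 256 audio samples per mel frame (both assumptions, not values from the repo):

```python
import numpy as np

T_MEL = 300  # fixed frames per HiFiGAN call (from the notes below)
HOP = 256    # assumed audio samples produced per mel frame

def vocode_chunked(mel, run_hifigan):
    """Vocode a (n_mels, T) mel spectrogram in fixed-size chunks.

    run_hifigan maps one (n_mels, T_MEL) chunk to (T_MEL * HOP,) samples.
    """
    _, total_frames = mel.shape
    pieces = []
    for start in range(0, total_frames, T_MEL):
        chunk = mel[:, start:start + T_MEL]
        pad = T_MEL - chunk.shape[1]
        if pad:  # zero-pad the last chunk up to the fixed export shape
            chunk = np.pad(chunk, ((0, 0), (0, pad)))
        audio = run_hifigan(chunk)
        valid = (T_MEL - pad) * HOP  # drop samples produced from padding
        pieces.append(audio[:valid])
    return np.concatenate(pieces)
```

Overlap-and-crossfade at chunk boundaries (to hide seams) is omitted here for brevity.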

## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply), replacing `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models are FP16** (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
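The real-valued DFT trick in the HiFiGAN note can be sketched in NumPy: instead of an FFT op, each frame's spectrum comes from two plain matrix multiplies against fixed cosine/sine matrices, which lower to ordinary matmuls in the exported graph. The frame length here is illustrative, not the repo's actual value:

```python
import numpy as np

N = 512                    # illustrative frame length
n = np.arange(N)
k = np.arange(N // 2 + 1)  # a real-input DFT keeps N//2 + 1 bins

# Fixed basis matrices, shape (N//2 + 1, N); in an export these are constants.
cos_mat = np.cos(2 * np.pi * np.outer(k, n) / N)
sin_mat = np.sin(2 * np.pi * np.outer(k, n) / N)

frame = np.random.default_rng(0).standard_normal(N)
real = cos_mat @ frame   # real part of each frequency bin
imag = -sin_mat @ frame  # imaginary part (e^{-i·} gives the minus sign)

# Matches the FFT-based reference to floating-point precision.
ref = np.fft.rfft(frame)
assert np.allclose(real, ref.real, atol=1e-8)
assert np.allclose(imag, ref.imag, atol=1e-8)
```

The inverse transform used in place of `torch.istft` works the same way with the transposed bases.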

## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone the code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download the models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

## Android Integration

These models are designed for Android deployment via the ExecuTorch Android SDK. Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a 10–50× speedup over the CPU timings below.

## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1 s |
| T3 prefill | ~22 s |
| T3 decode (~100 tokens) | ~800 s total (~8 s/token) |
| S3Gen encoder | ~2 s |
| CFM (2 steps) | ~40 s |
| HiFiGAN | ~10 s/chunk |
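Summing the stage timings above gives a rough end-to-end CPU estimate. T3 decode dominates; the 3-chunk HiFiGAN count here is an assumption for illustration:

```python
# Back-of-the-envelope total from the per-stage CPU timings above.
timings_s = {
    "voice_encoding": 1,
    "t3_prefill": 22,
    "t3_decode": 100 * 8,  # ~100 tokens at ~8 s/token
    "s3gen_encoder": 2,
    "cfm": 40,             # 2 flow-matching steps
    "hifigan": 3 * 10,     # assumed 3 chunks at ~10 s each
}
total_s = sum(timings_s.values())
print(f"~{total_s} s (~{total_s / 60:.0f} min) per utterance on CPU")  # ~895 s (~15 min)
```

This is why the QNN/NPU delegation mentioned above matters for interactive use on Android.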

## License

The model weights are derived from Resemble AI's Chatterbox; refer to the original Chatterbox license for the weights' usage terms. The export pipeline code is MIT-licensed.