Upload 9 ExecuTorch .pte models (FP16, 2.6GB total)

Files changed:
- .gitattributes +9 -0
- README.md +153 -0
- cfm_step.pte +3 -0
- hifigan.pte +3 -0
- s3gen_encoder.pte +3 -0
- t3_cond_enc.pte +3 -0
- t3_cond_speech_emb.pte +3 -0
- t3_decode.pte +3 -0
- t3_prefill.pte +3 -0
- voice_encoder.pte +3 -0
- xvector_encoder.pte +3 -0
.gitattributes CHANGED

@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+cfm_step.pte filter=lfs diff=lfs merge=lfs -text
+hifigan.pte filter=lfs diff=lfs merge=lfs -text
+s3gen_encoder.pte filter=lfs diff=lfs merge=lfs -text
+t3_cond_enc.pte filter=lfs diff=lfs merge=lfs -text
+t3_cond_speech_emb.pte filter=lfs diff=lfs merge=lfs -text
+t3_decode.pte filter=lfs diff=lfs merge=lfs -text
+t3_prefill.pte filter=lfs diff=lfs merge=lfs -text
+voice_encoder.pte filter=lfs diff=lfs merge=lfs -text
+xvector_encoder.pte filter=lfs diff=lfs merge=lfs -text
README.md ADDED

@@ -0,0 +1,153 @@
---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS – ExecuTorch Models

Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).

**Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub

---
## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to 24 kHz waveform, with zero PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow-matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |

---
## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```

---
## Pipeline Overview

```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24 kHz PCM waveform
```

---
## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF's `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels): about half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
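The static KV-cache write in the first note can be sketched in a few lines. This is a minimal illustration of the `torch.where` technique, not the exported model's actual code; the `kv_cache_write` name and the shapes are assumptions:

```python
import torch

def kv_cache_write(cache: torch.Tensor, new_kv: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Functional cache update: return a new cache with `new_kv` written at `pos`.

    cache:  [n_heads, max_len, head_dim], fixed-size buffer (static shape)
    new_kv: [n_heads, 1, head_dim], key/value vector for the current step
    pos:    scalar int64 tensor holding the current decode position
    """
    max_len = cache.shape[1]
    # Boolean mask that is True only at the slot being written this step.
    mask = (torch.arange(max_len) == pos).view(1, max_len, 1)
    # torch.where broadcasts new_kv into the masked slot; every other slot
    # keeps its old value. No in-place mutation, so torch.export can trace it.
    return torch.where(mask, new_kv, cache)
```

Because the cache is an ordinary fixed-shape input and output, each decode step's updated cache can simply be fed back in as the next step's input, with no `DynamicCache` object involved.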
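The manual real-valued DFT in the second note can likewise be reproduced compactly. The sketch below (an illustration of the technique, not the exported HiFiGAN code) frames the signal, applies a window, and multiplies by precomputed cosine/sine matrices; it matches `torch.stft` with `center=False`:

```python
import torch

def dft_bases(n_fft: int):
    """Real and imaginary parts of the DFT basis exp(-2*pi*i*k*n/N), each [n_freq, n_fft]."""
    n_freq = n_fft // 2 + 1
    k = torch.arange(n_freq, dtype=torch.float32).unsqueeze(1)  # [n_freq, 1]
    n = torch.arange(n_fft, dtype=torch.float32).unsqueeze(0)   # [1, n_fft]
    ang = 2.0 * torch.pi * k * n / n_fft
    return torch.cos(ang), -torch.sin(ang)

def manual_stft(x: torch.Tensor, n_fft: int, hop: int, window: torch.Tensor):
    """STFT as plain matmuls, returning (real, imag), each [n_frames, n_freq]."""
    frames = x.unfold(-1, n_fft, hop) * window   # [n_frames, n_fft], windowed
    real_basis, imag_basis = dft_bases(n_fft)
    return frames @ real_basis.T, frames @ imag_basis.T
```

A matrix-multiply DFT costs O(N²) per frame instead of O(N log N), but it lowers to plain `matmul` ops that XNNPACK can delegate, which `torch.stft`/`torch.istft` cannot.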
---
## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone the code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download the models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

---
## Android Integration

These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.
## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1 s |
| T3 prefill | ~22 s |
| T3 decode (~100 tokens) | ~800 s total (~8 s/token) |
| S3Gen encoder | ~2 s |
| CFM (2 steps) | ~40 s |
| HiFiGAN | ~10 s/chunk |
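The per-chunk HiFiGAN timing follows from the fixed `T_MEL=300` input shape: longer mels are zero-padded to a multiple of 300, vocoded chunk by chunk, and the outputs concatenated. A minimal sketch of that chunking (the `samples_per_frame` value and the `vocoder` callable are assumptions, not this repo's exact interface):

```python
import torch
import torch.nn.functional as F

T_MEL = 300  # fixed mel length baked into the exported hifigan.pte

def vocode_chunked(mel: torch.Tensor, vocoder, samples_per_frame: int = 480) -> torch.Tensor:
    """Vocode an [n_mels, T] mel of arbitrary T through a fixed-shape vocoder.

    `vocoder` maps [n_mels, T_MEL] -> [T_MEL * samples_per_frame] audio samples.
    """
    t = mel.shape[1]
    pad = (-t) % T_MEL
    mel = F.pad(mel, (0, pad))            # zero-pad time axis to a chunk multiple
    chunks = mel.split(T_MEL, dim=1)      # fixed-size [n_mels, T_MEL] chunks
    audio = torch.cat([vocoder(c) for c in chunks])
    return audio[: t * samples_per_frame]  # drop samples produced by the padding
```

Hard chunk boundaries can produce audible seams; overlap-add between adjacent chunks is a common mitigation if you hear artifacts.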
---
## License

Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox); refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for the weights' usage terms. The export pipeline code is MIT licensed.
cfm_step.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5ffb90558c0dbada80ac94bd9a6864101cc07826c0f9192dfd0363d190922079
+size 286434240
hifigan.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5239799e82fb2be2aeb63db2de9bef676d2decf7905692a5ccafac7ae3530e2
+size 83634944
s3gen_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b155f79c3e0ce7a024a31a607ea0e0a0c42e91a7a17e409ba9219d88f360e925
+size 185724096
t3_cond_enc.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2e953bfa6e9289a7a97ca527592aed0e6d2cb58b29126d5780410455f091ac3d
+size 18011520
t3_cond_speech_emb.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:66fa3ea798833a6a5fb154d3a40e62329eefe7b7e22ae24bcc8c44927edbfc23
+size 50358144
t3_decode.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:080bfa2ac224539c98286e13a0e378d2af5ae7ef93ad08901ec58f22015baf15
+size 1049700480
t3_prefill.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3ad54c9a30841ae0a9b7fd9c6156e7b84bc324255776d569a6647a24d8bd15eb
+size 1058796928
voice_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9912a653299368950c633376aa47e48370384ab7311e37821288b8b63db1cc91
+size 7583744
xvector_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4bdfe552944f941545dbc7389b8c05e4fd7e7a47913a2dd64aa8c171459ce5eb
+size 28070944