Upload 9 ExecuTorch .pte models (FP16, 2.6GB total)

Files changed:
- .gitattributes +9 -0
- README.md +153 -0
- cfm_step.pte +3 -0
- hifigan.pte +3 -0
- s3gen_encoder.pte +3 -0
- t3_cond_enc.pte +3 -0
- t3_cond_speech_emb.pte +3 -0
- t3_decode.pte +3 -0
- t3_prefill.pte +3 -0
- voice_encoder.pte +3 -0
- xvector_encoder.pte +3 -0
.gitattributes CHANGED

@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+cfm_step.pte filter=lfs diff=lfs merge=lfs -text
+hifigan.pte filter=lfs diff=lfs merge=lfs -text
+s3gen_encoder.pte filter=lfs diff=lfs merge=lfs -text
+t3_cond_enc.pte filter=lfs diff=lfs merge=lfs -text
+t3_cond_speech_emb.pte filter=lfs diff=lfs merge=lfs -text
+t3_decode.pte filter=lfs diff=lfs merge=lfs -text
+t3_prefill.pte filter=lfs diff=lfs merge=lfs -text
+voice_encoder.pte filter=lfs diff=lfs merge=lfs -text
+xvector_encoder.pte filter=lfs diff=lfs merge=lfs -text
README.md ADDED

@@ -0,0 +1,153 @@
---
language:
- multilingual
- en
- zh
- fr
- de
- es
- ja
- ko
- pt
- it
- ru
- ar
- hi
- tr
- pl
- nl
- sv
- da
- fi
- no
- cs
- ro
- hu
tags:
- text-to-speech
- executorch
- on-device
- android
- voice-cloning
- chatterbox
license: apache-2.0
---

# Chatterbox Multilingual TTS – ExecuTorch Models

Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).

**Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub

---
## What's Here

9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to 24 kHz waveform, with zero PyTorch runtime required:

| File | Size | Backend | Precision | Stage |
|------|------|---------|-----------|-------|
| `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
| `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
| `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
| `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
| `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 transformer prefill |
| `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 transformer decode |
| `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
| `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow-matching step |
| `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
| **Total** | **~2.6 GB** | | | |

---
## Quick Download

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "acul3/chatterbox-executorch",
    local_dir="et_models",
    repo_type="model",
)
```

---
## Pipeline Overview

```
Text → MTLTokenizer → text tokens
Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
        ↓
T3 Prefill (LlamaModel, conditioned)
        ↓
T3 Decode (autoregressive, ~100 tokens)
        ↓
S3Gen Encoder (Conformer)
        ↓
CFM Step × 2 (flow matching)
        ↓
HiFiGAN (vocoder, chunked)
        ↓
24 kHz PCM waveform
```

---
## Key Technical Notes

- **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF's `DynamicCache` for `torch.export` compatibility
- **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support
- **T3 models** are FP16 (XNNPACK half-precision kernels): about half the size of FP32 with near-identical quality
- **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
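The static KV-cache write in the first note can be sketched in a few lines. This is a minimal illustration of the `torch.where` technique, not the exported model's actual code; the `kv_cache_write` name and the shapes are assumptions:

```python
import torch

def kv_cache_write(cache: torch.Tensor, new_kv: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Functional cache update: return a new cache with `new_kv` written at `pos`.

    cache:  [n_heads, max_len, head_dim], fixed-size buffer (static shape)
    new_kv: [n_heads, 1, head_dim], key/value vector for the current step
    pos:    scalar int64 tensor holding the current decode position
    """
    max_len = cache.shape[1]
    # Boolean mask that is True only at the slot being written this step.
    mask = (torch.arange(max_len) == pos).view(1, max_len, 1)
    # torch.where broadcasts new_kv into the masked slot; every other slot
    # keeps its old value. No in-place mutation, so torch.export can trace it.
    return torch.where(mask, new_kv, cache)
```

Because the cache is an ordinary fixed-shape input and output, each decode step's updated cache can simply be fed back in as the next step's input, with no `DynamicCache` object involved.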
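The manual real-valued DFT in the second note can likewise be reproduced compactly. The sketch below (an illustration of the technique, not the exported HiFiGAN code) frames the signal, applies a window, and multiplies by precomputed cosine/sine matrices; it matches `torch.stft` with `center=False`:

```python
import torch

def dft_bases(n_fft: int):
    """Real and imaginary parts of the DFT basis exp(-2*pi*i*k*n/N), each [n_freq, n_fft]."""
    n_freq = n_fft // 2 + 1
    k = torch.arange(n_freq, dtype=torch.float32).unsqueeze(1)  # [n_freq, 1]
    n = torch.arange(n_fft, dtype=torch.float32).unsqueeze(0)   # [1, n_fft]
    ang = 2.0 * torch.pi * k * n / n_fft
    return torch.cos(ang), -torch.sin(ang)

def manual_stft(x: torch.Tensor, n_fft: int, hop: int, window: torch.Tensor):
    """STFT as plain matmuls, returning (real, imag), each [n_frames, n_freq]."""
    frames = x.unfold(-1, n_fft, hop) * window   # [n_frames, n_fft], windowed
    real_basis, imag_basis = dft_bases(n_fft)
    return frames @ real_basis.T, frames @ imag_basis.T
```

A matrix-multiply DFT costs O(N²) per frame instead of O(N log N), but it lowers to plain `matmul` ops that XNNPACK can delegate, which `torch.stft`/`torch.istft` cannot.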
---
## Usage

See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)

```bash
# Clone the code
git clone https://github.com/acul3/chatterbox-executorch.git
cd chatterbox-executorch

# Download the models (this repo)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
"

# Run full PTE inference
python test_true_full_pte.py
```

---
## Android Integration

These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:

```kotlin
val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
```

With QNN/NPU delegation on a Snapdragon device, expect a **10–50× speedup** over the CPU timings below.
## Performance (Jetson AGX Orin, CPU only)

| Stage | Time |
|-------|------|
| Voice encoding | ~1 s |
| T3 prefill | ~22 s |
| T3 decode (~100 tokens) | ~800 s total (~8 s/token) |
| S3Gen encoder | ~2 s |
| CFM (2 steps) | ~40 s |
| HiFiGAN | ~10 s/chunk |
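The per-chunk HiFiGAN timing follows from the fixed `T_MEL=300` input shape: longer mels are zero-padded to a multiple of 300, vocoded chunk by chunk, and the outputs concatenated. A minimal sketch of that chunking (the `samples_per_frame` value and the `vocoder` callable are assumptions, not this repo's exact interface):

```python
import torch
import torch.nn.functional as F

T_MEL = 300  # fixed mel length baked into the exported hifigan.pte

def vocode_chunked(mel: torch.Tensor, vocoder, samples_per_frame: int = 480) -> torch.Tensor:
    """Vocode an [n_mels, T] mel of arbitrary T through a fixed-shape vocoder.

    `vocoder` maps [n_mels, T_MEL] -> [T_MEL * samples_per_frame] audio samples.
    """
    t = mel.shape[1]
    pad = (-t) % T_MEL
    mel = F.pad(mel, (0, pad))            # zero-pad time axis to a chunk multiple
    chunks = mel.split(T_MEL, dim=1)      # fixed-size [n_mels, T_MEL] chunks
    audio = torch.cat([vocoder(c) for c in chunks])
    return audio[: t * samples_per_frame]  # drop samples produced by the padding
```

Hard chunk boundaries can produce audible seams; overlap-add between adjacent chunks is a common mitigation if you hear artifacts.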
---
## License

Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox); refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for the weights' usage terms. The export pipeline code is MIT licensed.
cfm_step.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5ffb90558c0dbada80ac94bd9a6864101cc07826c0f9192dfd0363d190922079
+size 286434240
hifigan.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5239799e82fb2be2aeb63db2de9bef676d2decf7905692a5ccafac7ae3530e2
+size 83634944
s3gen_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b155f79c3e0ce7a024a31a607ea0e0a0c42e91a7a17e409ba9219d88f360e925
+size 185724096
t3_cond_enc.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2e953bfa6e9289a7a97ca527592aed0e6d2cb58b29126d5780410455f091ac3d
+size 18011520
t3_cond_speech_emb.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:66fa3ea798833a6a5fb154d3a40e62329eefe7b7e22ae24bcc8c44927edbfc23
+size 50358144
t3_decode.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:080bfa2ac224539c98286e13a0e378d2af5ae7ef93ad08901ec58f22015baf15
+size 1049700480
t3_prefill.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3ad54c9a30841ae0a9b7fd9c6156e7b84bc324255776d569a6647a24d8bd15eb
+size 1058796928
voice_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9912a653299368950c633376aa47e48370384ab7311e37821288b8b63db1cc91
+size 7583744
xvector_encoder.pte ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4bdfe552944f941545dbc7389b8c05e4fd7e7a47913a2dd64aa8c171459ce5eb
+size 28070944