Commit d53caf4 by acul3 · verified · 1 parent: c92972d

Upload 9 ExecuTorch .pte models (FP16, 2.6GB total)

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ cfm_step.pte filter=lfs diff=lfs merge=lfs -text
+ hifigan.pte filter=lfs diff=lfs merge=lfs -text
+ s3gen_encoder.pte filter=lfs diff=lfs merge=lfs -text
+ t3_cond_enc.pte filter=lfs diff=lfs merge=lfs -text
+ t3_cond_speech_emb.pte filter=lfs diff=lfs merge=lfs -text
+ t3_decode.pte filter=lfs diff=lfs merge=lfs -text
+ t3_prefill.pte filter=lfs diff=lfs merge=lfs -text
+ voice_encoder.pte filter=lfs diff=lfs merge=lfs -text
+ xvector_encoder.pte filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ language:
+ - multilingual
+ - en
+ - zh
+ - fr
+ - de
+ - es
+ - ja
+ - ko
+ - pt
+ - it
+ - ru
+ - ar
+ - hi
+ - tr
+ - pl
+ - nl
+ - sv
+ - da
+ - fi
+ - no
+ - cs
+ - ro
+ - hu
+ tags:
+ - text-to-speech
+ - executorch
+ - on-device
+ - android
+ - voice-cloning
+ - chatterbox
+ license: apache-2.0
+ ---
+
+ # Chatterbox Multilingual TTS: ExecuTorch Models
+
+ Pre-exported `.pte` model files for running [Resemble AI's Chatterbox Multilingual TTS](https://github.com/resemble-ai/chatterbox) fully on-device using [ExecuTorch](https://pytorch.org/executorch/).
+
+ **📦 Code & export scripts:** [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch) on GitHub
+
+ ---
+
+ ## What's Here
+
+ 9 ExecuTorch `.pte` files covering the complete TTS pipeline, from text input to a 24 kHz waveform, with no PyTorch runtime required:
+
+ | File | Size | Backend | Precision | Stage |
+ |------|------|---------|-----------|-------|
+ | `voice_encoder.pte` | 7 MB | portable | FP32 | Speaker embedding |
+ | `xvector_encoder.pte` | 27 MB | portable | FP32 | X-vector conditioning |
+ | `t3_cond_speech_emb.pte` | 49 MB | portable | FP32 | Speech token embedding |
+ | `t3_cond_enc.pte` | 18 MB | portable | FP32 | Text/conditioning encoder |
+ | `t3_prefill.pte` | 1010 MB | XNNPACK | **FP16** | T3 Transformer prefill |
+ | `t3_decode.pte` | 1002 MB | XNNPACK | **FP16** | T3 Transformer decode |
+ | `s3gen_encoder.pte` | 178 MB | portable | FP32 | S3Gen Conformer encoder |
+ | `cfm_step.pte` | 274 MB | XNNPACK | FP32 | CFM flow matching step |
+ | `hifigan.pte` | 84 MB | XNNPACK | FP32 | HiFiGAN vocoder |
+ | **Total** | **~2.6 GB** | | | |
+
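The sizes in the table can be cross-checked against the `size` fields of the Git LFS pointer files added in this same commit. The sketch below is only arithmetic on those byte counts; the unit helpers and variable names are illustrative and not part of the repository.

```python
# Byte counts copied from the `size` lines of the Git LFS pointers in this commit.
LFS_SIZES = {
    "voice_encoder.pte": 7_583_744,
    "xvector_encoder.pte": 28_070_944,
    "t3_cond_speech_emb.pte": 50_358_144,
    "t3_cond_enc.pte": 18_011_520,
    "t3_prefill.pte": 1_058_796_928,
    "t3_decode.pte": 1_049_700_480,
    "s3gen_encoder.pte": 185_724_096,
    "cfm_step.pte": 286_434_240,
    "hifigan.pte": 83_634_944,
}

MIB = 1024 ** 2  # binary megabyte
GIB = 1024 ** 3  # binary gigabyte


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / MIB


total_bytes = sum(LFS_SIZES.values())

for name, n_bytes in LFS_SIZES.items():
    print(f"{name:<24} {to_mib(n_bytes):8.1f} MiB  ({n_bytes / 1e6:8.1f} MB decimal)")

print(f"total: {total_bytes} bytes = {total_bytes / GIB:.2f} GiB = {total_bytes / 1e9:.2f} GB decimal")
```

The pointers sum to 2,768,315,040 bytes, so the "~2.6 GB" total corresponds to binary units (about 2.58 GiB); in decimal units the download is closer to 2.77 GB.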
+ ---
+
+ ## Quick Download
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     "acul3/chatterbox-executorch",
+     local_dir="et_models",
+     repo_type="model",
+ )
+ ```
+
+ ---
+
+ ## Pipeline Overview
+
+ ```
+ Text → MTLTokenizer → text tokens
+ Reference Audio → VoiceEncoder + CAMPPlus → speaker conditioning
+ ↓
+ T3 Prefill (LlamaModel, conditioned)
+ ↓
+ T3 Decode (autoregressive, ~100 tokens)
+ ↓
+ S3Gen Encoder (Conformer)
+ ↓
+ CFM Step × 2 (flow matching)
+ ↓
+ HiFiGAN (vocoder, chunked)
+ ↓
+ 24kHz PCM waveform 🎵
+ ```
+
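The "chunked" HiFiGAN stage follows from the fixed export shape stated in this README (`T_MEL=300`). A minimal sketch of splitting a longer mel sequence into fixed-length pieces is shown here. It assumes zero padding of the last chunk and no overlap between chunks; the function name, the 80-bin mel size, and the padding strategy are illustrative assumptions, and the actual inference script in the GitHub repo may handle chunk boundaries differently.

```python
from typing import List, Tuple

HIFIGAN_T_MEL = 300  # fixed frame count of the exported HiFiGAN model (from this README)

Frame = List[float]


def split_into_chunks(
    frames: List[Frame],
    chunk_len: int = HIFIGAN_T_MEL,
    pad_value: float = 0.0,
) -> List[Tuple[List[Frame], int]]:
    """Split a [T][n_mels] mel sequence into fixed-length chunks.

    Returns (chunk, valid_frames) pairs. The last chunk is padded with
    `pad_value` up to `chunk_len`; `valid_frames` tells the caller how much
    of the vocoder output for that chunk to keep.
    """
    if not frames:
        return []
    n_mels = len(frames[0])
    chunks = []
    for start in range(0, len(frames), chunk_len):
        piece = frames[start:start + chunk_len]
        valid = len(piece)
        padding = [[pad_value] * n_mels for _ in range(chunk_len - valid)]
        chunks.append((piece + padding, valid))
    return chunks


# Hypothetical input: 700 frames of an 80-bin mel spectrogram.
mel = [[0.0] * 80 for _ in range(700)]
chunks = split_into_chunks(mel)
print([(len(chunk), valid) for chunk, valid in chunks])
```

Each chunk always has exactly 300 frames, so it matches the static input shape; the `valid` count is what lets the caller trim the audio produced from the padded tail.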
+ ---
+
+ ## Key Technical Notes
+
+ - **T3 Decode** uses a manually unrolled 30-layer Llama forward pass with a static KV cache (`torch.where` writes), bypassing HF `DynamicCache` for `torch.export` compatibility
+ - **HiFiGAN** uses a manual real-valued DFT (cosine/sine matrix multiply) in place of `torch.stft`/`torch.istft`, which XNNPACK doesn't support
+ - **T3 models** are FP16 (XNNPACK half-precision kernels): roughly half the size of FP32 with near-identical quality
+ - **Fixed shapes:** CFM expects `T_MEL=2200`, HiFiGAN expects `T_MEL=300` (use chunked processing for longer audio)
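
The static KV cache idea from the first note can be illustrated without PyTorch. The sketch below is a plain-Python analogue of a `where(mask, new, old)` write: the cache is allocated once at a fixed size and each decode step selects, slot by slot, between the old row and the new one. It is not the repository's export code; the function names and the toy dimensions are hypothetical.

```python
from typing import List

Row = List[float]


def make_cache(max_len: int, head_dim: int) -> List[Row]:
    """Pre-allocate a fixed-shape cache; its size never changes afterwards."""
    return [[0.0] * head_dim for _ in range(max_len)]


def where_write(cache: List[Row], new_row: Row, pos: int) -> List[Row]:
    """Select-style update: out[i] = new_row if i == pos else cache[i].

    The output has the same shape as the input, so nothing is appended or
    concatenated, which is the property a static export graph needs.
    """
    mask = [i == pos for i in range(len(cache))]
    return [list(new_row) if hit else old for hit, old in zip(mask, cache)]


# Hypothetical toy sizes, not the real T3 dimensions.
MAX_LEN, HEAD_DIM = 6, 2
k_cache = make_cache(MAX_LEN, HEAD_DIM)

for step in range(3):
    new_key = [float(step + 1)] * HEAD_DIM  # stand-in for a projected key vector
    k_cache = where_write(k_cache, new_key, pos=step)

print(k_cache)
```

A growing cache (as with a dynamic cache object) changes tensor shapes on every step; the masked write keeps every step shape-identical at the cost of touching the whole buffer.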
+
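The "manual real-valued DFT" mentioned for HiFiGAN rests on the identity X[k] = Σ x[n]·cos(2πkn/N) − i·Σ x[n]·sin(2πkn/N), so the real and imaginary parts of a spectrum can each be computed with one matrix multiply and no complex numbers. The following is a small standard-library sketch of that identity for a single frame, checked against a direct complex DFT. Windowing, framing, and the inverse transform are left out, and all function names are illustrative rather than taken from the export code.

```python
import cmath
import math
from typing import List, Tuple

Matrix = List[List[float]]


def dft_matrices(n_fft: int) -> Tuple[Matrix, Matrix]:
    """Build cosine and negative-sine bases for bins k = 0 .. n_fft // 2."""
    n_bins = n_fft // 2 + 1
    cos_m = [[math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_bins)]
    sin_m = [[-math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_bins)]
    return cos_m, sin_m


def matvec(matrix: Matrix, vec: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def real_dft(frame: List[float]) -> Tuple[List[float], List[float]]:
    """Return (real, imag) parts of the one-sided DFT using only real arithmetic."""
    cos_m, sin_m = dft_matrices(len(frame))
    return matvec(cos_m, frame), matvec(sin_m, frame)


def reference_dft(frame: List[float]) -> List[complex]:
    """Direct complex DFT, used here only to check the real-valued version."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in range(n // 2 + 1)]


frame = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.75, 0.1]
real, imag = real_dft(frame)
expected = reference_dft(frame)
max_err = max(abs(complex(r, i) - e) for r, i, e in zip(real, imag, expected))
print(f"max abs error vs complex DFT: {max_err:.2e}")
```

In an exported model the two basis matrices would be constant tensors, so the transform reduces to ordinary matrix multiplies that a backend such as XNNPACK can run.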
+ ---
+
+ ## Usage
+
+ See the GitHub repo for full inference code: [acul3/chatterbox-executorch](https://github.com/acul3/chatterbox-executorch)
+
+ ```bash
+ # Clone code
+ git clone https://github.com/acul3/chatterbox-executorch.git
+ cd chatterbox-executorch
+
+ # Download models (this repo)
+ python -c "
+ from huggingface_hub import snapshot_download
+ snapshot_download('acul3/chatterbox-executorch', local_dir='et_models', repo_type='model')
+ "
+
+ # Run full PTE inference
+ python test_true_full_pte.py
+ ```
+
+ ---
+
+ ## Android Integration
+
+ These models are designed for Android deployment via the [ExecuTorch Android SDK](https://pytorch.org/executorch/stable/android-setup.html). Load with:
+
+ ```kotlin
+ val module = Module.load(context.filesDir.path + "/t3_prefill.pte")
+ ```
+
+ With QNN/NPU delegation on a Snapdragon device, expect **10–50× speedup** over the CPU timings below.
+
+ ## Performance (Jetson AGX Orin, CPU only)
+
+ | Stage | Time |
+ |-------|------|
+ | Voice encoding | ~1s |
+ | T3 prefill | ~22s |
+ | T3 decode (~100 tokens) | ~800s total (~8s/token) |
+ | S3Gen encoder | ~2s |
+ | CFM (2 steps) | ~40s |
+ | HiFiGAN | ~10s/chunk |
+
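The stage timings can be added up to estimate end-to-end latency. This sketch uses only the approximate numbers from the table, the `T_MEL=300` HiFiGAN chunk size stated in this README, and the 10–50× speedup range quoted for QNN/NPU delegation. Two assumptions are made loudly: the mel frame count passed in is a hypothetical input, and the speedup is applied uniformly to the whole pipeline, which the README does not claim.

```python
import math

# Approximate per-stage CPU timings from the table (seconds).
STAGE_SECONDS = {
    "voice_encoding": 1.0,
    "t3_prefill": 22.0,
    "t3_decode_100_tokens": 800.0,
    "s3gen_encoder": 2.0,
    "cfm_2_steps": 40.0,
}
HIFIGAN_SECONDS_PER_CHUNK = 10.0
HIFIGAN_T_MEL = 300  # frames per HiFiGAN chunk (from this README)


def total_seconds(n_mel_frames: int) -> float:
    """End-to-end estimate: fixed stages plus one HiFiGAN call per 300-frame chunk."""
    n_chunks = math.ceil(n_mel_frames / HIFIGAN_T_MEL)
    return sum(STAGE_SECONDS.values()) + n_chunks * HIFIGAN_SECONDS_PER_CHUNK


def with_speedup(seconds: float, factor: float) -> float:
    """Scale a timing by a uniform speedup factor (an assumption, see lead-in)."""
    return seconds / factor


# Hypothetical utterance that fits in a single HiFiGAN chunk.
cpu_total = total_seconds(n_mel_frames=300)
decode_share = STAGE_SECONDS["t3_decode_100_tokens"] / cpu_total

print(f"CPU total: {cpu_total:.0f} s, T3 decode share: {decode_share:.0%}")
print(f"at 10x: {with_speedup(cpu_total, 10):.1f} s, at 50x: {with_speedup(cpu_total, 50):.1f} s")
```

With one chunk the estimate is about 875 s, of which T3 decode accounts for roughly 91%, so the autoregressive decode loop is the stage where any acceleration matters most.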
+ ---
+
+ ## License
+
+ Model weights are derived from [Resemble AI's Chatterbox](https://github.com/resemble-ai/chatterbox). The export pipeline code is MIT licensed. Please refer to the original [Chatterbox license](https://github.com/resemble-ai/chatterbox/blob/main/LICENSE) for model weights usage terms.
cfm_step.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ffb90558c0dbada80ac94bd9a6864101cc07826c0f9192dfd0363d190922079
+ size 286434240
hifigan.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5239799e82fb2be2aeb63db2de9bef676d2decf7905692a5ccafac7ae3530e2
+ size 83634944
s3gen_encoder.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b155f79c3e0ce7a024a31a607ea0e0a0c42e91a7a17e409ba9219d88f360e925
+ size 185724096
t3_cond_enc.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2e953bfa6e9289a7a97ca527592aed0e6d2cb58b29126d5780410455f091ac3d
+ size 18011520
t3_cond_speech_emb.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:66fa3ea798833a6a5fb154d3a40e62329eefe7b7e22ae24bcc8c44927edbfc23
+ size 50358144
t3_decode.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:080bfa2ac224539c98286e13a0e378d2af5ae7ef93ad08901ec58f22015baf15
+ size 1049700480
t3_prefill.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ad54c9a30841ae0a9b7fd9c6156e7b84bc324255776d569a6647a24d8bd15eb
+ size 1058796928
voice_encoder.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9912a653299368950c633376aa47e48370384ab7311e37821288b8b63db1cc91
+ size 7583744
xvector_encoder.pte ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bdfe552944f941545dbc7389b8c05e4fd7e7a47913a2dd64aa8c171459ce5eb
+ size 28070944
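
Each pointer file above records the SHA-256 digest and byte size of the real `.pte` file, so a downloaded model can be checked against it. The sketch below parses pointer text in the three-line format shown in this commit and compares it with a local file. The helper names and the `et_models/` path are assumptions for illustration (the path mirrors the `local_dir` used in the README's download snippet).

```python
import hashlib
from pathlib import Path
from typing import Dict

# Pointer text for voice_encoder.pte, copied from this commit.
VOICE_ENCODER_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9912a653299368950c633376aa47e48370384ab7311e37821288b8b63db1cc91
size 7583744
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str, block_size: int = 1 << 20) -> bool:
    """Return True if the file's size and SHA-256 match the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected_hex = fields["oid"].partition(":")
    if algo != "sha256" or path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Hash in blocks so the ~1 GB T3 models are never loaded into memory at once.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex


# Only runs the check if the (assumed) download location exists.
model_path = Path("et_models") / "voice_encoder.pte"
if model_path.exists():
    print(model_path, "OK" if matches_pointer(model_path, VOICE_ENCODER_POINTER) else "MISMATCH")
```

Checking the size first is a cheap way to catch truncated downloads before hashing a large file.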