---
license: mit
language:
- fi
pipeline_tag: text-to-speech
tags:
- text-to-speech
- finnish
- onnx
- webgpu
- voice-cloning
base_model:
- ResembleAI/chatterbox
- Finnish-NLP/Chatterbox-Finnish
---

# Chatterbox Finnish — ONNX / WebGPU

Finnish fine-tuned [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) exported to ONNX for browser inference via **WebGPU + transformers.js / ONNX Runtime Web**.

Based on the [Finnish-NLP/Chatterbox-Finnish](https://huggingface.co/Finnish-NLP/Chatterbox-Finnish) fine-tune.

---

## Repository Contents

```
onnx/
  language_model.onnx          # Finnish fine-tuned T3 LM (fp32, ~2 GB)
  language_model.onnx_data
  finnish_cond_emb.bin         # Precomputed Finnish conditioning embedding [1, 34, 1024]
  finnish_cond_emb_meta.json   # Metadata for the above

scripts/
  compare_onnx_vs_pytorch.py   # Browser-worker simulator + PyTorch parity checker
  browser_pipeline_sim.py      # Python mirror of the browser WebGPU worker
  analyze_audio.py             # MOS (Gemini) + WER (Groq Whisper) quality evaluator
  export_finnish_embeddings.py # Exports embed_tokens.onnx + voice_encoder.onnx

samples/
  reference_finnish.wav        # Reference voice for zero-shot Finnish TTS
```
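
As a minimal sketch, the precomputed conditioning embedding can be read with NumPy. The exact metadata schema of `finnish_cond_emb_meta.json` is an assumption here (a `shape` field is guessed, with the documented `[1, 34, 1024]` shape as fallback); check the actual file before relying on it.

```python
import json
import numpy as np

def load_cond_emb(emb_path="onnx/finnish_cond_emb.bin",
                  meta_path="onnx/finnish_cond_emb_meta.json"):
    """Load the precomputed Finnish conditioning embedding.

    Assumes the .bin file is raw little-endian float32 data whose shape is
    recorded in the metadata JSON (an assumption; inspect
    finnish_cond_emb_meta.json for the real field names).
    """
    with open(meta_path) as f:
        meta = json.load(f)
    shape = meta.get("shape", [1, 34, 1024])  # documented shape as fallback
    return np.fromfile(emb_path, dtype=np.float32).reshape(shape)
```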

**Base model components** (from [onnx-community/chatterbox-multilingual-ONNX](https://huggingface.co/onnx-community/chatterbox-multilingual-ONNX)):
- `onnx/speech_encoder.onnx` — reference audio → prompt tokens + speaker embeddings
- `onnx/embed_tokens.onnx` — text token embeddings
- `onnx/conditional_decoder.onnx` — speech tokens → waveform (S3Gen flow + HiFiGAN)

---

## Pipeline Architecture

```
Reference audio (24 kHz)
        │
        ▼
speech_encoder ──────────► prompt_tokens     [1, N]
        │                  speaker_emb       [1, 192]
        │                  speaker_features  [1, T, 80]  (mel)
        │
finnish_cond_emb.bin ────► cond_emb  [1, 34, 1024]  (Finnish voice conditioning)
        │
        ▼
Text ──► EnTokenizer ──► embed_tokens ──► text_embeds [1, T, 1024]
        │
        ▼
Language Model (Finnish T3) — CFG, cfg_weight=0.5
    Conditioned:   [cond_emb | text_embeds | BOS] → speech tokens
    Unconditioned: [cond_emb | zeros       | BOS] → speech tokens
    Final logits = cond + 0.5 * (cond - uncond)
        │
        ▼  generated speech tokens [1, N_gen]
        │
        ├── prepend prompt_tokens ──► [prompt_tokens | generated]  [1, N_prompt + N_gen]
        │
        ▼
conditional_decoder (speaker_emb, speaker_features)
        │
        ▼
waveform (24 kHz)
```
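
The classifier-free-guidance step in the diagram above combines the conditioned and unconditioned logits per decoding step. A minimal sketch of that formula (names are illustrative, not from the repo scripts):

```python
import numpy as np

def apply_cfg(cond_logits, uncond_logits, cfg_weight=0.5):
    """Classifier-free guidance as stated in the pipeline diagram:
    final = cond + cfg_weight * (cond - uncond).

    With cfg_weight=0.5 this pushes the distribution away from the
    unconditioned branch rather than averaging the two.
    """
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```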

> **Critical**: The `conditional_decoder` uses a CosyVoice-style flow model.
> You **must** prepend `prompt_tokens` (from `speech_encoder`) to the generated tokens
> before calling the decoder. Without this, you get ~0.18 s of noise instead of speech.

---

## Python Usage

Install dependencies:
```bash
pip install onnxruntime-gpu huggingface_hub librosa soundfile numpy
# Plus Chatterbox-Finnish for EnTokenizer:
# git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
```

Run the browser-worker simulator (mirrors the WebGPU worker logic):
```bash
# Full parity check (PyTorch vs ONNX)
LD_LIBRARY_PATH=/path/to/cudnn/lib python scripts/compare_onnx_vs_pytorch.py --mode parity

# ONNX only (skip PyTorch)
python scripts/compare_onnx_vs_pytorch.py --mode parity --skip-pytorch

# Component-level debug
python scripts/compare_onnx_vs_pytorch.py --mode debug
```

Key generation parameters (matching `inference_example.py`):

| Parameter | Value |
|---|---|
| `repetition_penalty` | 1.2 |
| `temperature` | 0.8 |
| `exaggeration` | 0.6 |
| `cfg_weight` | 0.5 |
| `min_p` | 0.05 |
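
How these parameters interact in a sampling step can be sketched as below. This is one common interpretation (penalty applied multiplicatively to already-emitted tokens, min-p filtering relative to the top probability), not the reference implementation in `inference_example.py`:

```python
import numpy as np

def sample_next_token(logits, history, temperature=0.8,
                      repetition_penalty=1.2, min_p=0.05, rng=None):
    """Illustrative sampling step using the table's parameters.

    logits:  1-D float array over the speech-token vocabulary
    history: list of token ids already generated
    """
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: dampen tokens that already appeared.
    for t in set(history):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature, then softmax.
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Min-p: drop tokens whose probability is below min_p * max probability.
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```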

---

## Quality Results

Evaluated with Gemini 2.5 Flash (MOS) and Groq Whisper (WER):

| Metric | PyTorch | ONNX |
|---|---|---|
| MOS (1–5) | 3.0 | 3.0 |
| WER | 20% | 20% |
| MFCC cosine | — | 0.996 |
| Duration | 5.98 s | ~5.6 s |

Waveforms differ (mel cosine ~0.65–0.75) due to stochastic sampling and a different conditioning voice, but the phonetic content is nearly identical (MFCC cosine 0.996).
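
The MFCC-cosine figure above can be reproduced in spirit with a plain cosine similarity over flattened feature matrices (e.g. the output of `librosa.feature.mfcc` for each waveform). The exact metric in `analyze_audio.py` may differ; this is an illustrative sketch:

```python
import numpy as np

def feature_cosine(a, b):
    """Cosine similarity between two feature matrices (e.g. MFCCs),
    flattened and truncated to a common length so slightly different
    durations can still be compared."""
    n = min(a.size, b.size)
    a = np.asarray(a, dtype=np.float64).ravel()[:n]
    b = np.asarray(b, dtype=np.float64).ravel()[:n]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```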

---

## Known Limitations

1. **Fixed conditioning voice**: `finnish_cond_emb.bin` was computed from a specific reference recording using the Finnish `cond_enc` weights. Custom reference audio changes the speaker identity (via `speaker_emb` + `speaker_features`) but not the T3 conditioning. A `finnish_cond_enc.onnx` export would fix this — see `scripts/export_finnish_embeddings.py`.

2. **Watermarking skipped**: The PyTorch model applies Perth watermarking; the ONNX pipeline does not.

3. **Minimum token length**: The decoder requires the combined `[prompt_tokens | generated_tokens]` sequence to be at least ~150 tokens; shorter inputs trigger an Expand error.

---

## Related

- [Finnish-NLP/Chatterbox-Finnish](https://huggingface.co/Finnish-NLP/Chatterbox-Finnish) — PyTorch fine-tune + training code
- [onnx-community/chatterbox-multilingual-ONNX](https://huggingface.co/onnx-community/chatterbox-multilingual-ONNX) — base ONNX components
- [ResembleAI/chatterbox](https://github.com/resemble-ai/chatterbox) — original model