RASMUS/Chatterbox-Finnish-ONNX

+---
+license: mit
+language:
+- fi
+pipeline_tag: text-to-speech
+tags:
+- text-to-speech
+- finnish
+- onnx
+- webgpu
+- voice-cloning
+base_model:
+- ResembleAI/chatterbox
+- Finnish-NLP/Chatterbox-Finnish
+---
+# Chatterbox Finnish — ONNX / WebGPU
+A Finnish fine-tune of [Chatterbox TTS](https://github.com/resemble-ai/chatterbox), exported to ONNX for in-browser inference with **WebGPU + transformers.js / ONNX Runtime Web**.
+Based on the [Finnish-NLP/Chatterbox-Finnish](https://huggingface.co/Finnish-NLP/Chatterbox-Finnish) fine-tune.
+---
+## Repository Contents
+```
+onnx/
+  language_model.onnx          # Finnish fine-tuned T3 LM (fp32, ~2 GB)
+  language_model.onnx_data     # External weights file for language_model.onnx
+  finnish_cond_emb.bin         # Precomputed Finnish conditioning embedding [1, 34, 1024]
+  finnish_cond_emb_meta.json   # Metadata for the above
+scripts/
+  compare_onnx_vs_pytorch.py   # Browser-worker simulator + PyTorch parity checker
+  browser_pipeline_sim.py      # Python mirror of the browser WebGPU worker
+  analyze_audio.py             # MOS (Gemini) + WER (Groq Whisper) quality evaluator
+  export_finnish_embeddings.py # Exports embed_tokens.onnx + voice_encoder.onnx
+samples/
+  reference_finnish.wav        # Reference voice for zero-shot Finnish TTS
+```
+**Base model components** (from [onnx-community/chatterbox-multilingual-ONNX](https://huggingface.co/onnx-community/chatterbox-multilingual-ONNX)):
+- `onnx/speech_encoder.onnx` — reference audio → prompt tokens + speaker embeddings
+- `onnx/embed_tokens.onnx` — text token embeddings
+- `onnx/conditional_decoder.onnx` — speech tokens → waveform (S3Gen flow + HiFiGAN)
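
Loading the precomputed conditioning embedding could look like the sketch below. The shape `[1, 34, 1024]` comes from the listing above; the `float32` dtype and the raw, header-less layout are assumptions, so check `finnish_cond_emb_meta.json` for the actual values. `load_cond_emb` is a hypothetical helper name, not a function from this repository.

```python
import numpy as np

def load_cond_emb(path, shape=(1, 34, 1024), dtype=np.float32):
    """Read a raw binary tensor and reshape it (dtype and layout are assumed)."""
    flat = np.fromfile(path, dtype=dtype)
    expected = int(np.prod(shape))
    # Fail loudly if the file size does not match the expected element count.
    if flat.size != expected:
        raise ValueError(f"expected {expected} values, found {flat.size}")
    return flat.reshape(shape)
```

A call such as `load_cond_emb("onnx/finnish_cond_emb.bin")` would then return the `cond_emb` array used in the pipeline.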
+---
+## Pipeline Architecture
+```
+Reference audio (24 kHz)
+    │
+    ▼
+speech_encoder ──────────► prompt_tokens [1, N]
+    │                       speaker_emb  [1, 192]
+    │                       speaker_features [1, T, 80]  (mel)
+    │
+finnish_cond_emb.bin ─────► cond_emb [1, 34, 1024]  (Finnish voice conditioning)
+    │
+    ▼
+Text ──► EnTokenizer ──► embed_tokens ──► text_embeds [1, T, 1024]
+    │
+    ▼
+Language Model (Finnish T3) — CFG, cfg_weight=0.5
+  Conditioned:   [cond_emb | text_embeds | BOS] → speech tokens
+  Unconditioned: [cond_emb | zeros       | BOS] → speech tokens
+  Final logits = cond + 0.5 * (cond - uncond)
+    │
+    ▼ generated speech tokens [1, N_gen]
+    │
+    ├── prepend prompt_tokens ──► [prompt_tokens | generated] [1, N_prompt + N_gen]
+    │
+    ▼
+conditional_decoder (speaker_emb, speaker_features)
+    │
+    ▼
+waveform (24 kHz)
+```
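
The classifier-free guidance (CFG) step in the diagram can be sketched in NumPy as follows. The function names are hypothetical, and running both branches as a single batch of two is an assumption about how a worker might arrange the inputs; only the sequence layout and the logits formula come from the diagram.

```python
import numpy as np

def build_cfg_batch(cond_emb, text_embeds, bos_embed):
    """Stack the conditioned and unconditioned input sequences into a batch of 2."""
    conditioned = np.concatenate([cond_emb, text_embeds, bos_embed], axis=1)
    # The unconditioned branch keeps cond_emb and BOS but zeroes the text embeddings.
    unconditioned = np.concatenate(
        [cond_emb, np.zeros_like(text_embeds), bos_embed], axis=1
    )
    return np.concatenate([conditioned, unconditioned], axis=0)

def apply_cfg(cond_logits, uncond_logits, cfg_weight=0.5):
    """Combine logits as cond + cfg_weight * (cond - uncond)."""
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```

With `cfg_weight=0.5`, tokens that the text-conditioned branch favors more strongly than the unconditioned branch receive a boost of half the difference.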
+> **Critical**: The `conditional_decoder` uses a CosyVoice-style flow model.
+> You **must** prepend `prompt_tokens` (from `speech_encoder`) to the generated tokens
+> before calling the decoder. Without this, you get ~0.18 s of noise instead of speech.
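
A minimal sketch of the prepend step, assuming both token arrays have shape `[1, N]` and that the decoder takes `int64` token IDs (the dtype is an assumption). The helper name is hypothetical, and the 150-token threshold is the approximate figure from Known Limitations, not an exact constant.

```python
import numpy as np

MIN_DECODER_TOKENS = 150  # approximate threshold quoted under Known Limitations

def build_decoder_tokens(prompt_tokens, generated_tokens,
                         min_tokens=MIN_DECODER_TOKENS):
    """Return [prompt_tokens | generated_tokens], shape [1, N_prompt + N_gen]."""
    combined = np.concatenate([prompt_tokens, generated_tokens], axis=1)
    combined = combined.astype(np.int64)  # assumed decoder input dtype
    # Catch too-short inputs before they reach the decoder.
    if combined.shape[1] < min_tokens:
        raise ValueError(
            f"decoder input has {combined.shape[1]} tokens, "
            f"needs at least about {min_tokens}"
        )
    return combined
```

Checking the length here gives a readable error instead of a failure inside the decoder graph.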
+---
+## Python Usage
+Install dependencies:
+```bash
+pip install onnxruntime-gpu huggingface_hub librosa soundfile numpy
+# Plus Chatterbox-Finnish for EnTokenizer:
+# git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
+```
+Run the browser-worker simulator (mirrors the WebGPU worker logic):
+```bash
+# Full parity check (PyTorch vs ONNX)
+LD_LIBRARY_PATH=/path/to/cudnn/lib python scripts/compare_onnx_vs_pytorch.py --mode parity
+# ONNX only (skip PyTorch)
+python scripts/compare_onnx_vs_pytorch.py --mode parity --skip-pytorch
+# Component-level debug
+python scripts/compare_onnx_vs_pytorch.py --mode debug
+```
+Key generation parameters (matching `inference_example.py`):
+| Parameter | Value |
+|---|---|
+| `repetition_penalty` | 1.2 |
+| `temperature` | 0.8 |
+| `exaggeration` | 0.6 |
+| `cfg_weight` | 0.5 |
+| `min_p` | 0.05 |
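
To show how three of these parameters interact, here is a sketch of one sampling step over the CFG-combined logits. It assumes the conventions of the Hugging Face logits processors (repetition penalty, then temperature, then min-p) and is not taken from the scripts in this repository; `sample_next_token` is a hypothetical name. `exaggeration` and `cfg_weight` are applied elsewhere in the pipeline and are not handled here.

```python
import numpy as np

def sample_next_token(logits, generated, rng,
                      repetition_penalty=1.2, temperature=0.8, min_p=0.05):
    """Sample one token ID from a 1-D logits vector."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Repetition penalty: make already generated tokens less likely.
    if len(generated) > 0:
        seen = np.unique(np.asarray(generated))
        vals = logits[seen]
        logits[seen] = np.where(vals > 0, vals / repetition_penalty,
                                vals * repetition_penalty)
    # Temperature scaling followed by a numerically stable softmax.
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Min-p: drop tokens whose probability is below min_p * max probability.
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

A seeded generator such as `np.random.default_rng(0)` makes runs repeatable, which helps when comparing the ONNX and PyTorch paths.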
+---
+## Quality Results
+MOS was estimated by Gemini 2.5 Flash and WER was computed from Groq Whisper transcripts:
+| | PyTorch | ONNX |
+|---|---|---|
+| MOS (1–5) | 3.0 | 3.0 |
+| WER | 20% | 20% |
+| MFCC cosine | — | 0.996 |
+| Duration | 5.98 s | ~5.6 s |
+The waveforms differ (mel cosine similarity ~0.65–0.75) because sampling is stochastic and the two pipelines use a different conditioning voice, but the phonetic content is nearly identical (MFCC cosine similarity 0.996).
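
For reference, word error rate is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below is a generic implementation under that definition, not the code in `scripts/analyze_audio.py`; the function name and the lowercase/whitespace normalization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming over substitutions, insertions, deletions.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a five-word reference gives 0.2, the same scale as the 20% in the table.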
+---
+## Known Limitations
+1. **Fixed conditioning voice**: `finnish_cond_emb.bin` was computed from one specific reference recording using the Finnish `cond_enc` weights. Custom reference audio changes the speaker identity (via `speaker_emb` + `speaker_features`) but not the T3 conditioning. Exporting a `finnish_cond_enc.onnx` would fix this; see `scripts/export_finnish_embeddings.py`.
+2. **Watermarking skipped**: The PyTorch model applies Perth watermarking; the ONNX pipeline does not.
+3. **Minimum token length**: The decoder requires the combined `[prompt_tokens | generated_tokens]` sequence to be at least ~150 tokens long; shorter inputs fail with an `Expand` error.
+---
+## Related
+- [Finnish-NLP/Chatterbox-Finnish](https://huggingface.co/Finnish-NLP/Chatterbox-Finnish) — PyTorch fine-tune + training code
+- [onnx-community/chatterbox-multilingual-ONNX](https://huggingface.co/onnx-community/chatterbox-multilingual-ONNX) — base ONNX components
+- [ResembleAI/chatterbox](https://github.com/resemble-ai/chatterbox) — original model