---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---
# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base model and fine-tuning on 16,604 Finnish samples, the model generalizes zero-shot to unseen speakers: on held-out voices, the word error rate drops from 28.94% to 2.76%.
## 🚀 Performance (Zero-Shot OOD)
The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x Lower Error Rate** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |
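*Note: MOS was estimated automatically with the Gemini 3 Flash API rather than by a human listening panel, and WER was calculated from Faster-Whisper Finnish Large v3 transcripts. A MOS of 4.34 on this automated scale suggests "Professional Grade" output, but it is a model-based estimate and should be read as such.*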
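
For readers who want to check the arithmetic behind the table, here is a minimal sketch of word-level WER and of the improvement ratio. It assumes plain lowercasing and whitespace tokenization; the text normalization used in the actual evaluation is not documented here, and the helper names `word_error_rate` and `relative_reduction` are illustrative, not part of this repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, finetuned: float) -> float:
    """How many times smaller the fine-tuned error rate is."""
    return baseline / finetuned


# One substituted word out of four reference words.
print(word_error_rate("hyvää huomenta kaikille teille",
                      "hyvää huomenta kaikille meille"))  # 0.25

# Ratio behind the "~10.5x" figure in the table.
print(round(relative_reduction(28.94, 2.76), 1))  # 10.5
```

In the reported setup, the hypothesis would be the Faster-Whisper transcript of the synthesized audio and the reference would be the input text.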
---
## 🎧 Audio Comparison (OOD Speakers)
Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | | |
| **cv-15_16** | | |
| **cv-15_2** | | |
---
## 🛠 Data Processing & Transparency
The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.
* **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
* **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
* **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.
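
The speaker exclusion described above can be sketched as a split by speaker ID with a leakage check. This is only an illustration: the record layout (a `speaker_id` key per sample) and the function name `split_by_speaker` are assumptions, not the format of `attribution.csv` or the actual preprocessing code.

```python
# Speaker IDs listed in this model card as held out for OOD testing.
HELD_OUT_SPEAKERS = frozenset({"cv-15_11", "cv-15_16", "cv-15_2"})


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Route every sample from a held-out speaker to the OOD set."""
    train, ood = [], []
    for sample in samples:
        target = ood if sample["speaker_id"] in held_out else train
        target.append(sample)

    # Guard against leakage: no held-out speaker may remain in training.
    leaked = {s["speaker_id"] for s in train} & held_out
    if leaked:
        raise RuntimeError(f"held-out speakers leaked into training: {leaked}")
    return train, ood


# Hypothetical records, for illustration only.
samples = [
    {"speaker_id": "cv-15_11", "text": "Hyvää huomenta."},
    {"speaker_id": "cv-15_40", "text": "Kiitos paljon."},
    {"speaker_id": "cv-15_2", "text": "Nähdään huomenna."},
]
train, ood = split_by_speaker(samples)
print(len(train), len(ood))  # 1 2
```

Splitting by speaker rather than by utterance is what makes the evaluation zero-shot: a random utterance-level split would let the same voice appear on both sides.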
---
## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning
As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).
### Results & Optimization
We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.
**Best Parameters for Finnish:**
* `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
* `temperature`: 0.8
* `exaggeration`: 0.5
* `cfg_weight`: 0.3
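
A parameter sweep of this kind can be sketched as an exhaustive grid search. The contents of `sweep_params.py` are not shown in this card, so everything below is an assumption: the grid values are hypothetical, and `toy_score` is a placeholder that merely peaks at the reported settings so the example runs without a model. A real sweep would synthesize audio for each combination and score it (for example with MOS).

```python
from itertools import product

# Hypothetical candidate values; the grid actually swept is not documented.
GRID = {
    "repetition_penalty": [1.2, 1.5, 2.0],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.3, 0.5, 0.7],
    "cfg_weight": [0.3, 0.5],
}


def sweep(grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder scorer: distance to the settings reported in this card."""
    reported = {"repetition_penalty": 1.5, "temperature": 0.8,
                "exaggeration": 0.5, "cfg_weight": 0.3}
    return -sum(abs(params[k] - reported[k]) for k in reported)


best_params, best_score = sweep(GRID, toy_score)
print(best_params)
```

With four parameters the grid grows multiplicatively (54 combinations here), so a real sweep would likely evaluate each combination on only a handful of holdout sentences.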
### Research Samples (Cloned Voice)
* **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)
*Note: The single-speaker weights are not included in this repository.*
---
## 💻 Hardware & Infrastructure
* **Platform**: Verda (NVIDIA A100 80GB)
* **Mixed Precision**: BF16 for stability.
* **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.
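
To illustrate what a 10-token repetition threshold can mean in practice, here is one plausible reading as a standalone sketch: generation is flagged only once the last 10 speech tokens are identical. This is not the actual `AlignmentStreamAnalyzer` code, whose logic may differ; the class name `RepetitionGuard` and its interface are hypothetical.

```python
from collections import deque


class RepetitionGuard:
    """Flags a stuck decoder once `threshold` identical tokens arrive in a row."""

    def __init__(self, threshold: int = 10):
        if threshold < 2:
            raise ValueError("threshold must be at least 2")
        self.threshold = threshold
        # Only the most recent `threshold` tokens are kept.
        self._recent = deque(maxlen=threshold)

    def update(self, token_id: int) -> bool:
        """Record a token; return True if generation should be stopped."""
        self._recent.append(token_id)
        window_full = len(self._recent) == self.threshold
        return window_full and len(set(self._recent)) == 1


guard = RepetitionGuard(threshold=10)
# Nine identical tokens (e.g. a sustained long vowel) pass the guard.
flags = [guard.update(42) for _ in range(9)]
print(any(flags))        # False
# The tenth identical token trips it.
print(guard.update(42))  # True
```

The presumed motivation is that Finnish long vowels and geminate consonants can legitimately produce short runs of repeated tokens, so a stricter guard would risk truncating valid speech.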
---
## 🏃 Running Inference
```python
from src.chatterbox_.mtl_tts import ChatterboxMultilingualTTS

# 1. Load the engine
engine = ChatterboxMultilingualTTS.from_local("./pretrained_models", device="cuda")

# 2. Inject weights (e.g., best_finnish_multilingual_cp986.safetensors)
# engine.t3.load_state_dict(...)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
    language_id="fi",
    audio_prompt_path="path/to/reference.wav",
    repetition_penalty=1.5,
    temperature=0.8,
    exaggeration=0.5,
    cfg_weight=0.3,
)
```
---
## 🙏 Acknowledgments & Credits
- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-Speaker Fine-Tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** on the **Nordic AI** Discord for the dataset insights.
## Disclaimer
- **Don't use this model to do bad things.**