---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---
# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base model and fine-tuning on a large Finnish corpus, it generalizes zero-shot to unseen speakers: on held-out voices, WER drops from 28.94% to 2.76% and MOS rises from 2.29 to 4.34.
## 🚀 Performance (Zero-Shot OOD)
The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x lower WER** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |
*Note: MOS was evaluated automatically using the Gemini 3 Flash API, and WER was calculated from Faster-Whisper Finnish Large v3 transcripts. We interpret the 4.34 MOS as "professional grade" output that approaches human speech.*
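
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference text and the ASR transcript, divided by the number of reference words. The sketch below is a minimal pure-Python illustration, not the evaluation script used for this model; the text normalization (lowercasing and whitespace splitting) and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
wer = word_error_rate(
    "hyvää huomenta kaikille kuulijoille",
    "hyvää huomenta kaikille kuulijoilla",
)

# Ratio between the baseline and fine-tuned WER values reported in the table.
reduction_factor = 28.94 / 2.76
```

Production evaluations usually add punctuation stripping and number normalization before scoring, which this sketch leaves out.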
---
## 🎧 Audio Comparison (OOD Speakers)
Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_16** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_2** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_finetuned.wav" type="audio/wav"></audio>|
---
## 🛠 Data Processing & Transparency
The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.
* **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
* **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
* **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.
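
The speaker exclusion described above can be sketched as a split on speaker ID plus a leakage check on the training manifest. This is an illustration only: the record layout (`speaker_id`, `path`), the function names, and the extra speaker `cv-15_3` are hypothetical and are not taken from this repository's data pipeline.

```python
HELD_OUT_SPEAKERS = frozenset({"cv-15_11", "cv-15_16", "cv-15_2"})


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Return (train, ood) so that held-out speakers never reach training."""
    train = [s for s in samples if s["speaker_id"] not in held_out]
    ood = [s for s in samples if s["speaker_id"] in held_out]
    return train, ood


def find_leaked_speakers(train_samples, held_out=HELD_OUT_SPEAKERS):
    """List held-out speaker IDs that appear in a training manifest."""
    return sorted({s["speaker_id"] for s in train_samples} & held_out)


# Hypothetical manifest entries (field names are assumptions).
samples = [
    {"speaker_id": "cv-15_11", "path": "clip_a.wav"},
    {"speaker_id": "cv-15_3", "path": "clip_b.wav"},
    {"speaker_id": "cv-15_2", "path": "clip_c.wav"},
    {"speaker_id": "cv-15_3", "path": "clip_d.wav"},
]
train, ood = split_by_speaker(samples)
leaked = find_leaked_speakers(train)
```

Splitting by speaker rather than by clip is what keeps the OOD test meaningful: a random per-clip split would let the same voice appear on both sides.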
---
## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning
As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).
### Results & Optimization
We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.
**Best Parameters for Finnish:**
* `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
* `temperature`: 0.8
* `exaggeration`: 0.5
* `cfg_weight`: 0.3
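
The contents of `sweep_params.py` are not shown here, so the following is only a generic grid-search sketch of how such a sweep could be organized. The grid values and the `toy_score` function are assumptions: the toy scorer simply rewards closeness to the parameters listed above so that the example runs, whereas a real sweep would synthesize audio for each combination and score it (for example with MOS and WER).

```python
from itertools import product

# Hypothetical search grid; the values actually swept are not documented.
GRID = {
    "repetition_penalty": [1.2, 1.5, 2.0],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.3, 0.5, 0.7],
    "cfg_weight": [0.3, 0.5],
}


def sweep(grid, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scorer (assumption): peaks at the parameters reported above.
REPORTED = {
    "repetition_penalty": 1.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "cfg_weight": 0.3,
}


def toy_score(params):
    return -sum(abs(params[key] - REPORTED[key]) for key in REPORTED)


best_params, best_score = sweep(GRID, toy_score)
```

This grid has 3 × 3 × 3 × 2 = 54 combinations; with real synthesis and scoring in the loop, the grid size is what dominates the cost of a sweep.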
### Research Samples (Cloned Voice)
* **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)
*Note: The single-speaker weights are not included in this repository.*
---
## 💻 Hardware & Infrastructure
* **Platform**: Verda (NVIDIA A100 80GB)
* **Mixed Precision**: BF16 for stability.
* **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.
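
One plausible reading of the repetition guard is a check on how many identical tokens have been generated in a row. The sketch below illustrates that idea only; it is not the code in `AlignmentStreamAnalyzer`, and the function names and token IDs are hypothetical.

```python
def trailing_run_length(tokens):
    """Count how many times the last token repeats at the end of the sequence."""
    if not tokens:
        return 0
    last = tokens[-1]
    run = 0
    for token in reversed(tokens):
        if token != last:
            break
        run += 1
    return run


def should_stop(tokens, threshold=10):
    """Flag runaway repetition once the trailing run reaches the threshold."""
    return trailing_run_length(tokens) >= threshold


# Hypothetical speech-token IDs: a short run passes, a long run is flagged.
short_run = [41, 7, 7, 7]
long_run = [41, 12] + [7] * 10
```

Under this reading, a higher threshold tolerates the short legitimate runs that long vowels and geminate consonants may produce, while still cutting off genuinely stuck generations.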
---
## 🚀 Quick Start
### Option A — Dev Container (recommended)
Open this repo in VS Code with the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by `postCreateCommand`.
### Option B — Manual Setup
```bash
# 1. Clone (with LFS for model weights)
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish
# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh
# 3. Download pretrained base models from ResembleAI
python setup.py
# 4. Run inference
python inference_example.py
```
> **GPU compatibility:** The install script detects your GPU and picks the right PyTorch build automatically:
> - **Blackwell (sm_120+)** e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
> - **Older GPUs (A100, RTX 30/40xx, etc.)** → PyTorch 2.5.1 + CUDA 12.4
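
The selection rule in the note above can be expressed as a small function over the CUDA compute capability. This is a sketch of the rule as stated, not the logic in `install_dependencies.sh` (which is not shown); the function name is hypothetical. On a machine with PyTorch already installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair used here.

```python
def pick_torch_build(major: int, minor: int) -> tuple[str, str]:
    """Map a CUDA compute capability to (pytorch_version, cuda_version)."""
    if (major, minor) >= (12, 0):  # sm_120 and newer, e.g. RTX PRO 6000
        return ("2.10.0", "12.8")
    return ("2.5.1", "12.4")  # older GPUs such as A100 (sm_80)


a100_build = pick_torch_build(8, 0)        # A100 is sm_80
blackwell_build = pick_torch_build(12, 0)  # sm_120
```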
---
## 🏃 Running Inference
```python
import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)
# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)
# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
```
Or just run the included example script directly:
```bash
python inference_example.py # outputs output_finnish.wav
```
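
Step 2 of the Python example strips the `t3.` prefix from checkpoint keys and loads them with `strict=False`, which skips mismatched keys without raising. The sketch below shows the remapping on a plain dictionary together with a simple key comparison that can reveal silent mismatches. The key names are hypothetical and do not reflect the real checkpoint layout. In PyTorch, `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` that can be inspected for the same purpose.

```python
def strip_prefix(state, prefix="t3."):
    """Remove a leading prefix from keys that have it; leave other keys as-is."""
    return {
        key[len(prefix):] if key.startswith(prefix) else key: value
        for key, value in state.items()
    }


def key_mismatch(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical key names, with integers standing in for tensors.
checkpoint = {
    "t3.tfmr.layers.0.weight": 1,
    "t3.text_emb.weight": 2,
    "speech_head.bias": 3,
}
remapped = strip_prefix(checkpoint)

model_keys = ["tfmr.layers.0.weight", "text_emb.weight", "speech_emb.weight"]
missing, unexpected = key_mismatch(model_keys, remapped)
```

If `missing` or `unexpected` is non-empty after loading the real checkpoint, some weights were not applied, and the model may behave like the untuned base for those layers.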
---
## 🙏 Acknowledgments & Credits
- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-speaker fine-tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** at **Nordic AI** Discord for the dataset insights.
## Disclaimer
- **Don't use this model to do bad things.**