---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---

# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This repository hosts a fine-tuned version of the Chatterbox TTS model, optimized specifically for Finnish. Starting from the multilingual base and training on a diverse Finnish corpus, the model generalizes zero-shot to unseen speakers, reaching 2.76% WER and 4.34 MOS on held-out Common Voice speakers.

## 🚀 Performance (Zero-Shot OOD)

The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x Lower WER** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |

*Note: MOS was scored automatically using the Gemini 3 Flash API, and WER was calculated on transcripts produced by Faster-Whisper Finnish Large v3. The 4.34 MOS indicates a "Professional Grade" output comparable to human speech.*
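
For readers who want to sanity-check the table, WER is conventionally the word-level edit distance between the reference text and the ASR transcript, divided by the number of reference words. The sketch below is a minimal pure-Python version and is not the evaluation script used for this card; the function name `word_error_rate` and the example sentences are illustrative assumptions, and the text normalization applied in the real evaluation is not documented here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
example_wer = word_error_rate("hyvää huomenta kaikille ystäville",
                              "hyvää huomenta kaikille ystävälle")

# Ratio of the two WER values reported in the table (28.94% and 2.76%).
wer_reduction_factor = 28.94 / 2.76
```

The ratio of the two reported WER values is where the roughly 10.5x figure in the table comes from.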

---

## 🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.

| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_16** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_2** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_finetuned.wav" type="audio/wav"></audio>|

---

## 🛠 Data Processing & Transparency

The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.

*   **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
*   **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
*   **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.
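
The speaker-level exclusion described above can be sketched as a split keyed on speaker ID rather than on individual clips. This is only an illustration under assumptions: the record layout (`speaker_id`, `text`) and the non-held-out speaker names are hypothetical, and the actual filtering code is not part of this card. Only the three held-out IDs come from the list above.

```python
# Speaker IDs taken from the list above; everything else is illustrative.
HELD_OUT_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Route every clip of a held-out speaker to the OOD test set."""
    train, ood_test = [], []
    for sample in samples:
        target = ood_test if sample["speaker_id"] in held_out else train
        target.append(sample)
    # Guard against leakage: no held-out speaker may appear in training data.
    leaked = {s["speaker_id"] for s in train} & set(held_out)
    if leaked:
        raise RuntimeError(f"held-out speakers leaked into training: {leaked}")
    return train, ood_test


# Hypothetical records standing in for rows of the training corpus.
samples = [
    {"speaker_id": "cv-15_11", "text": "Hyvää huomenta."},
    {"speaker_id": "cv-15_3", "text": "Kiitos paljon."},
    {"speaker_id": "parliament_001", "text": "Arvoisa puhemies."},
    {"speaker_id": "cv-15_2", "text": "Nähdään huomenna."},
]
train_set, ood_set = split_by_speaker(samples)
```

Splitting by speaker instead of by clip is what makes the test "zero-shot": a random clip-level split would let the same voice appear on both sides.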

---

## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning

As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset). 

### Results & Optimization
We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.

**Best Parameters for Finnish:**
*   `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
*   `temperature`: 0.8
*   `exaggeration`: 0.5
*   `cfg_weight`: 0.3
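
`sweep_params.py` is not reproduced in this card, so the following is only a guess at the general shape of such a sweep: enumerate a grid of generation settings, score each combination, and keep the best one. The grid values and the `score_fn` callback are hypothetical. In a real run, `score_fn` would presumably synthesize audio with the candidate settings and return a MOS estimate; here a toy scorer stands in so the sketch runs without a model.

```python
from itertools import product

# Hypothetical search grid; only REPORTED_BEST below comes from the card.
GRID = {
    "repetition_penalty": [1.2, 1.5, 1.8],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.5, 0.7],
    "cfg_weight": [0.3, 0.5],
}

REPORTED_BEST = {
    "repetition_penalty": 1.5,
    "temperature": 0.8,
    "exaggeration": 0.5,
    "cfg_weight": 0.3,
}


def sweep(grid, score_fn):
    """Exhaustively score every combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in scorer: prefers settings close to the reported best values."""
    return -sum(abs(params[k] - REPORTED_BEST[k]) for k in REPORTED_BEST)


best_params, best_score = sweep(GRID, toy_score)
```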

### Research Samples (Cloned Voice)
*   **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)

*Note: The single-speaker weights are not included in this repository.*

---

## 💻 Hardware & Infrastructure

*   **Platform**: Verda (NVIDIA A100 80GB)
*   **Mixed Precision**: BF16 for stability.
*   **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.
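
To illustrate what a repetition threshold of this kind does, here is a minimal sketch of a consecutive-token guard. It is not the logic of `AlignmentStreamAnalyzer`, which is not shown in this card; the function name and the interpretation of "10 tokens" as ten identical tokens in a row are assumptions made for the example.

```python
def exceeds_repetition_threshold(tokens, threshold=10):
    """Return True once the same token id occurs `threshold` times in a row."""
    run_length = 0
    previous = None
    for token in tokens:
        # Extend the current run if the token repeats, otherwise start a new run.
        run_length = run_length + 1 if token == previous else 1
        previous = token
        if run_length >= threshold:
            return True
    return False


# A run of nine identical tokens passes; a run of ten trips the guard.
short_run_flagged = exceeds_repetition_threshold([7] * 9)
long_run_flagged = exceeds_repetition_threshold([7] * 10)
```

Under this reading, a higher threshold tolerates the short repeated runs that long vowels and geminates may produce, while still stopping runaway generation.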

---

## 🚀 Quick Start

### Option A — Dev Container (recommended)

Open this repo in VS Code with the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by `postCreateCommand`.

### Option B — Manual Setup

```bash
# 1. Clone (enable Git LFS first so the model weights are fetched)
git lfs install
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish

# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh

# 3. Download pretrained base models from ResembleAI
python setup.py

# 4. Run inference
python inference_example.py
```

> **GPU compatibility:** The install script detects your GPU and picks the right PyTorch build automatically:
> - **Blackwell (sm_120+)** e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
> - **Older GPUs (A100, RTX 30/40xx, etc.)** → PyTorch 2.5.1 + CUDA 12.4
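
The selection rule in the note above can be written down as a small function. This is a Python sketch of the decision only, not the contents of `install_dependencies.sh`; the function name and return format are assumptions. The input is a `(major, minor)` compute capability tuple, which is the shape returned by `torch.cuda.get_device_capability()`.

```python
def pick_pytorch_build(compute_capability):
    """Map a (major, minor) compute capability to the builds named above."""
    major, minor = compute_capability
    sm = major * 10 + minor  # e.g. (12, 0) -> sm_120, (8, 0) -> sm_80
    if sm >= 120:
        # Blackwell (sm_120+), e.g. RTX PRO 6000
        return {"torch": "2.10.0", "cuda": "12.8"}
    # Older GPUs such as A100 (sm_80) or RTX 40xx (sm_89)
    return {"torch": "2.5.1", "cuda": "12.4"}


blackwell_build = pick_pytorch_build((12, 0))
a100_build = pick_pytorch_build((8, 0))
```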

---

## 🏃 Running Inference

```python
import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)

# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)

sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
```

Or just run the included example script directly:

```bash
python inference_example.py  # outputs output_finnish.wav
```
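
Step 2 of the Python example above strips a `t3.` prefix from the checkpoint keys before loading them into `engine.t3`. The toy sketch below shows that remapping on a plain dictionary so it runs without PyTorch; the key names and placeholder values are made up for illustration and are not the real parameter names.

```python
def strip_prefix(state, prefix="t3."):
    """Drop `prefix` from keys that have it; leave other keys untouched."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Hypothetical checkpoint keys; real checkpoints hold tensors, not integers.
checkpoint = {
    "t3.text_emb.weight": 1,
    "t3.speech_head.bias": 2,
    "cond_enc.weight": 3,
}
remapped = strip_prefix(checkpoint)
```

Because `strict=False` skips mismatched keys without raising, it can be worth inspecting the `missing_keys` and `unexpected_keys` fields of the value returned by `load_state_dict` to confirm the fine-tuned weights were actually applied.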

---

## 🙏 Acknowledgments & Credits

- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-speaker fine-tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** on the **Nordic AI** Discord for the dataset insights.

## Disclaimer
- **Don't use this model to do bad things.**