---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---

# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base and training on a large Finnish corpus, the model generalizes zero-shot to unseen speakers: on held-out voices, WER drops from 28.94% to 2.76% and the estimated MOS rises from 2.29 to 4.34.

## 🚀 Performance (Zero-Shot OOD)

The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x Lower Error Rate** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |

*Note: MOS was estimated automatically with the Gemini 3 Flash API, and WER was calculated from Faster-Whisper Finnish Large v3 transcripts. Read the 4.34 MOS as a model-based estimate of near-professional quality, not as the result of a human listening test.*
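
For readers unfamiliar with the metric, the sketch below shows how a word error rate can be computed once an ASR transcript is available. It is a minimal illustration, not the evaluation script used for this model: the lowercasing and whitespace tokenization are assumptions, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one of three words is misrecognized.
reference = "hyvää huomenta kaikille"
hypothesis = "hyvää huomenta kaikile"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")

# Relative change implied by the table above.
baseline_wer, tuned_wer = 28.94, 2.76
print(f"WER reduction factor: {baseline_wer / tuned_wer:.1f}x")
```

In practice the hypothesis string would come from transcribing the synthesized audio, and the reference would be the text that was sent to the TTS model.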

---

## 🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.

| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_16** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_finetuned.wav" type="audio/wav"></audio>|
| **cv-15_2** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_finetuned.wav" type="audio/wav"></audio>|

---

## 🛠 Data Processing & Transparency

The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.

*   **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
*   **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
*   **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.
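
The speaker exclusion described above can be sketched as a speaker-level split with a leakage check. This is an illustration only: the `speaker_id` and `path` fields and the sample rows are hypothetical, since the actual schema of `attribution.csv` is not shown here.

```python
# Speaker IDs held out for OOD evaluation (taken from the list above).
HELD_OUT_SPEAKERS = frozenset({"cv-15_11", "cv-15_16", "cv-15_2"})


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Return (train, ood_test) so that no held-out speaker appears in train."""
    train, ood_test = [], []
    for sample in samples:
        target = ood_test if sample["speaker_id"] in held_out else train
        target.append(sample)
    # Guard against leakage: training speakers must not overlap the held-out set.
    train_speakers = {s["speaker_id"] for s in train}
    if not train_speakers.isdisjoint(held_out):
        raise ValueError("held-out speaker leaked into the training split")
    return train, ood_test


# Hypothetical manifest rows for demonstration.
samples = [
    {"speaker_id": "cv-15_11", "path": "a.wav"},
    {"speaker_id": "cv-15_7", "path": "b.wav"},
    {"speaker_id": "cv-15_2", "path": "c.wav"},
]
train, ood_test = split_by_speaker(samples)
print(len(train), len(ood_test))
```

Splitting by speaker rather than by utterance is what makes the test set out-of-distribution: a random utterance-level split would let the same voice appear on both sides.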

---

## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning

As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 model on a single-speaker, high-quality Finnish dataset (GrowthMindset).

### Results & Optimization
We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.

**Best Parameters for Finnish (from the Phase 2 sweep):**
*   `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
*   `temperature`: 0.8
*   `exaggeration`: 0.5
*   `cfg_weight`: 0.3
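
A parameter sweep of this kind can be sketched as an exhaustive grid search. The code below is not `sweep_params.py`: the grid values are hypothetical, and `dummy_score` is a stand-in that simply peaks at the settings listed above. A real run would synthesize audio for each combination and score it with a MOS estimator.

```python
from itertools import product

# Hypothetical search grid; the values actually swept are not documented here.
GRID = {
    "repetition_penalty": [1.2, 1.5, 1.8],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.4, 0.5, 0.6],
    "cfg_weight": [0.3, 0.5],
}


def grid_search(score_fn, grid=GRID):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def dummy_score(params):
    """Stand-in scorer: negative squared distance to the settings listed above."""
    target = {
        "repetition_penalty": 1.5,
        "temperature": 0.8,
        "exaggeration": 0.5,
        "cfg_weight": 0.3,
    }
    return -sum((params[k] - target[k]) ** 2 for k in target)


best, score = grid_search(dummy_score)
print(best)
```

The grid here has 54 combinations; since each real evaluation involves synthesis plus scoring, keeping the grid small or sampling it randomly is a common way to bound the cost.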

### Research Samples (Cloned Voice)
*   **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)

*Note: The single-speaker weights are not included in this repository.*

---

## 💻 Hardware & Infrastructure

*   **Platform**: Verda (NVIDIA A100 80GB)
*   **Mixed Precision**: BF16 for stability.
*   **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.
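
To make the idea of a repetition threshold concrete, here is a minimal sketch of a consecutive-repeat check. It is not the `AlignmentStreamAnalyzer` implementation: the source states only that the threshold is 10 tokens, so the function name and the assumption that repeats are counted as an unbroken run of the same token ID are illustrative.

```python
def has_runaway_repetition(tokens, threshold=10):
    """Return True if the same token ID occurs `threshold` or more times in a row."""
    run_length = 0
    previous = None
    for token in tokens:
        # Extend the current run on a repeat, otherwise start a new run of length 1.
        run_length = run_length + 1 if token == previous else 1
        previous = token
        if run_length >= threshold:
            return True
    return False


print(has_runaway_repetition([7] * 9))    # run of 9 stays under the threshold
print(has_runaway_repetition([7] * 10))   # run of 10 reaches it
```

Under this reading, a higher threshold tolerates longer runs of identical tokens before generation is cut off, which is the kind of trade-off the Finnish-specific setting above adjusts.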

---

## 🚀 Quick Start

### Option A — Dev Container (recommended)

Open this repo in VS Code with the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension. Everything — dependencies, base model weights, GPU detection — is handled automatically by `postCreateCommand`.

### Option B — Manual Setup

```bash
# 1. Clone (requires Git LFS for the model weights)
git lfs install
git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
cd Chatterbox-Finnish

# 2. Install dependencies (auto-detects your GPU architecture)
bash install_dependencies.sh

# 3. Download pretrained base models from ResembleAI
python setup.py

# 4. Run inference
python inference_example.py
```

> **GPU compatibility:** The install script detects your GPU and picks the right PyTorch build automatically:
> - **Blackwell (sm_120+)** e.g. RTX PRO 6000 → PyTorch 2.10.0 + CUDA 12.8
> - **Older GPUs (A100, RTX 30/40xx, etc.)** → PyTorch 2.5.1 + CUDA 12.4
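
The selection rule in the note above can be expressed as a small function over the CUDA compute capability. This is a sketch of the rule as stated, not the contents of `install_dependencies.sh`; the function name and return shape are hypothetical. With PyTorch already installed, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def pick_torch_build(compute_capability):
    """Map a (major, minor) compute capability to the build named in the note above."""
    major, minor = compute_capability
    sm = major * 10 + minor  # e.g. (12, 0) -> sm_120, (8, 0) -> sm_80
    if sm >= 120:
        # Blackwell and newer
        return {"torch": "2.10.0", "cuda": "12.8"}
    # Older architectures such as A100 (sm_80)
    return {"torch": "2.5.1", "cuda": "12.4"}


print(pick_torch_build((8, 0)))   # A100
print(pick_torch_build((12, 0)))  # Blackwell
```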

---

## 🏃 Running Inference

```python
import torch
import soundfile as sf
from src.chatterbox_.tts import ChatterboxTTS
from safetensors.torch import load_file

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the base engine
engine = ChatterboxTTS.from_local("./pretrained_models", device=device)

# 2. Inject Finnish fine-tuned weights
checkpoint = load_file("./models/best_finnish_multilingual_cp986.safetensors")
t3_state = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint.items()}
engine.t3.load_state_dict(t3_state, strict=False)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä.",
    audio_prompt_path="./samples/reference_finnish.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)

sf.write("output.wav", wav.squeeze().cpu().numpy(), engine.sr)
```

Or just run the included example script directly:

```bash
python inference_example.py  # outputs output_finnish.wav
```
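
The weight-injection step strips a `t3.` prefix from checkpoint keys and loads them with `strict=False`, which means mismatched keys are skipped without raising an error. The sketch below reproduces the key renaming on plain dictionaries and shows one way to list mismatches before loading. The helper names and the example key names are hypothetical.

```python
def strip_prefix(state, prefix="t3."):
    """Drop `prefix` from keys that carry it; leave all other keys unchanged."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v for k, v in state.items()}


def key_mismatches(model_keys, checkpoint_keys):
    """Report the keys that a non-strict load would skip."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),     # in the model, absent from the checkpoint
        "unexpected": sorted(checkpoint_keys - model_keys),  # in the checkpoint, unknown to the model
    }


# Hypothetical key names standing in for real tensors.
checkpoint = {"t3.text_emb.weight": 1, "t3.speech_head.weight": 2, "extra.bias": 3}
model_keys = ["text_emb.weight", "speech_head.weight", "speech_head.bias"]

renamed = strip_prefix(checkpoint)
print(sorted(renamed))
print(key_mismatches(model_keys, renamed))
```

With real PyTorch modules, `load_state_dict(..., strict=False)` returns an object with `missing_keys` and `unexpected_keys` attributes, so inspecting that return value gives the same information without a separate helper.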

---

## 🙏 Acknowledgments & Credits

- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-Speaker Fine-Tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** on the **Nordic AI** Discord for the dataset insights.

## Disclaimer
- **Don't use this model to do bad things.**