Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This project fine-tunes the Chatterbox TTS model (based on the Llama architecture) for the Finnish language. Starting from the multilingual base model and optimizing the inference context, the fine-tuned model generalizes zero-shot to unseen Finnish speakers: on held-out speakers, WER drops from 28.94% to 2.76% and MOS rises from 2.29 to 4.34.

🚀 Performance Comparison (Zero-Shot OOD)

The following metrics were calculated on Out-of-Distribution (OOD) speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

Metric	Baseline (Original Multilingual)	Fine-Tuned (Best Step: 986)	Improvement
Avg Word Error Rate (WER)	28.94%	2.76%	~10.5x lower WER
Mean Opinion Score (MOS)	2.29 / 5.0	4.34 / 5.0	+2.05 Quality Points

Note: MOS was scored automatically using the Gemini 3 Flash API, and WER was calculated from transcripts produced by Faster-Whisper Finnish Large v3. We treat the 4.34 MOS as "Professional Grade" output, close to natural human speech.
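For readers unfamiliar with the metric, WER is the word-level edit distance between a reference text and an ASR transcript, divided by the number of reference words. The snippet below is a minimal sketch of that definition, not the evaluation script used for this model; the function name `word_error_rate`, the lowercase/whitespace normalization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Illustrative sentences (not taken from the evaluation set):
reference = "suomen kieli on kaunista kuunneltavaa"
hypothesis = "suomen kieli on kaunis kuunneltavaa"
wer = word_error_rate(reference, hypothesis)  # one substitution over five words

# The improvement factor in the table is the ratio of the two average WERs.
reduction_factor = 28.94 / 2.76
print(f"WER: {wer:.2%}, reduction factor: {reduction_factor:.1f}x")
```

In practice the hypothesis would be the ASR transcript of the synthesized audio, and the per-utterance scores would be averaged across the OOD speakers.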

🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from Out-of-Distribution (OOD) speakers.

Why OOD Testing?

OOD testing is the "Gold Standard" for evaluating zero-shot TTS. It checks that the model hasn't just "memorized" the voices in the training set. Because the test speakers were strictly excluded from all training and validation phases, strong results on them indicate that the model has learned the underlying patterns of spoken Finnish and can apply them to new voices.

Important: The speakers below were never seen by the model during fine-tuning. This is a pure test of generalization.

Speaker ID	Baseline (Generic Multilingual)	Fine-Tuned (Finnish Golden)
cv-15_11	Baseline Audio	Fine-Tuned Audio
cv-15_16	Baseline Audio	Fine-Tuned Audio
cv-15_2	Baseline Audio	Fine-Tuned Audio

The samples above use the same text and reference audio for a fair comparison.

🛠 Data Processing & Transparency

We used a diverse Finnish dataset to teach the model the nuances of Finnish phonetics, including vowel length and gemination (consonant length). The final training set consists of 16,604 samples.

1. Dataset Breakdown

The dataset is a diverse mix of Finnish speech from the following sources:

Mozilla Common Voice (cv-15): Primary source for diverse speaker profiles.
Filmot: Media-based Finnish for natural conversational flow.
YouTube: Modern spoken Finnish.
Parliament: Formal Finnish speech.

2. Zero-Shot Integrity

To keep the evaluation strictly zero-shot, we excluded three speakers (cv-15_11, cv-15_16, cv-15_2) from the training and validation sets. The 4.34 MOS therefore reflects the model's ability to generalize to new Finnish voices rather than its familiarity with the test speakers.
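A speaker-level hold-out of this kind can be sketched as below. This is an illustration under assumptions, not the project's data pipeline: the sample dictionaries, the `speaker_id` key, and the non-held-out IDs (`cv-15_21`, `parliament_7`) are hypothetical. Only the three held-out speaker IDs come from this document.

```python
# Speaker IDs held out for OOD evaluation (from this model card).
HELD_OUT_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Split samples into (train, ood) by exact speaker ID match."""
    train = [s for s in samples if s["speaker_id"] not in held_out]
    ood = [s for s in samples if s["speaker_id"] in held_out]
    return train, ood


# Hypothetical samples; "cv-15_21" shares a prefix with held-out "cv-15_2".
samples = [
    {"speaker_id": "cv-15_2", "text": "Hyvää huomenta."},
    {"speaker_id": "cv-15_21", "text": "Kiitos paljon."},
    {"speaker_id": "cv-15_11", "text": "Mitä kuuluu?"},
    {"speaker_id": "parliament_7", "text": "Arvoisa puhemies."},
]

train, ood = split_by_speaker(samples)

# Guard against leakage: no held-out speaker may appear in the training split.
leaked = {s["speaker_id"] for s in train} & HELD_OUT_SPEAKERS
assert not leaked, f"held-out speakers found in training data: {leaked}"
```

Matching on the exact ID with set membership matters here: a prefix or substring match on `cv-15_2` would also remove a speaker such as `cv-15_21` from the training data.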

3. Traceability & Lineage

Full attribution for the dataset is provided in attribution.csv. This file maps every training sample to its speaker ID and source, ensuring transparency and reproducibility.
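As a rough idea of how such a file could be inspected, the sketch below summarizes samples and speakers per source. The column names (`sample_id`, `speaker_id`, `source`) and the rows are assumptions for illustration; check the header of the real attribution.csv before adapting it.

```python
import csv
import io
from collections import Counter

# Toy stand-in for attribution.csv; column names and rows are hypothetical.
ATTRIBUTION_CSV = """sample_id,speaker_id,source
0001,cv-15_3,Mozilla Common Voice
0002,cv-15_3,Mozilla Common Voice
0003,yt_04,YouTube
0004,parl_12,Parliament
"""


def summarize(csv_text):
    """Return (samples per source, set of speaker IDs per source)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    samples_per_source = Counter(row["source"] for row in rows)
    speakers_per_source = {}
    for row in rows:
        speakers_per_source.setdefault(row["source"], set()).add(row["speaker_id"])
    return samples_per_source, speakers_per_source


samples_per_source, speakers_per_source = summarize(ATTRIBUTION_CSV)
for source, count in samples_per_source.items():
    speakers = len(speakers_per_source[source])
    print(f"{source}: {count} samples, {speakers} speakers")
```

To run this against the real file, replace the `io.StringIO` wrapper with `open("attribution.csv", newline="", encoding="utf-8")` and adjust the column names as needed.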

💻 Hardware & Infrastructure

Training was performed on the Verda platform using an NVIDIA A100 80GB instance. The large VRAM allowed us to use optimal batch sizes and extended speech sequences (up to 1024 tokens) without running into memory limits.

.devcontainer Configuration

We have included the .devcontainer directory to ensure a reproducible environment. It pre-installs all necessary CUDA-optimized libraries and sets up the environment for immediate experimentation.

🔧 Installation & Setup

Environment: Ensure you have Python 3.10+ and CUDA-capable hardware.

Setup:

bash install_dependencies.sh
python setup.py  # Downloads the multilingual base weights

🏃 Running Inference

To generate Finnish speech using the fine-tuned model:

from src.chatterbox_.tts import ChatterboxTTS

# 1. Load the engine
engine = ChatterboxTTS.from_local("./pretrained_models", device="cuda")

# 2. Inject your best fine-tuned weights into the T3 module
# (Best weights: best_finnish_multilingual_cp986.safetensors)
# engine.t3.load_state_dict(...)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
    audio_prompt_path="path/to/reference_voice.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6
)

Optimized Parameters for Finnish

Based on our research, we identified the following settings as the most stable for Finnish phonetics:

repetition_penalty: 1.2
temperature: 0.8
Prompt Window: Increased to 3.0 seconds during inference to capture the melodic cadence of Finnish sentences.
Repetition Guard: Increased to 10 tokens in AlignmentStreamAnalyzer to allow for natural long Finnish vowels without premature audio cutoffs.
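To illustrate why a larger repetition guard helps with long vowels, here is a toy model of such a guard. It is not the actual AlignmentStreamAnalyzer logic: the function `first_cutoff_index`, the `max_repeats` parameter, and the token IDs are hypothetical. It also assumes that a sustained vowel shows up as a run of identical speech tokens.

```python
def first_cutoff_index(tokens, max_repeats):
    """Return the index where generation would be cut off, or None.

    The toy guard stops generation once the same token has been
    emitted more than `max_repeats` times in a row.
    """
    run_length = 0
    previous = None
    for index, token in enumerate(tokens):
        run_length = run_length + 1 if token == previous else 1
        previous = token
        if run_length > max_repeats:
            return index
    return None


# Toy sequence: a long vowel rendered as six identical speech tokens (id 40).
tokens = [12, 40, 40, 40, 40, 40, 40, 7, 93]

strict = first_cutoff_index(tokens, max_repeats=3)    # cuts off inside the vowel
relaxed = first_cutoff_index(tokens, max_repeats=10)  # lets the vowel finish
print(strict, relaxed)
```

With the stricter threshold the guard fires in the middle of the vowel, while the 10-token threshold lets the whole sequence through. The trade-off is that genuine repetition loops are detected a few tokens later.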

🙏 Acknowledgments & Credits

Exploration Foundation: Initial fine-tuning exploration was based on the chatterbox-finetuning toolkit by gokhaneraslan.
Model Authors: Deep thanks to the team at ResembleAI for releasing the Chatterbox TTS model.
Data Sourcing: Special thanks to #Jobik at Nordic AI Discord for introducing Filmot, which was instrumental in sourcing high-quality media-based Finnish data.

Disclaimer

Don't use this model to do bad things.


Model tree for RASMUS/Chatterbox-Finnish

Base model: ResembleAI/chatterbox

Evaluation results

Word Error Rate (WER) on Mozilla Common Voice 15.0 (Finnish OOD), test set, self-reported: 2.76
Mean Opinion Score (MOS) on Mozilla Common Voice 15.0 (Finnish OOD), test set, self-reported: 4.34