Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
This project fine-tunes the Chatterbox TTS model (based on the Llama architecture) specifically for Finnish. Starting from the multilingual base and optimizing the inference context, the fine-tuned model generalizes zero-shot to unseen Finnish speakers: on held-out voices, WER drops from 28.94% to 2.76% and MOS rises from 2.29 to 4.34.
Performance Comparison (Zero-Shot OOD)
The following metrics were calculated on Out-of-Distribution (OOD) speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
| Metric | Baseline (Original Multilingual) | Fine-Tuned (Best Step: 986) | Improvement |
|---|---|---|---|
| Avg Word Error Rate (WER) | 28.94% | 2.76% | ~10.5x lower WER |
| Mean Opinion Score (MOS) | 2.29 / 5.0 | 4.34 / 5.0 | +2.05 Quality Points |
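Note: MOS was evaluated using the Gemini 3 Flash API, so it is an automated estimate rather than a human listening test. WER was calculated from transcripts produced by Faster-Whisper Finnish Large v3. A MOS of 4.34 corresponds to what we consider "Professional Grade" output, comparable to human speech.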
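For readers who want to check the WER column, the sketch below shows the standard word-level definition: edit distance between the reference text and the ASR transcript, divided by the number of reference words. It assumes lowercasing and whitespace tokenization as the only normalization; the normalization actually used in this evaluation is not documented here, and the example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: one substituted word out of four reference words.
reference = "suomen kieli on kaunista"
hypothesis = "suomen kieli on kaunis"
print(word_error_rate(reference, hypothesis))  # 0.25

# The improvement factor in the table is the ratio of the two WER values.
baseline_wer, finetuned_wer = 28.94, 2.76
print(round(baseline_wer / finetuned_wer, 1))  # 10.5
```

Because the transcript comes from an ASR model, the reported WER also includes any recognition errors made by that model.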
Audio Comparison (OOD Speakers)
Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from Out-of-Distribution (OOD) speakers.
Why OOD Testing?
OOD testing is the "Gold Standard" for evaluating zero-shot TTS because it checks that the model has not simply "memorized" the voices in the training set. Testing on speakers that were strictly excluded from all training and validation phases shows that the model has learned the sound patterns of Finnish itself and can apply them to a voice it has never encountered.
Important: The speakers below were never seen by the model during fine-tuning. This is a pure test of generalization.
| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
|---|---|---|
| cv-15_11 | Baseline Audio | Fine-Tuned Audio |
| cv-15_16 | Baseline Audio | Fine-Tuned Audio |
| cv-15_2 | Baseline Audio | Fine-Tuned Audio |
The samples above use the same text and reference audio for a fair comparison.
Data Processing & Transparency
We utilized a diverse Finnish dataset to teach the model the nuances of Finnish phonetics, including vowel length and gemination. The final training set consists of 16,604 samples.
1. Dataset Breakdown
The dataset is a diverse mix of Finnish speech from the following sources:
- Mozilla Common Voice (cv-15): Primary source for diverse speaker profiles.
- Filmot: Media-based Finnish for natural conversational flow.
- YouTube: Modern spoken Finnish.
- Parliament: Formal Finnish speech.
2. Zero-Shot Integrity
To keep the evaluation strictly zero-shot, we excluded three speakers (cv-15_11, cv-15_16, cv-15_2) from both training and validation. The 4.34 MOS therefore reflects the model's ability to generalize to new Finnish voices rather than its memory of known ones.
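A speaker-level hold-out of this kind can be sketched as follows. This is a minimal illustration, not the project's actual data pipeline: the `speaker_id` field name, the helper names, and the example records other than the three held-out IDs are assumptions.

```python
# The three OOD speaker IDs named in this section.
HELD_OUT_SPEAKERS = frozenset({"cv-15_11", "cv-15_16", "cv-15_2"})


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Split samples into (train, ood) based on the speaker ID alone."""
    train = [s for s in samples if s["speaker_id"] not in held_out]
    ood = [s for s in samples if s["speaker_id"] in held_out]
    return train, ood


def assert_no_leakage(train, held_out=HELD_OUT_SPEAKERS):
    """Raise if any held-out speaker appears in the training split."""
    leaked = {s["speaker_id"] for s in train} & set(held_out)
    if leaked:
        raise ValueError(f"held-out speakers found in training data: {sorted(leaked)}")


# Hypothetical records; real entries would come from the dataset manifest.
samples = [
    {"speaker_id": "cv-15_2", "text": "esimerkki yksi"},
    {"speaker_id": "cv-15_3", "text": "esimerkki kaksi"},
    {"speaker_id": "cv-15_11", "text": "esimerkki kolme"},
]
train, ood = split_by_speaker(samples)
assert_no_leakage(train)
print(len(train), len(ood))  # 1 2
```

Splitting on the speaker ID rather than on individual clips is what keeps every utterance of a held-out voice out of training.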
3. Traceability & Lineage
Full attribution for the dataset is provided in attribution.csv. This file maps every training sample to its speaker ID and source, ensuring transparency and reproducibility.
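As a hedged illustration of how such a file can be inspected, the sketch below counts samples and distinct speakers per source. The column names (`sample_id`, `speaker_id`, `source`) and the example rows are assumptions made for this example; check the header of attribution.csv for the real schema.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; the real file is attribution.csv in the repository.
EXAMPLE_CSV = """sample_id,speaker_id,source
0001,cv-15_3,common_voice
0002,cv-15_3,common_voice
0003,parl_12,parliament
0004,yt_7,youtube
"""


def summarize_attribution(csv_file):
    """Return (samples per source, distinct speakers per source)."""
    samples_per_source = Counter()
    speakers_per_source = {}
    for row in csv.DictReader(csv_file):
        source = row["source"]
        samples_per_source[source] += 1
        speakers_per_source.setdefault(source, set()).add(row["speaker_id"])
    speaker_counts = {src: len(ids) for src, ids in speakers_per_source.items()}
    return samples_per_source, speaker_counts


sample_counts, speaker_counts = summarize_attribution(io.StringIO(EXAMPLE_CSV))
print(dict(sample_counts))
print(speaker_counts)
```

To run this against the real file, pass `open("attribution.csv", newline="", encoding="utf-8")` instead of the in-memory example, adjusting the column names as needed.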
Hardware & Infrastructure
This training was performed on the Verda platform using an NVIDIA A100 80GB instance. This high-VRAM instance allowed us to use optimal batch sizes and extended speech sequences (up to 1024 tokens) without memory constraints.
.devcontainer Configuration
We have included the .devcontainer directory to ensure a reproducible environment. It pre-installs all necessary CUDA-optimized libraries and sets up the environment for immediate experimentation.
Installation & Setup
- Environment: Ensure you have Python 3.10+ and CUDA-capable hardware.
- Setup:
    bash install_dependencies.sh
    python setup.py  # Downloads the multilingual base weights
Running Inference
To generate Finnish speech using the fine-tuned model:
from src.chatterbox_.tts import ChatterboxTTS
# 1. Load the engine
engine = ChatterboxTTS.from_local("./pretrained_models", device="cuda")
# 2. Inject your best finetuned weights
# (Best weights: best_finnish_multilingual_cp986.safetensors)
# engine.t3.load_state_dict(...)
# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
    audio_prompt_path="path/to/reference_voice.wav",
    repetition_penalty=1.2,
    temperature=0.8,
    exaggeration=0.6,
)
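To give some intuition for the `repetition_penalty` and `temperature` arguments in the call above, here is a pure-Python sketch of how these two settings are commonly applied to next-token logits before sampling. It follows the widely used formulation (penalize already-generated tokens, then divide logits by the temperature); whether Chatterbox applies them in exactly this way is an assumption, and the function names and logit values are illustrative.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so the score always moves toward "less likely".
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        value = adjusted[token_id]
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


def softmax_with_temperature(logits, temperature=0.8):
    """Convert logits to probabilities; temperature < 1 sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens; tokens 0 and 2 were already generated.
logits = [2.0, 1.0, -1.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2], penalty=1.2)
probs = softmax_with_temperature(penalized, temperature=0.8)
print(penalized)
print(probs)
```

A penalty above 1.0 discourages loops and stutters, while a temperature below 1.0 concentrates probability on the most likely tokens, which trades some variety for stability.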
Optimized Parameters for Finnish
Based on our research, we identified the following settings as the most stable for Finnish phonetics:
- repetition_penalty: 1.2
- temperature: 0.8
- Prompt Window: Increased to 3.0 seconds during inference to capture the melodic cadence of Finnish sentences.
- Repetition Guard: Increased to 10 tokens in AlignmentStreamAnalyzer to allow for natural long Finnish vowels without premature audio cutoffs.
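The idea behind the repetition guard can be illustrated with a small sketch. This is not the actual `AlignmentStreamAnalyzer` logic, which is not reproduced here; it only assumes the guard fires once the same token has been emitted a fixed number of times in a row. The function name, the token values, and the tighter threshold of 4 used for contrast are all hypothetical.

```python
def should_stop(token_history, max_repeats=10):
    """Return True once the last `max_repeats` tokens are all identical."""
    if len(token_history) < max_repeats:
        return False
    tail = token_history[-max_repeats:]
    return len(set(tail)) == 1


# A long vowel may hold the same speech token for several steps.
long_vowel = [3, 8] + [7] * 6

print(should_stop(long_vowel, max_repeats=4))   # True: a tight guard cuts the vowel short
print(should_stop(long_vowel, max_repeats=10))  # False: the 10-token guard lets it finish
```

A larger threshold tolerates sustained sounds such as long vowels and geminates, at the cost of reacting later when generation is truly stuck in a loop.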
Acknowledgments & Credits
- Exploration Foundation: Initial fine-tuning exploration was based on the chatterbox-finetuning toolkit by gokhaneraslan.
- Model Authors: Deep thanks to the team at ResembleAI for releasing the Chatterbox TTS model.
- Data Sourcing: Special thanks to #Jobik at Nordic AI Discord for introducing Filmot, which was instrumental in sourcing high-quality media-based Finnish data.
Disclaimer
- Don't use this model to do bad things.
Model tree for RASMUS/Chatterbox-Finnish
Base model: ResembleAI/chatterbox
Evaluation results
- Word Error Rate (WER) on Mozilla Common Voice 15.0 (Finnish OOD) test set, self-reported: 2.760
- Mean Opinion Score (MOS) on Mozilla Common Voice 15.0 (Finnish OOD) test set, self-reported: 4.340