| | --- |
| | language: |
| | - fi |
| | license: mit |
| | tags: |
| | - text-to-speech |
| | - tts |
| | - zero-shot |
| | - voice-cloning |
| | - finnish |
| | datasets: |
| | - mozilla-foundation/common_voice_15_0 |
| | base_model: ResembleAI/chatterbox |
| | pipeline_tag: text-to-speech |
| | library_name: pytorch |
| | model-index: |
| | - name: Chatterbox Finnish Fine-Tuned (Step 986) |
| | results: |
| | - task: |
| | type: text-to-speech |
| | name: Text to Speech |
| | dataset: |
| | name: Mozilla Common Voice 15.0 (Finnish OOD) |
| | type: mozilla-foundation/common_voice_15_0 |
| | config: fi |
| | split: test |
| | metrics: |
| | - name: Word Error Rate (WER) |
| | type: wer |
| | value: 2.76 |
| | verified: true |
| | - name: Mean Opinion Score (MOS) |
| | type: mos |
| | value: 4.34 |
| | --- |
| | |
| | # Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS |
| |
|
| | This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base model and fine-tuning on 16,604 Finnish samples, the model generalizes zero-shot to unseen speakers, reaching 2.76% WER and 4.34 MOS on held-out voices.
| |
|
| | ## Performance (Zero-Shot OOD)
| |
|
| | The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before. |
| |
|
| | | Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement | |
| | | :--- | :---: | :---: | :---: | |
| | | **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x lower WER** |
| | | **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** | |
| |
|
| | *Note: MOS was scored automatically using the Gemini 3 Flash API, and WER was calculated from transcripts produced by Faster-Whisper Finnish Large v3. A MOS of 4.34 sits in the range usually associated with natural-sounding, professional-grade speech.*
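
For readers who want to reproduce the WER column, the following is a minimal sketch of the standard word-level definition (edit distance divided by reference length) together with the improvement factor from the table. It assumes plain lowercase-and-split normalization; the text normalization used in the actual evaluation is not documented here, and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def improvement_factor(baseline_wer: float, finetuned_wer: float) -> float:
    """How many times lower the fine-tuned WER is than the baseline WER."""
    return baseline_wer / finetuned_wer


# Hypothetical transcript pair: one substitution plus one deletion over four words.
example_wer = word_error_rate("hyvää huomenta kaikille kuulijoille", "hyvää iltaa kaikille")

# Figures from the table above: 28.94% baseline vs 2.76% fine-tuned.
factor = improvement_factor(28.94, 2.76)
```

With the table's figures, `factor` comes out at roughly 10.5, which is where the "~10.5x" entry comes from.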
| |
|
| | --- |
| |
|
| | ## Audio Comparison (OOD Speakers)
| |
|
| | Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**. |
| |
|
| | | Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) | |
| | | :--- | :--- | :--- | |
| | | **cv-15_11** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_finetuned.wav" type="audio/wav"></audio>| |
| | | **cv-15_16** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_finetuned.wav" type="audio/wav"></audio>| |
| | | **cv-15_2** | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_baseline.wav" type="audio/wav"></audio>| <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_finetuned.wav" type="audio/wav"></audio>| |
| | |
| | --- |
| | |
| | ## Data Processing & Transparency
| | |
| | The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination. |
| | |
| | * **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
| | * **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing. |
| | * **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`. |
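
The speaker exclusion described above can be sketched as a simple manifest filter. This is an illustrative sketch only: the row layout (`speaker_id`, `path`), the clip paths, and the non-held-out speaker IDs are hypothetical, and the real filtering lineage is the one recorded in `attribution.csv`.

```python
# Speakers reserved for out-of-distribution testing (IDs taken from the list above).
HELD_OUT_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}


def split_by_speaker(rows, held_out=HELD_OUT_SPEAKERS):
    """Route every clip from a held-out speaker to the OOD set, the rest to training."""
    train, ood = [], []
    for row in rows:
        (ood if row["speaker_id"] in held_out else train).append(row)
    # Guard against leakage: no held-out speaker may remain in the training rows.
    leaked = {row["speaker_id"] for row in train} & set(held_out)
    if leaked:
        raise RuntimeError(f"held-out speakers leaked into training: {sorted(leaked)}")
    return train, ood


# Hypothetical manifest rows for illustration.
manifest = [
    {"speaker_id": "cv-15_11", "path": "clips/a.wav"},
    {"speaker_id": "cv-15_3", "path": "clips/b.wav"},
    {"speaker_id": "cv-15_2", "path": "clips/c.wav"},
    {"speaker_id": "cv-15_7", "path": "clips/d.wav"},
]

train_rows, ood_rows = split_by_speaker(manifest)
```

Splitting on speaker ID rather than on individual clips is what keeps the test zero-shot: a random clip-level split would let the same voice appear on both sides.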
| | |
| | --- |
| | |
| | ## Phase 2 Research: Single-Speaker Fine-Tuning
| | |
| | As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset). |
| | |
| | ### Results & Optimization |
| | We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**. |
| | |
| | **Best Parameters for Finnish:** |
| | * `repetition_penalty`: 1.5 (Balanced for Finnish long vowels) |
| | * `temperature`: 0.8 |
| | * `exaggeration`: 0.5 |
| | * `cfg_weight`: 0.3 |
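
A parameter sweep of this kind can be sketched as a grid search over the four sampling settings. This is not the contents of `sweep_params.py`, which is not shown here; the grid values other than the reported best settings are hypothetical, and `toy_score` is a stand-in so the sketch runs. A real sweep would synthesize the holdout phrases for each configuration and score the audio.

```python
from itertools import product

# Hypothetical search grid; only 1.5 / 0.8 / 0.5 / 0.3 come from the results above.
GRID = {
    "repetition_penalty": [1.2, 1.5, 2.0],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.5],
    "cfg_weight": [0.3, 0.5],
}


def iter_configs(grid):
    """Yield one dict per combination of grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def best_config(grid, score_fn):
    """Return the configuration with the highest score."""
    return max(iter_configs(grid), key=score_fn)


def toy_score(config):
    """Stand-in scorer: negative distance to the reported best settings.

    Replace with a function that generates audio and returns a quality score.
    """
    reported = {"repetition_penalty": 1.5, "temperature": 0.8,
                "exaggeration": 0.5, "cfg_weight": 0.3}
    return -sum(abs(config[key] - reported[key]) for key in reported)


winner = best_config(GRID, toy_score)
```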
| | |
| | ### Research Samples (Cloned Voice) |
| | * **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav) |
| | |
| | *Note: The single-speaker weights are not included in this repository.* |
| | |
| | --- |
| | |
| | ## Hardware & Infrastructure
| | |
| | * **Platform**: Verda (NVIDIA A100 80GB) |
| | * **Mixed Precision**: BF16 for stability. |
| | * **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology. |
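
To illustrate what a repetition threshold of this kind does, here is a minimal sketch of a guard that fires only after the same token has been emitted 10 times in a row. It is an assumption about the general idea, not the actual logic inside `AlignmentStreamAnalyzer`, and the function names are hypothetical.

```python
def trailing_run_length(tokens):
    """Count how many times the final token repeats at the end of the sequence."""
    if not tokens:
        return 0
    run = 1
    for token in reversed(tokens[:-1]):
        if token != tokens[-1]:
            break
        run += 1
    return run


def should_stop(tokens, threshold=10):
    """Signal a stop once the same token has repeated `threshold` times in a row."""
    return trailing_run_length(tokens) >= threshold


# A run of 9 identical tokens is tolerated; the 10th triggers the guard.
tolerated = should_stop([3, 5] + [7] * 9)
triggered = should_stop([3, 5] + [7] * 10)
```

A lower threshold would cut generation short on short legitimate runs, which is presumably why the text ties the higher value to Finnish phonology.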
| | |
| | --- |
| | |
| | ## Running Inference
| | |
| | ```python |
| | from src.chatterbox_.mtl_tts import ChatterboxMultilingualTTS |
| | |
| | # 1. Load the engine |
| | engine = ChatterboxMultilingualTTS.from_local("./pretrained_models", device="cuda") |
| | |
| | # 2. Inject weights (e.g., best_finnish_multilingual_cp986.safetensors) |
| | # engine.t3.load_state_dict(...) |
| | |
| | # 3. Generate with Finnish-optimized parameters |
| | wav = engine.generate( |
| |     text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
| |     language_id="fi",
| |     audio_prompt_path="path/to/reference.wav",  # reference clip of the voice to clone
| |     repetition_penalty=1.5,
| |     temperature=0.8,
| |     exaggeration=0.5,
| |     cfg_weight=0.3,
| | ) |
| | ``` |
| | |
| | --- |
| | |
| | ## Acknowledgments & Credits
| | |
| | - **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan. |
| | - **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model. |
| | - **Single-Speaker Fine-Tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
| | - **Data Sourcing**: Thanks to **#Jobik** on the **Nordic AI** Discord for the dataset insights.
| | |
| | ## Disclaimer |
| | - **Don't use this model to do bad things.** |
| | |