---
language:
- fi
license: mit
tags:
- text-to-speech
- tts
- zero-shot
- voice-cloning
- finnish
datasets:
- mozilla-foundation/common_voice_15_0
base_model: ResembleAI/chatterbox
pipeline_tag: text-to-speech
library_name: pytorch
model-index:
- name: Chatterbox Finnish Fine-Tuned (Step 986)
  results:
  - task:
      type: text-to-speech
      name: Text to Speech
    dataset:
      name: Mozilla Common Voice 15.0 (Finnish OOD)
      type: mozilla-foundation/common_voice_15_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 2.76
      verified: true
    - name: Mean Opinion Score (MOS)
      type: mos
      value: 4.34
---

# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

This repository hosts a fine-tuned version of the Chatterbox TTS model, optimized for the Finnish language. By leveraging a multilingual base and large-scale Finnish data, we achieved strong zero-shot generalization to unseen speakers: 2.76% WER and 4.34 MOS on held-out voices.

## 🚀 Performance (Zero-Shot OOD)

The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
| :--- | :---: | :---: | :---: |
| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x Lower WER** |
| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |

*Note: MOS was evaluated automatically using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3. We interpret the 4.34 MOS as "Professional Grade" output, comparable to human speech.*

---

## 🎧 Audio Comparison (OOD Speakers)

Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers **never seen during training**.

| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
| :--- | :--- | :--- |
| **cv-15_11** | (audio sample) | (audio sample) |
| **cv-15_16** | (audio sample) | (audio sample) |
| **cv-15_2** | (audio sample) | (audio sample) |

---

## 🛠 Data Processing & Transparency

The model was trained on a diverse corpus of **16,604 samples** to capture the nuances of Finnish phonetics, including vowel length and gemination.

* **Sources**: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
* **Zero-Shot Integrity**: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
* **Traceability**: Full attribution and filtering lineage are provided in `attribution.csv`.

---

## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning

As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).

### Results & Optimization

We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of **4.63 MOS**.

**Best Parameters for Finnish:**

* `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
* `temperature`: 0.8
* `exaggeration`: 0.5
* `cfg_weight`: 0.3

### Research Samples (Cloned Voice)

* **Everyday Phrases**: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)

*Note: The single-speaker weights are not included in this repository.*

---

## 💻 Hardware & Infrastructure

* **Platform**: Verda (NVIDIA A100 80GB)
* **Mixed Precision**: BF16 for stability.
* **Repetition Guard**: Custom threshold of **10 tokens** in `AlignmentStreamAnalyzer` to support Finnish phonology.

---

## 🏃 Running Inference

```python
from src.chatterbox_.mtl_tts import ChatterboxMultilingualTTS

# 1. Load the engine
engine = ChatterboxMultilingualTTS.from_local("./pretrained_models", device="cuda")

# 2. Inject weights (e.g., best_finnish_multilingual_cp986.safetensors)
# engine.t3.load_state_dict(...)

# 3. Generate with Finnish-optimized parameters
wav = engine.generate(
    text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
    language_id="fi",
    audio_prompt_path="path/to/reference.wav",
    repetition_penalty=1.5,
    temperature=0.8,
    exaggeration=0.5,
    cfg_weight=0.3
)
```

---

## 🙏 Acknowledgments & Credits

- **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
- **Model Authors**: Deep thanks to the team at **ResembleAI** for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
- **Single-Speaker Fine-Tuning**: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
- **Data Sourcing**: Thanks to **#Jobik** at **Nordic AI** Discord for the dataset insights.

## Disclaimer

- **Don't use this model to do bad things.**
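
As a footnote to the Performance section: WER is the word-level edit distance between the reference text and the ASR transcript, divided by the number of reference words. The snippet below is a minimal sketch of that metric, not the evaluation script used for this model. The function name `word_error_rate` is hypothetical, and the lowercase, whitespace-split normalization is an assumption; the actual pipeline (Faster-Whisper transcripts plus whatever text normalization was applied) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized by lowercasing and splitting on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Ratio behind the "~10.5x" figure, using the WER values reported above.
baseline_wer = 28.94
finetuned_wer = 2.76
wer_reduction_factor = baseline_wer / finetuned_wer
```

In practice, `reference` would be the prompt text and `hypothesis` the ASR transcript of the generated audio; averaging the per-utterance values over the OOD test set gives a figure comparable to the table, under the normalization assumption stated above.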