	---
	language:
	- fi
	license: mit
	tags:
	- text-to-speech
	- tts
	- zero-shot
	- voice-cloning
	- finnish
	datasets:
	- mozilla-foundation/common_voice_15_0
	base_model: ResembleAI/chatterbox
	pipeline_tag: text-to-speech
	library_name: pytorch
	model-index:
	- name: Chatterbox Finnish Fine-Tuned (Step 986)
	  results:
	  - task:
	      type: text-to-speech
	      name: Text to Speech
	    dataset:
	      name: Mozilla Common Voice 15.0 (Finnish OOD)
	      type: mozilla-foundation/common_voice_15_0
	      config: fi
	      split: test
	    metrics:
	    - name: Word Error Rate (WER)
	      type: wer
	      value: 2.76
	      verified: true
	    - name: Mean Opinion Score (MOS)
	      type: mos
	      value: 4.34
	---

	# Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS

	This repository hosts a fine-tuned version of the Chatterbox TTS model optimized for Finnish. Starting from the multilingual base and training on large-scale Finnish data, the model generalizes zero-shot to unseen speakers: on held-out voices, WER drops from 28.94% to 2.76% and MOS rises from 2.29 to 4.34.

	## 🚀 Performance (Zero-Shot OOD)

	The following metrics were calculated on Out-of-Distribution (OOD) speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.

	| Metric | Baseline (Original Multilingual) | Fine-Tuned (Step 986) | Improvement |
	| :--- | :---: | :---: | :---: |
	| Avg Word Error Rate (WER) | 28.94% | 2.76% | ~10.5x lower error rate |
	| Mean Opinion Score (MOS) | 2.29 / 5.0 | 4.34 / 5.0 | +2.05 points |

	Note: MOS was scored automatically using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3. We read the 4.34 MOS as "Professional Grade" output, comparable to human speech.
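For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed from a reference text and an ASR transcript. It is not the evaluation script used for the table above: the function name `word_error_rate` and the example sentences are illustrative assumptions, and a real pipeline would normally also normalize punctuation before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only (not taken from the evaluation set).
reference = "suomen kieli on kaunista kuunneltavaa"
hypothesis = "suomen kieli on kaunis kuunneltavaa"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # prints "WER: 20.00%"
```

One substituted word out of five reference words gives 20%. By the same definition, the drop from 28.94% to 2.76% in the table is a ratio of roughly 10.5.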

	---

	## 🎧 Audio Comparison (OOD Speakers)

	Listen to the difference between the generic multilingual baseline and our high-fidelity Finnish fine-tuning. These samples are from speakers never seen during training.

	| Speaker ID | Baseline (Generic Multilingual) | Fine-Tuned (Finnish Golden) |
	| :--- | :--- | :--- |
	| cv-15_11 | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_baseline.wav" type="audio/wav"></audio> | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_11_finetuned.wav" type="audio/wav"></audio> |
	| cv-15_16 | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_baseline.wav" type="audio/wav"></audio> | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_16_finetuned.wav" type="audio/wav"></audio> |
	| cv-15_2 | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_baseline.wav" type="audio/wav"></audio> | <audio controls><source src="https://huggingface.co/RASMUS/Chatterbox-Finnish/resolve/main/samples/comparison/cv15_2_finetuned.wav" type="audio/wav"></audio> |

	---

	## 🛠 Data Processing & Transparency

	The model was trained on a diverse corpus of 16,604 samples to capture the nuances of Finnish phonetics, including vowel length and gemination.

	* Sources: Mozilla Common Voice (cv-15, license CC0-1.0), Filmot (CC BY), YouTube (CC BY), and Parliament data (CLARIN PUB +BY +PRIV).
	* Zero-Shot Integrity: Specific speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) were strictly excluded from training to ensure valid OOD testing.
	* Traceability: Full attribution and filtering lineage are provided in `attribution.csv`.
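The speaker-level exclusion described above can be sketched as follows. This is only an illustration of the idea, not the project's actual preprocessing code: the `speaker_id` field name, the `split_by_speaker` helper, and the sample records (including the speaker `cv-15_7`) are assumptions made for the example.

```python
# Held-out speaker IDs as listed in this README.
HELD_OUT_SPEAKERS = {"cv-15_11", "cv-15_16", "cv-15_2"}


def split_by_speaker(samples, held_out=HELD_OUT_SPEAKERS):
    """Route every sample from a held-out speaker to the OOD test set."""
    train, ood_test = [], []
    for sample in samples:
        target = ood_test if sample["speaker_id"] in held_out else train
        target.append(sample)

    # Sanity check: no held-out speaker may leak into the training split.
    leaked = {s["speaker_id"] for s in train} & set(held_out)
    if leaked:
        raise RuntimeError(f"held-out speakers found in training data: {leaked}")
    return train, ood_test


# Hypothetical manifest rows; "cv-15_7" is a made-up training speaker.
samples = [
    {"speaker_id": "cv-15_11", "text": "Hyvää huomenta."},
    {"speaker_id": "cv-15_7", "text": "Kiitos paljon."},
    {"speaker_id": "cv-15_2", "text": "Nähdään huomenna."},
]
train, ood_test = split_by_speaker(samples)
print(len(train), len(ood_test))  # prints "1 2"
```

Splitting by speaker rather than by utterance is what makes the test zero-shot: a random utterance-level split would let the model hear the test voices during training.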

	---

	## 🔬 Phase 2 Research: Single-Speaker Fine-Tuning

	As a separate research phase, we tested the model's capacity for deep voice cloning by fine-tuning the Phase 1 base on a specific high-quality Finnish dataset (GrowthMindset).

	### Results & Optimization
	We used `sweep_params.py` to identify the "Golden Settings" for the most natural Finnish inference. By evaluating against holdout samples and everyday phrases, we achieved a peak quality of 4.63 MOS.

	Best Parameters for Finnish:
	* `repetition_penalty`: 1.5 (Balanced for Finnish long vowels)
	* `temperature`: 0.8
	* `exaggeration`: 0.5
	* `cfg_weight`: 0.3
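A parameter sweep of this kind can be sketched as a plain grid search. The contents of `sweep_params.py` are not shown in this README, so everything below is an assumption: the grid values, the `sweep` helper, and especially `placeholder_score`, which stands in for a real MOS evaluation and is rigged to peak at the settings reported above purely so the example runs without a model.

```python
from itertools import product

# Hypothetical search grid; the README only reports the winning values.
GRID = {
    "repetition_penalty": [1.2, 1.5, 2.0],
    "temperature": [0.6, 0.8, 1.0],
    "exaggeration": [0.5],
    "cfg_weight": [0.3, 0.5],
}


def sweep(grid, score_fn):
    """Score every combination in the grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_score(params):
    """Stand-in for synthesizing audio and rating it; NOT a real quality metric."""
    target = {"repetition_penalty": 1.5, "temperature": 0.8,
              "exaggeration": 0.5, "cfg_weight": 0.3}
    return -sum(abs(params[key] - target[key]) for key in target)


best_params, best_score = sweep(GRID, placeholder_score)
print(best_params)
```

In a real sweep, `score_fn` would generate the holdout and everyday phrases with each parameter set and return the averaged MOS.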

	### Research Samples (Cloned Voice)
	* Everyday Phrases: [Polite Request](eval_results_stage2_lean/checkpoint-16_everyday_0.wav) | [Morning Greeting](eval_results_stage2_lean/checkpoint-16_everyday_2.wav)

	Note: The single-speaker weights are not included in this repository.

	---

	## 💻 Hardware & Infrastructure

	* Platform: Verda (NVIDIA A100 80GB)
	* Mixed Precision: BF16 for stability.
	* Repetition Guard: Custom threshold of 10 tokens in `AlignmentStreamAnalyzer` to support Finnish phonology.
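To illustrate what a repetition threshold does, here is a minimal sketch that flags a token stream once the same token has been emitted a given number of times in a row. This is one plausible reading of the guard, not the actual `AlignmentStreamAnalyzer` logic: the function name `exceeds_repetition_threshold` and the "consecutive identical tokens" rule are assumptions for this example.

```python
def exceeds_repetition_threshold(tokens, threshold=10):
    """Return True once any token appears `threshold` times consecutively."""
    run_length = 0
    previous = object()  # sentinel that never equals a real token
    for token in tokens:
        run_length = run_length + 1 if token == previous else 1
        previous = token
        if run_length >= threshold:
            return True
    return False


# A run of 9 identical tokens stays under the threshold of 10 ...
print(exceeds_repetition_threshold([42] * 9))   # prints "False"
# ... while a run of 10 trips the guard.
print(exceeds_repetition_threshold([42] * 10))  # prints "True"
```

A looser threshold matters for Finnish because long vowels and geminate consonants can legitimately produce short runs of similar speech tokens, which a stricter guard might mistake for a generation loop.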

	---

	## 🏃 Running Inference

	```python
	from src.chatterbox_.mtl_tts import ChatterboxMultilingualTTS

	# 1. Load the engine
	engine = ChatterboxMultilingualTTS.from_local("./pretrained_models", device="cuda")

	# 2. Inject weights (e.g., best_finnish_multilingual_cp986.safetensors)
	# engine.t3.load_state_dict(...)

	# 3. Generate with Finnish-optimized parameters
	wav = engine.generate(
	    text="Suomen kieli on poikkeuksellisen kaunista kuunneltavaa.",
	    language_id="fi",
	    audio_prompt_path="path/to/reference.wav",
	    repetition_penalty=1.5,
	    temperature=0.8,
	    exaggeration=0.5,
	    cfg_weight=0.3,
	)
	```

	---

	## 🙏 Acknowledgments & Credits

	- Exploration Foundation: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
	- Model Authors: Deep thanks to the team at ResembleAI for the [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) model.
	- Single-Speaker Fine-Tuning: Huge thanks to Mape for letting me fine-tune using audio from the [Growth Mindset Builder](https://www.youtube.com/@Growthmindsetbuilder) YouTube channel.
	- Data Sourcing: Thanks to #Jobik on the Nordic AI Discord for the dataset insights.

	## Disclaimer
	- Don't use this model to do bad things.