Upload Finnish Chatterbox model

Files changed (14)

.gitattributes +1 -0
README.md +23 -31
attribution.csv +0 -0
generalization_comparison.png +0 -0
inference_example.py +37 -31
models/best_finnish_multilingual_cp986.safetensors +3 -0
samples/comparison/cv15_11_finetuned.wav +2 -2
samples/comparison/cv15_16_finetuned.wav +2 -2
samples/comparison/cv15_2_finetuned.wav +2 -2
src/__pycache__/config.cpython-311.pyc +0 -0
src/__pycache__/dataset.cpython-311.pyc +0 -0
src/config.py +14 -11
src/dataset.py +35 -17
train.py +215 -213

.gitattributes CHANGED

@@ -41,3 +41,4 @@ samples/comparison/cv15_16_finetuned.wav filter=lfs diff=lfs merge=lfs -text
 samples/comparison/cv15_2_baseline.wav filter=lfs diff=lfs merge=lfs -text
 samples/comparison/cv15_2_finetuned.wav filter=lfs diff=lfs merge=lfs -text
 samples/reference_finnish.wav filter=lfs diff=lfs merge=lfs -text

 samples/comparison/cv15_2_baseline.wav filter=lfs diff=lfs merge=lfs -text
 samples/comparison/cv15_2_finetuned.wav filter=lfs diff=lfs merge=lfs -text
 samples/reference_finnish.wav filter=lfs diff=lfs merge=lfs -text
+generalization_comparison.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -14,7 +14,7 @@ base_model: ResembleAI/chatterbox
 pipeline_tag: text-to-speech
 library_name: pytorch
 model-index:
-- name: Chatterbox Finnish Fine-Tuned (Step 795)
   results:
   - task:
       type: text-to-speech
@@ -27,27 +27,27 @@ model-index:
     metrics:
     - name: Word Error Rate (WER)
       type: wer
-      value: 1.36
       verified: true
     - name: Mean Opinion Score (MOS)
       type: mos
-      value: 4.16
 ---
 # Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
-This project focuses on fine-tuning the Chatterbox TTS model (based on the Llama architecture) specifically for the Finnish language. By leveraging a multilingual base and applying rigorous data quality filtering, we achieved a near-perfect zero-shot generalization to unseen Finnish speakers.
 ## 🚀 Performance Comparison (Zero-Shot OOD)
 The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
-| Metric | Baseline (Original Multilingual) | Fine-Tuned (Best Step: 795) | Improvement |
 | :--- | :---: | :---: | :---: |
-| **Avg Word Error Rate (WER)** | 28.94% | **1.36%** | **~21x Accuracy Increase** |
-| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.16 / 5.0** | **+1.87 Quality Points** |
-*Note: MOS was evaluated using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3.*
 ---
@@ -72,30 +72,27 @@ OOD testing is the "Gold Standard" for evaluating zero-shot TTS. It ensures that
 ## 🛠 Data Processing & Transparency
-We implemented a "Golden Data" strategy to ensure the model learned high-quality Finnish prosody without acoustic artifacts. After strict filtering, the final training set consists of **8,655 high-quality samples**.
-### 1. Multi-Source Dataset Breakdown
-The final dataset is a diverse mix of Finnish speech from the following sources:
-- **Mozilla Common Voice (cv-15)**: 4,348 samples (Diverse crowdsourced voices)
-- **Filmot**: 2,605 samples (Media-based Finnish)
-- **YouTube**: 982 samples (Conversational modern Finnish)
-- **Parliament**: 720 samples (Formal Finnish speech)
-### 2. "Golden" Filtering Logic
-To prevent the model from cloning background noise or learning from single-word clips, we applied the following strict filters in `src/dataset.py`:
-- **Min Duration**: 4.0 seconds (ensures enough context for prosody).
-- **Min SNR**: 35.0 dB (removes low-quality/noisy recordings).
-- **Max SNR**: 100.0 dB (removes sterile/digital noise-gated artifacts).
 ### 3. Traceability & Lineage
-Full lineage is maintained for every training run. The script automatically generates a `dataset_filtering_lineage.csv` in the output directory, detailing exactly which files were excluded and for what reason (`LOW_SNR`, `LOW_DURATION`, or `OOD_SPEAKER`).
 ## 💻 Hardware & Infrastructure
-This training was performed on the **Verda platform** using an **NVIDIA A100 80GB** instance. This high-VRAM instance allowed us to use a larger batch size and 850ms speech sequences without hitting memory limits.
 ### .devcontainer Configuration
-We have included the `.devcontainer` directory to ensure a reproducible environment. It pre-installs all necessary CUDA-optimized libraries and sets up the Jupyter environment for immediate experimentation.
 ---
@@ -121,7 +118,7 @@ from src.chatterbox_.tts import ChatterboxTTS
 engine = ChatterboxTTS.from_local("./pretrained_models", device="cuda")
 # 2. Inject your best finetuned weights
-# (Assuming your best weights are in chatterbox_output/checkpoint-795)
 # engine.t3.load_state_dict(...)
 # 3. Generate with Finnish-optimized parameters
@@ -138,12 +135,8 @@ wav = engine.generate(
 Based on our research, we identified the following settings as the most stable for Finnish phonetics:
 - `repetition_penalty`: 1.2
 - `temperature`: 0.8
-- `Repetition Guard`: Increased to **10 tokens** in `AlignmentStreamAnalyzer` to allow for long Finnish vowels without premature cutoffs.
----
-## 🛡 Repetition Guard Improvements
-A critical fix was applied to `src/chatterbox_/models/t3/inference/alignment_stream_analyzer.py`. The original threshold for token repetition was too sensitive for Finnish (which relies on long vowels). It has been increased from 3 to **10 tokens (~160ms)**, allowing for natural linguistic duration while still preventing infinite generation loops.
 ---
@@ -152,4 +145,3 @@ A critical fix was applied to `src/chatterbox_/models/t3/inference/alignment_str
 - **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
 - **Model Authors**: Deep thanks to the team at **ResembleAI** for releasing the [Chatterbox TTS model](https://huggingface.co/ResembleAI/chatterbox).
 - **Data Sourcing**: Special thanks to **#Jobik** at **Nordic AI** Discord for introducing [Filmot](https://filmot.com/), which was instrumental in sourcing high-quality media-based Finnish data.

 pipeline_tag: text-to-speech
 library_name: pytorch
 model-index:
+- name: Chatterbox Finnish Fine-Tuned (Step 986)
   results:
   - task:
       type: text-to-speech
     metrics:
     - name: Word Error Rate (WER)
       type: wer
+      value: 2.76
       verified: true
     - name: Mean Opinion Score (MOS)
       type: mos
+      value: 4.34
 ---
 # Chatterbox Finnish Fine-Tuning: High-Fidelity Zero-Shot TTS
+This project fine-tunes the Chatterbox TTS model (based on the Llama architecture) for the Finnish language. By starting from a multilingual base and optimizing the inference context, we achieved strong zero-shot generalization to unseen Finnish speakers: 2.76% WER and a 4.34 MOS on out-of-distribution voices.
 ## 🚀 Performance Comparison (Zero-Shot OOD)
 The following metrics were calculated on **Out-of-Distribution (OOD)** speakers who were strictly excluded from the training and validation sets. This measures how well the model can speak Finnish in voices it has never heard before.
+| Metric | Baseline (Original Multilingual) | Fine-Tuned (Best Step: 986) | Improvement |
 | :--- | :---: | :---: | :---: |
+| **Avg Word Error Rate (WER)** | 28.94% | **2.76%** | **~10.5x Lower WER** |
+| **Mean Opinion Score (MOS)** | 2.29 / 5.0 | **4.34 / 5.0** | **+2.05 Quality Points** |
+*Note: MOS was estimated automatically using the Gemini 3 Flash API, and WER was calculated using Faster-Whisper Finnish Large v3. We read the 4.34 MOS as "Professional Grade" output that approaches natural human speech.*
 ---
 ## 🛠 Data Processing & Transparency
+We utilized a diverse Finnish dataset to teach the model the nuances of Finnish phonetics, including vowel length and gemination. The final training set consists of **16,604 samples**.
+### 1. Dataset Breakdown
+The dataset is a diverse mix of Finnish speech from the following sources:
+- **Mozilla Common Voice (cv-15)**: Primary source for diverse speaker profiles.
+- **Filmot**: Media-based Finnish for natural conversational flow.
+- **YouTube**: Modern spoken Finnish.
+- **Parliament**: Formal Finnish speech.
+### 2. Zero-Shot Integrity
+To keep the evaluation genuinely zero-shot, we strictly excluded the test speakers (`cv-15_11`, `cv-15_16`, `cv-15_2`) from the training loop. The 4.34 MOS therefore reflects the model's ability to generalize to Finnish voices it has never seen.
 ### 3. Traceability & Lineage
+Full attribution for the dataset is provided in `attribution.csv`. This file maps every training sample to its speaker ID and source, ensuring transparency and reproducibility.
 ## 💻 Hardware & Infrastructure
+This training was performed on the **Verda platform** using an **NVIDIA A100 80GB** instance. This high-VRAM instance allowed us to use optimal batch sizes and extended speech sequences (up to 1024 tokens) without memory constraints.
 ### .devcontainer Configuration
+We have included the `.devcontainer` directory to ensure a reproducible environment. It pre-installs all necessary CUDA-optimized libraries and sets up the environment for immediate experimentation.
 ---
 engine = ChatterboxTTS.from_local("./pretrained_models", device="cuda")
 # 2. Inject your best finetuned weights
+# (Best weights: best_finnish_multilingual_cp986.safetensors)
 # engine.t3.load_state_dict(...)
 # 3. Generate with Finnish-optimized parameters
 Based on our research, we identified the following settings as the most stable for Finnish phonetics:
 - `repetition_penalty`: 1.2
 - `temperature`: 0.8
+- **Prompt Window**: Increased to **3.0 seconds** during inference to capture the melodic cadence of Finnish sentences.
+- **Repetition Guard**: Increased to **10 tokens** in `AlignmentStreamAnalyzer` to allow for natural long Finnish vowels without premature audio cutoffs.
 ---
 - **Exploration Foundation**: Initial fine-tuning exploration was based on the [chatterbox-finetuning](https://github.com/gokhaneraslan/chatterbox-finetuning) toolkit by gokhaneraslan.
 - **Model Authors**: Deep thanks to the team at **ResembleAI** for releasing the [Chatterbox TTS model](https://huggingface.co/ResembleAI/chatterbox).
 - **Data Sourcing**: Special thanks to **#Jobik** at **Nordic AI** Discord for introducing [Filmot](https://filmot.com/), which was instrumental in sourcing high-quality media-based Finnish data.
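
The improvement column in the updated README table can be checked with a few lines of arithmetic. The sketch below uses only the WER and MOS values quoted in the table; the helper name `summarize_improvement` is illustrative and is not part of the repository.

```python
def summarize_improvement(baseline_wer, tuned_wer, baseline_mos, tuned_mos):
    """Return (WER reduction factor, relative WER reduction in %, MOS gain)."""
    factor = baseline_wer / tuned_wer
    relative_reduction = (baseline_wer - tuned_wer) / baseline_wer * 100.0
    mos_gain = tuned_mos - baseline_mos
    return factor, relative_reduction, mos_gain


# Values taken from the README table for checkpoint 986.
factor, rel, gain = summarize_improvement(28.94, 2.76, 2.29, 4.34)
print(f"WER is {factor:.1f}x lower ({rel:.1f}% relative reduction); MOS gain {gain:+.2f}")
```

Reporting the relative WER reduction alongside the ratio makes it clearer that the figure describes fewer transcription errors rather than an accuracy score.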

attribution.csv CHANGED

The diff for this file is too large to render.

generalization_comparison.png CHANGED

Git LFS Details

SHA256: 96f6714a0b1a32bf74a3808ac79a961dd9494d94787d36747478f0ca4bf1ff73
Pointer size: 131 Bytes
Size of remote file: 108 kB

inference_example.py CHANGED

@@ -1,44 +1,51 @@
 import torch
 import soundfile as sf
 from src.chatterbox_.tts import ChatterboxTTS
 from safetensors.torch import load_file
-# ==============================================================================
-# CONFIGURATION
-# ==============================================================================
-# Path to your preferred checkpoint (e.g., CP 795 for best accuracy)
-FINE_TUNED_WEIGHTS = "./models/best_accuracy_cp795.safetensors"
 # Text to synthesize
-TEXT = "Suomen kieli on poikkeuksellisen kaunista kuunneltavaa varsinkin hienosti lausuttuna."
-# Reference audio for voice cloning (3-10s recommended)
 REFERENCE_AUDIO = "./samples/reference_finnish.wav"
 # Output filename
-OUTPUT_FILE = "inference_output.wav"
-# ==============================================================================
 def main():
     device = "cuda" if torch.cuda.is_available() else "cpu"
-    # 1. Load the base engine
-    # Ensure you have run 'python setup.py' to download the base models first
-    print("Loading base engine...")
-    engine = ChatterboxTTS.from_local("./pretrained_models", device=device)
-    # 2. Inject the fine-tuned weights
-    print(f"Injecting fine-tuned weights from {FINE_TUNED_WEIGHTS}...")
-    checkpoint_state = load_file(FINE_TUNED_WEIGHTS)
-    # Strip "t3." prefix if present (added by the trainer wrapper)
-    t3_state_dict = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint_state.items()}
-    engine.t3.load_state_dict(t3_state_dict, strict=False)
-    engine.t3.eval()
-    # 3. Generate Finnish audio
-    print(f"Generating audio for text: '{TEXT[:50]}...'")
     wav_tensor = engine.generate(
         text=TEXT,
         audio_prompt_path=REFERENCE_AUDIO,
@@ -46,12 +53,11 @@ def main():
         temperature=0.8,
         exaggeration=0.6
     )
-    # 4. Save result
     wav_np = wav_tensor.squeeze().cpu().numpy()
     sf.write(OUTPUT_FILE, wav_np, engine.sr)
-    print(f"✓ Audio saved to {OUTPUT_FILE}")
 if __name__ == "__main__":
     main()

+import os
 import torch
 import soundfile as sf
 from src.chatterbox_.tts import ChatterboxTTS
 from safetensors.torch import load_file
+# --- CONFIGURABLE VARIABLES ---
+# Path to the directory containing base weights (ve.safetensors, etc.)
+MODEL_DIR = "./pretrained_models"
+# Path to our best finetuned T3 weights
+# In the upload package, this is usually in the 'models' directory
+FINETUNED_WEIGHTS = "./models/best_finnish_multilingual_cp986.safetensors"
 # Text to synthesize
+TEXT = "Tervetuloa kokeilemaan hienoviritettyä suomenkielistä Chatterbox-puhesynteesiä."
+# Reference audio for the speaker identity (Zero-shot)
 REFERENCE_AUDIO = "./samples/reference_finnish.wav"
 # Output filename
+OUTPUT_FILE = "output_finnish.wav"
+# ------------------------------
 def main():
     device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"Using device: {device}")
+    # 1. Load the base Chatterbox engine
+    print(f"Loading base model from {MODEL_DIR}...")
+    engine = ChatterboxTTS.from_local(MODEL_DIR, device=device)
+    # 2. Inject the finetuned weights
+    if os.path.exists(FINETUNED_WEIGHTS):
+        print(f"Loading finetuned weights from {FINETUNED_WEIGHTS}...")
+        checkpoint_state = load_file(FINETUNED_WEIGHTS)
+        # Strip "t3." prefix if present
+        t3_state_dict = {k[3:] if k.startswith("t3.") else k: v for k, v in checkpoint_state.items()}
+        # Load into the T3 component
+        engine.t3.load_state_dict(t3_state_dict, strict=False)
+    else:
+        print(f"Warning: Finetuned weights not found at {FINETUNED_WEIGHTS}. Using base weights.")
+    # 3. Generate Audio
+    print(f"Generating audio for: '{TEXT}'")
+    # Using optimized parameters for Finnish
     wav_tensor = engine.generate(
         text=TEXT,
         audio_prompt_path=REFERENCE_AUDIO,
         temperature=0.8,
         exaggeration=0.6
     )
+    # 4. Save the result
     wav_np = wav_tensor.squeeze().cpu().numpy()
     sf.write(OUTPUT_FILE, wav_np, engine.sr)
+    print(f"Successfully saved audio to {OUTPUT_FILE}")
 if __name__ == "__main__":
     main()
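
Because the script loads the weights with `strict=False`, keys that fail to match are skipped silently. The sketch below isolates the prefix-stripping step and adds a simple key comparison so that a mismatch is visible. It runs on plain dictionaries; the key names (`text_emb.weight` and so on) are made up for illustration and are not the real T3 parameter names.

```python
def strip_prefix(state_dict, prefix="t3."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def report_key_mismatch(loaded_keys, expected_keys):
    """Return (missing, unexpected) key lists, sorted for stable output."""
    loaded, expected = set(loaded_keys), set(expected_keys)
    return sorted(expected - loaded), sorted(loaded - expected)


# Toy stand-ins for tensors; hypothetical key names.
checkpoint = {"t3.text_emb.weight": 1, "t3.speech_head.bias": 2, "extra.scale": 3}
cleaned = strip_prefix(checkpoint)
missing, unexpected = report_key_mismatch(
    cleaned, ["text_emb.weight", "speech_head.bias", "speech_emb.weight"]
)
print("missing:", missing)
print("unexpected:", unexpected)
```

In the real script, the value returned by PyTorch's `load_state_dict` exposes `missing_keys` and `unexpected_keys`, so printing that return value would serve the same purpose without a separate helper.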

models/best_finnish_multilingual_cp986.safetensors ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:198cd1a7ab61ce28355e5e61a6687ee66b5d22982c808010f5f0e08c57d999de
+size 2143990656

samples/comparison/cv15_11_finetuned.wav CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7c095d7a386a0430e8c105cca160e35a5321536b95ed6d3336456f80d5d28695
-size 431084

 version https://git-lfs.github.com/spec/v1
+oid sha256:206e8e5111ba725d9c5df9e8ae2cdb8baacb1d249aae56c8cb6332a5bf717c51
+size 427244

samples/comparison/cv15_16_finetuned.wav CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:049e7435b69864d1f27df2a8a98f1b95d40eee7645fd3d03190512e9380d67b6
-size 358124

 version https://git-lfs.github.com/spec/v1
+oid sha256:2f8877351c3a2246948fb72942057d2e15087932520253224cecb8b90f90fd3f
+size 348524

samples/comparison/cv15_2_finetuned.wav CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fb8142ec3157d3d945e4896c215fca0e1031c520aaa03f8533f53b96f564eb8e
-size 423404

 version https://git-lfs.github.com/spec/v1
+oid sha256:771a4a71375698dbeb1e6858eab29eba319056c77401f2299594998e09c2b1c4
+size 388844

src/__pycache__/config.cpython-311.pyc CHANGED

Binary files a/src/__pycache__/config.cpython-311.pyc and b/src/__pycache__/config.cpython-311.pyc differ

src/__pycache__/dataset.cpython-311.pyc CHANGED

Binary files a/src/__pycache__/dataset.cpython-311.pyc and b/src/__pycache__/dataset.cpython-311.pyc differ

src/config.py CHANGED

@@ -4,21 +4,24 @@ from dataclasses import dataclass
 class TrainConfig:
     # --- Paths ---
     # Directory where setup.py downloaded the files
     model_dir: str = "./pretrained_models"
     # Path to your metadata CSV (Format: ID|RawText|NormText)
-    csv_path: str = "./chatterbox_midtune_cc_data_fill_17_8k/metadata.csv"
     # Directory containing WAV files
-    wav_dir: str = "./chatterbox_midtune_cc_data_fill_17_8k"
     # Attribution file for speaker-aware splitting
-    attribution_path: str = "./chatterbox_midtune_cc_data_fill_17_8k/attribution.csv"
-    preprocessed_dir = "./chatterbox_midtune_cc_data_fill_17_8k/preprocess"
     # Output directory for the finetuned model
-    output_dir: str = "./chatterbox_output"
     ljspeech = True # Set True if the dataset format is ljspeech, and False if it's file-based.
     preprocess = True # If you've already done preprocessing once, set it to false.
@@ -36,10 +39,10 @@ class TrainConfig:
     new_vocab_size: int = 52260 if is_turbo else 2454
     # --- Hyperparameters ---
-    batch_size: int = 32         # Adjust based on VRAM
-    grad_accum: int = 1          # Effective Batch Size = 64
-    learning_rate: float = 2e-5  # Low LR for stable finetuning
-    num_epochs: int = 4         # Run exactly 10 epochs
     weight_decay: float = 0.05   # Defensive weight decay
     # Training Strategy:
@@ -47,7 +50,7 @@ class TrainConfig:
     # Stage 2 (Later):   Single speaker voice clone -> 50-150 epochs, higher LR
     # --- Validation ---
-    validation_split: float = 0.05  # 10% of data for validation
     validation_seed: int = 42      # For reproducible train/val split
     # --- Constraints ---
@@ -57,5 +60,5 @@ class TrainConfig:
     start_text_token = 255
     stop_text_token = 0
     max_text_len: int = 256
-    max_speech_len: int = 850   # Truncates very long audio
     prompt_duration: float = 3.0 # Duration for the reference prompt (seconds)

 class TrainConfig:
     # --- Paths ---
     # Directory where setup.py downloaded the files
+    # Using the original pretrained_models directory which now contains the English-only base weights
     model_dir: str = "./pretrained_models"
     # Path to your metadata CSV (Format: ID|RawText|NormText)
+    csv_path: str = "./chatterbox_midtune_cc_data_16k/metadata.csv"
     # Directory containing WAV files
+    wav_dir: str = "./chatterbox_midtune_cc_data_16k"
     # Attribution file for speaker-aware splitting
+    attribution_path: str = "./chatterbox_midtune_cc_data_16k/attribution.csv"
+    preprocessed_dir = "./chatterbox_midtune_cc_data_16k/preprocess"
     # Output directory for the finetuned model
+    # Changed to differentiate from the English-only run
+    output_dir: str = "./chatterbox_output_multilingual"
     ljspeech = True # Set True if the dataset format is ljspeech, and False if it's file-based.
     preprocess = True # If you've already done preprocessing once, set it to false.
     new_vocab_size: int = 52260 if is_turbo else 2454
     # --- Hyperparameters ---
+    batch_size: int = 16         # Adjust based on VRAM
+    grad_accum: int = 2          # Effective batch size = batch_size x grad_accum = 32 per device
+    learning_rate: float = 2e-5  # Research-optimized LR with warmup
+    num_epochs: int = 5         # Run exactly 5 epochs
     weight_decay: float = 0.05   # Defensive weight decay
     # Training Strategy:
     # Stage 2 (Later):   Single speaker voice clone -> 50-150 epochs, higher LR
     # --- Validation ---
+    validation_split: float = 0.05  # 5% of data for validation
     validation_seed: int = 42      # For reproducible train/val split
     # --- Constraints ---
     start_text_token = 255
     stop_text_token = 0
     max_text_len: int = 256
+    max_speech_len: int = 1024   # Truncates very long audio
     prompt_duration: float = 3.0 # Duration for the reference prompt (seconds)
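
The hyperparameter change trades a smaller per-device batch for gradient accumulation. A minimal sketch of the arithmetic follows, assuming a single GPU; the function names and the sample count of 16,000 are illustrative only, and the step count is an approximation rather than the exact figure a given `Trainer` version would report.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_devices


def optimizer_steps_per_epoch(num_train_samples, per_device_batch, grad_accum, num_devices=1):
    """Approximate optimizer steps per epoch (rounds the last partial batch up)."""
    return math.ceil(
        num_train_samples / effective_batch_size(per_device_batch, grad_accum, num_devices)
    )


old = effective_batch_size(per_device_batch=32, grad_accum=1)
new = effective_batch_size(per_device_batch=16, grad_accum=2)
print(f"old effective batch: {old}, new effective batch: {new}")

# Hypothetical training-set size, used only to illustrate the step estimate.
print(optimizer_steps_per_epoch(16_000, per_device_batch=16, grad_accum=2))
```

Both configurations give the same effective batch size on one device, so the change mainly lowers peak memory per forward pass, which is consistent with raising `max_speech_len` to 1024.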

src/dataset.py CHANGED

@@ -127,23 +127,41 @@ class ChatterboxDataset(Dataset):
             all_available_speakers = sorted(list(speaker_to_files.keys()))
             if split in ["train", "val"]:
-                # Split speakers instead of files
-                random.seed(config.validation_seed)
-                random.shuffle(all_available_speakers)
-                n_val_spk = max(1, int(len(all_available_speakers) * config.validation_split))
-                val_speakers = set(all_available_speakers[-n_val_spk:])
-                train_speakers = set(all_available_speakers[:-n_val_spk])
-                self.files = []
-                if split == "train":
-                    for spk_id in train_speakers:
-                        self.files.extend(speaker_to_files[spk_id])
-                    logger.info(f"Training dataset: {len(self.files)} files from {len(train_speakers)} speakers.")
-                else: # val
-                    for spk_id in val_speakers:
-                        self.files.extend(speaker_to_files[spk_id])
-                    logger.info(f"Validation dataset: {len(self.files)} files from {len(val_speakers)} speakers.")
             else: # all
                 self.files = []
                 for spk_id in all_available_speakers:

             all_available_speakers = sorted(list(speaker_to_files.keys()))
             if split in ["train", "val"]:
+                # If we only have one speaker, we MUST split at the file level instead of the speaker level
+                if len(all_available_speakers) <= 1:
+                    logger.info("Only one speaker detected. Splitting at file level.")
+                    all_files_to_split = []
+                    for spk_id in all_available_speakers:
+                        all_files_to_split.extend(speaker_to_files[spk_id])
+                    random.seed(config.validation_seed)
+                    random.shuffle(all_files_to_split)
+                    n_val = max(1, int(len(all_files_to_split) * config.validation_split))
+                    if split == "train":
+                        self.files = all_files_to_split[:-n_val]
+                        logger.info(f"Training dataset: {len(self.files)} files (Single Speaker Mode).")
+                    else: # val
+                        self.files = all_files_to_split[-n_val:]
+                        logger.info(f"Validation dataset: {len(self.files)} files (Single Speaker Mode).")
+                else:
+                    # Split speakers instead of files
+                    random.seed(config.validation_seed)
+                    random.shuffle(all_available_speakers)
+                    n_val_spk = max(1, int(len(all_available_speakers) * config.validation_split))
+                    val_speakers = set(all_available_speakers[-n_val_spk:])
+                    train_speakers = set(all_available_speakers[:-n_val_spk])
+                    self.files = []
+                    if split == "train":
+                        for spk_id in train_speakers:
+                            self.files.extend(speaker_to_files[spk_id])
+                        logger.info(f"Training dataset: {len(self.files)} files from {len(train_speakers)} speakers.")
+                    else: # val
+                        for spk_id in val_speakers:
+                            self.files.extend(speaker_to_files[spk_id])
+                        logger.info(f"Validation dataset: {len(self.files)} files from {len(val_speakers)} speakers.")
             else: # all
                 self.files = []
                 for spk_id in all_available_speakers:
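
The splitting logic in this hunk can be sketched as a standalone function: split by speaker when several speakers exist, and fall back to a file-level split when there is only one. This is an illustrative reimplementation, not the repository's code. It uses a local `random.Random(seed)` instead of seeding the global generator, and the name `split_files` is hypothetical.

```python
import random


def split_files(speaker_to_files, val_fraction=0.05, seed=42):
    """Return (train_files, val_files) with no speaker shared across the two sets
    whenever more than one speaker is available."""
    speakers = sorted(speaker_to_files)
    rng = random.Random(seed)

    if len(speakers) <= 1:
        # Single-speaker fallback: a speaker-level split is impossible, so split files.
        files = [f for spk in speakers for f in speaker_to_files[spk]]
        rng.shuffle(files)
        n_val = max(1, int(len(files) * val_fraction))
        return files[:-n_val], files[-n_val:]

    # Multi-speaker case: hold out whole speakers for validation.
    rng.shuffle(speakers)
    n_val_spk = max(1, int(len(speakers) * val_fraction))
    val_speakers = speakers[-n_val_spk:]
    train_speakers = speakers[:-n_val_spk]
    train = [f for spk in train_speakers for f in speaker_to_files[spk]]
    val = [f for spk in val_speakers for f in speaker_to_files[spk]]
    return train, val


# Toy data: three speakers with two clips each.
toy = {f"spk{i}": [f"spk{i}_{j}.wav" for j in range(2)] for i in range(3)}
train_files, val_files = split_files(toy)
print(len(train_files), len(val_files))
```

Because the speaker list is sorted before shuffling and the seed is fixed, separate calls for the train and validation splits produce the same partition, which is the property the dataset class relies on when it is instantiated twice.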

train.py CHANGED

@@ -1,213 +1,215 @@
-import os
-import sys
-import torch
-from transformers import Trainer, TrainingArguments, EarlyStoppingCallback, TrainerCallback
-from safetensors.torch import save_file
-class ChatterboxTrainer(Trainer):
-    """Custom Trainer to log sub-losses for both train and eval."""
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self._eval_loss_text = []
-        self._eval_loss_speech = []
-    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
-        outputs = model(**inputs)
-        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
-        if isinstance(outputs, dict):
-            if model.training:
-                if self.state.global_step % self.args.logging_steps == 0:
-                    if "loss_text" in outputs:
-                        self.log({"loss_text": outputs["loss_text"].item()})
-                    if "loss_speech" in outputs:
-                        self.log({"loss_speech": outputs["loss_speech"].item()})
-            else:
-                if "loss_text" in outputs:
-                    self._eval_loss_text.append(outputs["loss_text"].item())
-                if "loss_speech" in outputs:
-                    self._eval_loss_speech.append(outputs["loss_speech"].item())
-        return (loss, outputs) if return_outputs else loss
-    def evaluation_loop(self, *args, **kwargs):
-        self._eval_loss_text = []
-        self._eval_loss_speech = []
-        output = super().evaluation_loop(*args, **kwargs)
-        if self._eval_loss_text:
-            output.metrics["eval_loss_text"] = sum(self._eval_loss_text) / len(self._eval_loss_text)
-        if self._eval_loss_speech:
-            output.metrics["eval_loss_speech"] = sum(self._eval_loss_speech) / len(self._eval_loss_speech)
-        return output
-# Internal Modules
-from src.config import TrainConfig
-from src.dataset import ChatterboxDataset, data_collator
-from src.model import resize_and_load_t3_weights, ChatterboxTrainerWrapper
-from src.preprocess_ljspeech import preprocess_dataset_ljspeech
-from src.preprocess_file_based import preprocess_dataset_file_based
-from src.utils import setup_logger, check_pretrained_models
-# Chatterbox Imports
-from src.chatterbox_.tts import ChatterboxTTS
-from src.chatterbox_.tts_turbo import ChatterboxTurboTTS
-from src.chatterbox_.models.t3.t3 import T3
-os.environ["TOKENIZERS_PARALLELISM"] = "false"
-os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY_HERE"
-os.environ["WANDB_PROJECT"] = "chatterbox-finetuning"
-logger = setup_logger("ChatterboxFinetune")
-def main():
-    cfg = TrainConfig()
-    logger.info("--- Starting Chatterbox Finetuning ---")
-    logger.info(f"Mode: {'CHATTERBOX-TURBO' if cfg.is_turbo else 'CHATTERBOX-TTS'}")
-    # 0. CHECK MODEL FILES
-    mode_check = "chatterbox_turbo" if cfg.is_turbo else "chatterbox"
-    if not check_pretrained_models(mode=mode_check):
-        sys.exit(1)
-    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-    # 1. SELECT THE CORRECT ENGINE CLASS
-    if cfg.is_turbo:
-        EngineClass = ChatterboxTurboTTS
-    else:
-        EngineClass = ChatterboxTTS
-    logger.info(f"Device: {device}")
-    logger.info(f"Model Directory: {cfg.model_dir}")
-    # 2. LOAD ORIGINAL MODEL TEMPORARILY
-    logger.info("Loading original model to extract weights...")
-    # Loading on CPU first to save VRAM
-    tts_engine_original = EngineClass.from_local(cfg.model_dir, device="cpu")
-    pretrained_t3_state_dict = tts_engine_original.t3.state_dict()
-    original_t3_config = tts_engine_original.t3.hp
-    # 3. CREATE NEW T3 MODEL WITH NEW VOCAB SIZE
-    logger.info(f"Creating new T3 model with vocab size: {cfg.new_vocab_size}")
-    new_t3_config = original_t3_config
-    new_t3_config.text_tokens_dict_size = cfg.new_vocab_size
-    # We prevent caching during training.
-    if hasattr(new_t3_config, "use_cache"):
-        new_t3_config.use_cache = False
-    else:
-        setattr(new_t3_config, "use_cache", False)
-    new_t3_model = T3(hp=new_t3_config)
-    # 4. TRANSFER WEIGHTS
-    logger.info("Transferring weights...")
-    new_t3_model = resize_and_load_t3_weights(new_t3_model, pretrained_t3_state_dict)
-    # --- SPECIAL SETTING FOR TURBO ---
-    if cfg.is_turbo:
-        logger.info("Turbo Mode: Removing backbone WTE layer...")
-        if hasattr(new_t3_model.tfmr, "wte"):
-            del new_t3_model.tfmr.wte
-    # Clean up memory
-    del tts_engine_original
-    del pretrained_t3_state_dict
-    # 5. PREPARE ENGINE FOR TRAINING
-    # Reload engine components (VoiceEncoder, S3Gen) but inject our new T3
-    tts_engine_new = EngineClass.from_local(cfg.model_dir, device="cpu")
-    tts_engine_new.t3 = new_t3_model
-    # Freeze other components
-    logger.info("Freezing S3Gen and VoiceEncoder...")
-    for param in tts_engine_new.ve.parameters():
-        param.requires_grad = False
-    for param in tts_engine_new.s3gen.parameters():
-        param.requires_grad = False
-    # Enable Training for T3
-    tts_engine_new.t3.train()
-    for param in tts_engine_new.t3.parameters():
-        param.requires_grad = True
-    if cfg.preprocess:
-        logger.info("Initializing Preprocess dataset...")
-        if cfg.ljspeech:
-            preprocess_dataset_ljspeech(cfg, tts_engine_new)
-        else:
-            preprocess_dataset_file_based(cfg, tts_engine_new)
-    else:
-        logger.info("Skipping the preprocessing dataset step...")
-    # 6. DATASET & WRAPPER
-    logger.info("Initializing Datasets...")
-    train_ds = ChatterboxDataset(cfg, split="train")
-    val_ds = ChatterboxDataset(cfg, split="val")
-    model_wrapper = ChatterboxTrainerWrapper(tts_engine_new.t3)
-    # 7. TRAINING ARGUMENTS
-    training_args = TrainingArguments(
-        output_dir=cfg.output_dir,
-        per_device_train_batch_size=cfg.batch_size,
-        gradient_accumulation_steps=cfg.grad_accum,
-        learning_rate=cfg.learning_rate,
-        weight_decay=cfg.weight_decay, # Added weight decay
-        num_train_epochs=cfg.num_epochs,
-        evaluation_strategy="epoch",  # Evaluate every epoch instead of steps
-        save_strategy="epoch",        # Save every epoch
-        logging_strategy="steps",
-        logging_steps=10,
-        remove_unused_columns=False, # Required for our custom wrapper
-        dataloader_num_workers=16,
-        report_to=["wandb"],
-        fp16=True if torch.cuda.is_available() else False,
-        save_total_limit=10,          # Keep all 10 epoch checkpoints
-        gradient_checkpointing=True, # This setting theoretically reduces VRAM usage by 60%.
-        label_names=["speech_tokens", "text_tokens"],
-        load_best_model_at_end=True, # We want to run exactly 10 epochs
-    )
-    trainer = ChatterboxTrainer(
-        model=model_wrapper,
-        args=training_args,
-        train_dataset=train_ds,
-        eval_dataset=val_ds,
-        data_collator=data_collator,
-        callbacks=[]                  # Removed EarlyStopping
-    )
-    logger.info("Running initial evaluation to establish baseline...")
-    trainer.evaluate()
-    logger.info("Starting Training Loop...")
-    trainer.train()
-    # 8. SAVE FINAL MODEL
-    logger.info("Training complete. Saving model...")
-    os.makedirs(cfg.output_dir, exist_ok=True)
-    filename = "t3_turbo_finetuned.safetensors" if cfg.is_turbo else "t3_finetuned.safetensors"
-    final_model_path = os.path.join(cfg.output_dir, filename)
-    save_file(tts_engine_new.t3.state_dict(), final_model_path)
-    logger.info(f"Model saved to: {final_model_path}")
-if __name__ == "__main__":
-    main()

+import os
+import sys
+import torch
+from transformers import Trainer, TrainingArguments, EarlyStoppingCallback, TrainerCallback
+from safetensors.torch import save_file
+class ChatterboxTrainer(Trainer):
+    """Custom Trainer to log sub-losses for both train and eval."""
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._eval_loss_text = []
+        self._eval_loss_speech = []
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        outputs = model(**inputs)
+        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
+        if isinstance(outputs, dict):
+            if model.training:
+                if self.state.global_step % self.args.logging_steps == 0:
+                    if "loss_text" in outputs:
+                        self.log({"loss_text": outputs["loss_text"].item()})
+                    if "loss_speech" in outputs:
+                        self.log({"loss_speech": outputs["loss_speech"].item()})
+            else:
+                if "loss_text" in outputs:
+                    self._eval_loss_text.append(outputs["loss_text"].item())
+                if "loss_speech" in outputs:
+                    self._eval_loss_speech.append(outputs["loss_speech"].item())
+        return (loss, outputs) if return_outputs else loss
+    def evaluation_loop(self, *args, **kwargs):
+        self._eval_loss_text = []
+        self._eval_loss_speech = []
+        output = super().evaluation_loop(*args, **kwargs)
+        if self._eval_loss_text:
+            output.metrics["eval_loss_text"] = sum(self._eval_loss_text) / len(self._eval_loss_text)
+        if self._eval_loss_speech:
+            output.metrics["eval_loss_speech"] = sum(self._eval_loss_speech) / len(self._eval_loss_speech)
+        return output
+# Internal Modules
+from src.config import TrainConfig
+from src.dataset import ChatterboxDataset, data_collator
+from src.model import resize_and_load_t3_weights, ChatterboxTrainerWrapper
+from src.preprocess_ljspeech import preprocess_dataset_ljspeech
+from src.preprocess_file_based import preprocess_dataset_file_based
+from src.utils import setup_logger, check_pretrained_models
+# Chatterbox Imports
+from src.chatterbox_.tts import ChatterboxTTS
+from src.chatterbox_.tts_turbo import ChatterboxTurboTTS
+from src.chatterbox_.models.t3.t3 import T3
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+os.environ["WANDB_API_KEY"] = "INSERT_API_KEY_HERE"
+os.environ["WANDB_PROJECT"] = "chatterbox-finetuning"
+logger = setup_logger("ChatterboxFinetune")
+def main():
+    cfg = TrainConfig()
+    logger.info("--- Starting Chatterbox Finetuning ---")
+    logger.info(f"Mode: {'CHATTERBOX-TURBO' if cfg.is_turbo else 'CHATTERBOX-TTS'}")
+    # 0. CHECK MODEL FILES
+    mode_check = "chatterbox_turbo" if cfg.is_turbo else "chatterbox"
+    if not check_pretrained_models(mode=mode_check):
+        sys.exit(1)
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    # 1. SELECT THE CORRECT ENGINE CLASS
+    if cfg.is_turbo:
+        EngineClass = ChatterboxTurboTTS
+    else:
+        EngineClass = ChatterboxTTS
+    logger.info(f"Device: {device}")
+    logger.info(f"Model Directory: {cfg.model_dir}")
+    # 2. LOAD ORIGINAL MODEL TEMPORARILY
+    logger.info("Loading original model to extract weights...")
+    # Loading on CPU first to save VRAM
+    tts_engine_original = EngineClass.from_local(cfg.model_dir, device="cpu")
+    pretrained_t3_state_dict = tts_engine_original.t3.state_dict()
+    original_t3_config = tts_engine_original.t3.hp
+    # 3. CREATE NEW T3 MODEL WITH NEW VOCAB SIZE
+    logger.info(f"Creating new T3 model with vocab size: {cfg.new_vocab_size}")
+    new_t3_config = original_t3_config
+    new_t3_config.text_tokens_dict_size = cfg.new_vocab_size
+    # We prevent caching during training.
+    new_t3_config.use_cache = False
+    new_t3_model = T3(hp=new_t3_config)
+    # 4. TRANSFER WEIGHTS
+    logger.info("Transferring weights...")
+    new_t3_model = resize_and_load_t3_weights(new_t3_model, pretrained_t3_state_dict)
+    # --- SPECIAL SETTING FOR TURBO ---
+    if cfg.is_turbo:
+        logger.info("Turbo Mode: Removing backbone WTE layer...")
+        if hasattr(new_t3_model.tfmr, "wte"):
+            del new_t3_model.tfmr.wte
+    # Clean up memory
+    del tts_engine_original
+    del pretrained_t3_state_dict
+    # 5. PREPARE ENGINE FOR TRAINING
+    # Reload engine components (VoiceEncoder, S3Gen) but inject our new T3
+    tts_engine_new = EngineClass.from_local(cfg.model_dir, device="cpu")
+    tts_engine_new.t3 = new_t3_model
+    # Freeze other components
+    logger.info("Freezing S3Gen and VoiceEncoder...")
+    for param in tts_engine_new.ve.parameters():
+        param.requires_grad = False
+    for param in tts_engine_new.s3gen.parameters():
+        param.requires_grad = False
+    # Enable Training for T3
+    tts_engine_new.t3.train()
+    for param in tts_engine_new.t3.parameters():
+        param.requires_grad = True
+    if cfg.preprocess:
+        logger.info("Initializing Preprocess dataset...")
+        if cfg.ljspeech:
+            preprocess_dataset_ljspeech(cfg, tts_engine_new)
+        else:
+            preprocess_dataset_file_based(cfg, tts_engine_new)
+    else:
+        logger.info("Skipping the preprocessing dataset step...")
+    # 6. DATASET & WRAPPER
+    logger.info("Initializing Datasets...")
+    train_ds = ChatterboxDataset(cfg, split="train")
+    val_ds = ChatterboxDataset(cfg, split="val")
+    model_wrapper = ChatterboxTrainerWrapper(tts_engine_new.t3)
+    # 7. TRAINING ARGUMENTS
+    training_args = TrainingArguments(
+        output_dir=cfg.output_dir,
+        per_device_train_batch_size=cfg.batch_size,
+        gradient_accumulation_steps=cfg.grad_accum,
+        learning_rate=cfg.learning_rate,
+        weight_decay=cfg.weight_decay, # Added weight decay
+        num_train_epochs=cfg.num_epochs,
+        evaluation_strategy="epoch",
+        save_strategy="epoch",
+        logging_strategy="steps",
+        logging_steps=10,
+        remove_unused_columns=False, # Required for our custom wrapper
+        dataloader_num_workers=16,
+        report_to=["wandb"],
+        bf16=torch.cuda.is_available(), # bf16 mixed precision, intended for A100
+        save_total_limit=5,           # Keep at most 5 checkpoints on disk
+        gradient_checkpointing=False, # Disabled; enabling it would trade extra compute for lower VRAM usage
+        label_names=["speech_tokens", "text_tokens"],
+        load_best_model_at_end=True,
+        lr_scheduler_type="cosine",    # Cosine decay of the learning rate after warmup
+        warmup_ratio=0.1,              # Warm up over the first 10% of steps to ease the English-to-Finnish transition
+    )
+    trainer = ChatterboxTrainer(
+        model=model_wrapper,
+        args=training_args,
+        train_dataset=train_ds,
+        eval_dataset=val_ds,
+        data_collator=data_collator,
+        callbacks=[]                  # Removed EarlyStopping
+    )
+    logger.info("Running initial evaluation to establish baseline...")
+    trainer.evaluate()
+    logger.info("Starting Training Loop...")
+    trainer.train()
+    # 8. SAVE FINAL MODEL
+    logger.info("Training complete. Saving model...")
+    os.makedirs(cfg.output_dir, exist_ok=True)
+    filename = "t3_turbo_finetuned.safetensors" if cfg.is_turbo else "t3_finetuned.safetensors"
+    final_model_path = os.path.join(cfg.output_dir, filename)
+    save_file(tts_engine_new.t3.state_dict(), final_model_path)
+    logger.info(f"Model saved to: {final_model_path}")
+if __name__ == "__main__":
+    main()
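
The `lr_scheduler_type="cosine"` and `warmup_ratio=0.1` arguments added to `TrainingArguments` in this diff combine a linear warmup with a cosine decay. As a rough illustration of the shape this produces, here is a minimal sketch of the learning-rate multiplier, assuming the usual half-cosine form that decays to zero. The function name `lr_multiplier`, the rounding of the warmup length, and the step counts are illustrative assumptions, not part of the training script.

```python
import math


def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Illustrative multiplier applied to the base learning rate at a given step."""
    # Assumption of this sketch: the warmup length is rounded up to a whole step.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from 1 down to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, round(lr_multiplier(s, total), 4))
```

Multiplying `cfg.learning_rate` by this value gives the approximate learning rate at each optimizer step. The peak is reached once the warmup phase ends, and the rate falls smoothly toward zero by the final step.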