Chatterbox TTS — Bangla Fine-tuned

A fine-tuned version of Resemble AI's Chatterbox TTS adapted for Bangla (Bengali) speech synthesis with zero-shot voice cloning.

Base model                ResembleAI/chatterbox
Fine-tuned component      T3 (text-to-token transformer)
Language                  Bangla (বাংলা)
Training steps            ~888,000
Training hardware         H100 80GB via Modal
Best quality checkpoint   ~456,000 steps

Table of Contents

  1. Model Architecture
  2. Quick Inference
  3. Full Training Pipeline
  4. Hyperparameter Reference
  5. Troubleshooting

Model Architecture

Chatterbox TTS has three components. Only T3 is fine-tuned — the rest are frozen:

Text  ──►  T3 (fine-tuned) ──► Speech Tokens ──► S3Gen (frozen) ──► Waveform
                ▲
         VoiceEncoder (frozen)
                ▲
         Reference Audio
Component           Role                                          Fine-tuned?
T3                  LLM-style text → speech token prediction      ✅ Yes
S3Gen               Speech tokens → mel spectrogram → waveform    ❌ Frozen
VoiceEncoder (VE)   Encodes reference speaker audio               ❌ Frozen

This approach is fast, stable, and avoids catastrophic forgetting of the speech decoder.


Quick Inference

Installation

pip install chatterbox-tts==0.1.2 safetensors soundfile silero-vad==6.2.0 huggingface_hub

Download model files

from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="EMTIAZZ/chatterbox-bangla-tts",
    filename="t3_bangla_888k.safetensors"
)
tokenizer_path = hf_hub_download(
    repo_id="EMTIAZZ/chatterbox-bangla-tts",
    filename="tokenizer.json"
)

Download base Chatterbox pretrained files

The base model files (S3Gen, VoiceEncoder, etc.) are from ResembleAI and must be downloaded separately:

import os, requests
from tqdm import tqdm

DEST_DIR = "./pretrained_models"
os.makedirs(DEST_DIR, exist_ok=True)

BASE_FILES = {
    "ve.safetensors":    "https://huggingface.co/ResembleAI/chatterbox/resolve/main/ve.safetensors?download=true",
    "t3_cfg.safetensors":"https://huggingface.co/ResembleAI/chatterbox/resolve/main/t3_cfg.safetensors?download=true",
    "s3gen.safetensors": "https://huggingface.co/ResembleAI/chatterbox/resolve/main/s3gen.safetensors?download=true",
    "conds.pt":          "https://huggingface.co/ResembleAI/chatterbox/resolve/main/conds.pt?download=true",
}

for fname, url in BASE_FILES.items():
    dest = os.path.join(DEST_DIR, fname)
    if not os.path.exists(dest):
        r = requests.get(url, stream=True)
        r.raise_for_status()  # fail loudly instead of writing an HTML error page to disk
        total = int(r.headers.get("content-length", 0))
        with open(dest, "wb") as f, tqdm(total=total, unit="B", unit_scale=True, desc=fname) as bar:
            for chunk in r.iter_content(1024 * 1024):
                f.write(chunk)
                bar.update(len(chunk))
        print(f"Downloaded {fname}")

# Copy the Bangla tokenizer into pretrained_models
import shutil
shutil.copy(tokenizer_path, os.path.join(DEST_DIR, "tokenizer.json"))
print("Tokenizer ready")

Run inference

import torch
import soundfile as sf
import numpy as np
from safetensors.torch import load_file
from chatterbox.tts import ChatterboxTTS
from chatterbox.models.t3.t3 import T3

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BASE_MODEL_DIR = "./pretrained_models"
FINETUNED_WEIGHTS = weights_path      # from hf_hub_download
AUDIO_PROMPT = "./your_reference.wav" # 3–6 sec clean Bangla speech
NEW_VOCAB_SIZE = 4240

# Load base engine
tts = ChatterboxTTS.from_local(BASE_MODEL_DIR, device="cpu")

# Rebuild T3 with extended Bangla vocab
t3_cfg = tts.t3.hp
t3_cfg.text_tokens_dict_size = NEW_VOCAB_SIZE
new_t3 = T3(hp=t3_cfg)

# Load fine-tuned weights (strip HF Trainer wrapper prefix if present)
state_dict = load_file(FINETUNED_WEIGHTS, device="cpu")
if any(k.startswith("t3.") for k in state_dict):
    state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
new_t3.load_state_dict(state_dict, strict=True)

# Swap T3 into engine and move to device
tts.t3 = new_t3
tts.t3.to(DEVICE).eval()
tts.s3gen.to(DEVICE).eval()
tts.ve.to(DEVICE).eval()
tts.device = DEVICE

# Generate speech
text = "আমাদের গ্রাহক সেবায় আপনাকে স্বাগতম। আপনার যেকোনো সমস্যায় আমরা সাহায্য করতে প্রস্তুত।"

wav = tts.generate(
    text=text,
    audio_prompt_path=AUDIO_PROMPT,
    temperature=0.3,
    exaggeration=0.5,
    cfg_weight=0.5,
    repetition_penalty=1.2,
    min_new_tokens=150,
)

sf.write("output.wav", wav.squeeze().cpu().numpy(), tts.sr)
print("Saved to output.wav")

Splitting long text into sentences (better quality)

import re, numpy as np, soundfile as sf

def synthesize_long_text(tts, text, audio_prompt, output_path, **params):
    # Split on sentence boundaries including Bengali danda (।)
    sentences = re.split(r'(?<=[.?!।])\s*', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    sr = tts.sr

    for i, sent in enumerate(sentences):
        print(f"[{i+1}/{len(sentences)}] {sent}")
        wav = tts.generate(text=sent, audio_prompt_path=audio_prompt, **params)
        chunks.append(wav.squeeze().cpu().numpy())
        if i < len(sentences) - 1:
            chunks.append(np.zeros(int(sr * 0.25), dtype=np.float32))  # 250ms pause between sentences

    final = np.concatenate(chunks)
    sf.write(output_path, final, sr)
    print(f"Saved to {output_path}")

synthesize_long_text(
    tts,
    text="আজকের আবহাওয়া বেশ সুন্দর। আমি বাজার থেকে তাজা সবজি কিনে এনেছি। রাতের খাবারে ভাত আর মাছের তরকারি রান্না হবে।",
    audio_prompt=AUDIO_PROMPT,
    output_path="output_long.wav",
    temperature=0.3,
    exaggeration=0.5,
    cfg_weight=0.5,
    repetition_penalty=1.2,
    min_new_tokens=150,
)

Reference audio tips:

  • Use a 3–6 second clean Bangla speech clip from the target speaker
  • Mono WAV, no background noise or music
  • The model clones the voice style from this reference — quality depends heavily on it
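The checklist above can be verified programmatically before generation. A minimal sketch using only the standard-library wave module (the helper name check_reference is ours, not part of Chatterbox):

```python
import wave

def check_reference(path, min_sec=3.0, max_sec=6.0):
    """Validate a reference clip against the tips above: mono WAV, 3-6 s long."""
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        duration = wf.getnframes() / wf.getframerate()
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not (min_sec <= duration <= max_sec):
        problems.append(f"duration {duration:.1f}s outside {min_sec}-{max_sec}s")
    return problems  # empty list = clip looks usable

# issues = check_reference("./your_reference.wav")
# print(issues or "Reference clip looks OK")
```

Noise and music cannot be caught this cheaply; those still need a listen.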

Full Training Pipeline

This section covers how to reproduce this fine-tuned model from scratch, or adapt it further for your own Bangla dataset.

Step 1: Clone & Install

git clone https://github.com/EMTIAZZ/chatterbox-bangla-tts
cd chatterbox-bangla-tts

pip install -r requirements.txt

requirements.txt:

peft==0.17.1
torch==2.6.0
torchaudio==2.6.0
chatterbox-tts==0.1.2
silero-vad==6.2.0
librosa==0.11.0
soundfile==0.13.1
num2words
pandas
safetensors
tensorboard
omegaconf
pyloudnorm
huggingface_hub
hf_transfer

Step 2: Download Base Pretrained Models

python setup.py

This downloads the following files from ResembleAI into ./pretrained_models/:

File                 Description
ve.safetensors       VoiceEncoder weights
t3_cfg.safetensors   T3 model weights (base)
s3gen.safetensors    S3Gen decoder weights
conds.pt             Conditioning tensors
tokenizer.json       Base grapheme tokenizer

Step 3: Prepare Your Dataset

Create an LJSpeech-format dataset:

MyTTSDataset/
├── metadata.csv
└── wavs/
    ├── bn_001.wav
    ├── bn_002.wav
    └── ...

metadata.csv — pipe-separated, no header:

bn_001|আমি বাংলায় কথা বলছি।|আমি বাংলায় কথা বলছি।
bn_002|আজকের আবহাওয়া সুন্দর।|আজকের আবহাওয়া সুন্দর।
bn_003|আপনাকে স্বাগতম।|আপনাকে স্বাগতম।

Format: ID|RawText|NormalizedText

Audio requirements:

  • Format: WAV, mono
  • Sample rate: 22050 Hz or 24000 Hz (auto-resampled during preprocessing)
  • Duration per clip: 2–12 seconds (shorter clips train better)
  • Clean speech, no background noise or music
  • Recommended dataset size: 2–10 hours
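Before preprocessing, it is worth sanity-checking the layout described above. A minimal sketch (the helper name validate_dataset is ours) that verifies each row has three pipe-separated fields and a matching WAV:

```python
import os

def validate_dataset(metadata_csv, wav_dir):
    """Check each metadata row: ID|RawText|NormalizedText, with ID.wav in wav_dir."""
    errors = []
    with open(metadata_csv, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("|")
            if len(parts) != 3:
                errors.append(f"line {lineno}: expected 3 fields, got {len(parts)}")
                continue
            wav_path = os.path.join(wav_dir, parts[0] + ".wav")
            if not os.path.isfile(wav_path):
                errors.append(f"line {lineno}: missing {wav_path}")
    return errors

# errs = validate_dataset("MyTTSDataset/metadata.csv", "MyTTSDataset/wavs")
# print(errs or "Dataset layout OK")
```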

Recommended: normalize your audio first:

# Normalize loudness to -23 LUFS using ffmpeg
for f in MyTTSDataset/wavs/*.wav; do
    ffmpeg -i "$f" -af loudnorm=I=-23:LRA=7:TP=-2 "${f%.wav}_norm.wav" -y
done

Step 4: Bangla Tokenizer Adaptation

The base Chatterbox tokenizer does not contain Bangla Unicode characters. This step extends it.

You need an XTTS vocab.json that already contains Bangla tokens. You can get one from a pre-existing XTTS Bangla model, or use the one in this repo.

# Place your XTTS vocab.json at ./xtts_vocab.json, then run:
python add_bangla_tokens.py

What this script does:

  1. Loads the existing Chatterbox tokenizer.json
  2. Extracts all Bangla Unicode characters (U+0980–U+09FF) and BPE subwords from xtts_vocab.json
  3. Appends them to the Chatterbox tokenizer with new sequential IDs
  4. Adds the Bengali danda (।) as a punctuation token
  5. Adds BPE merge rules for Bengali subwords
  6. Saves the extended tokenizer back to pretrained_models/tokenizer.json

Output:

SUCCESS! Added 1240 new tokens
New vocab size: 4240
*** IMPORTANT: Update new_vocab_size in src/config.py to: 4240 ***

Update src/config.py with the printed vocab size:

new_vocab_size: int = 4240  # <- update this to match the printed value
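The core idea of the extension step can be sketched in a few lines — a simplified illustration of appending missing characters with sequential IDs to a tokenizers-style vocab dict, not the actual add_bangla_tokens.py (which also handles BPE merges and subwords):

```python
import json

def extend_vocab(tokenizer_json_path, new_chars, out_path):
    """Append characters missing from the vocab, assigning sequential new IDs."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        tok = json.load(f)
    vocab = tok["model"]["vocab"]        # token -> id mapping
    next_id = max(vocab.values()) + 1
    added = 0
    for ch in new_chars:
        if ch not in vocab:
            vocab[ch] = next_id
            next_id += 1
            added += 1
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(tok, f, ensure_ascii=False)
    return added, len(vocab)

# Bangla Unicode block U+0980-U+09FF, plus the danda (U+0964):
bangla_chars = [chr(c) for c in range(0x0980, 0x0A00)] + ["।"]
```

New rows in the T3 text-embedding matrix are what these IDs index into; that is why new_vocab_size must match exactly.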

Step 5: Configure Training

Edit src/config.py:

from dataclasses import dataclass

@dataclass
class TrainConfig:
    # --- Paths ---
    model_dir:        str = "./pretrained_models"
    csv_path:         str = "./MyTTSDataset/metadata.csv"
    wav_dir:          str = "./MyTTSDataset/wavs"
    preprocessed_dir: str = "./MyTTSDataset/preprocess"
    output_dir:       str = "./chatterbox_output"

    # --- Mode ---
    ljspeech:   bool = True   # True = LJSpeech CSV format
    json_format: bool = False  # True = JSON format
    preprocess: bool = True   # Set False after first run
    is_turbo:   bool = False  # False = normal Chatterbox, True = Turbo

    # --- Vocab (must match add_bangla_tokens.py output) ---
    new_vocab_size: int = 4240

    # --- Hyperparameters ---
    batch_size:    int   = 4      # adjust for your GPU VRAM
    grad_accum:    int   = 2      # effective batch = batch_size × grad_accum
    learning_rate: float = 5e-6   # keep low — T3 is sensitive
    num_epochs:    int   = 50

    save_steps:       int = 2000
    save_total_limit: int = 3

    # --- Constraints ---
    max_text_len:    int   = 256
    max_speech_len:  int   = 850   # truncates clips longer than ~8s
    prompt_duration: float = 3.0   # reference audio duration (seconds)

Batch size guide by VRAM:

VRAM           batch_size   grad_accum   Effective batch
8 GB           2            4            8
16 GB          4            2            8
24 GB          8            1            8
40 GB          16           1            16
80 GB (H100)   24           1            24
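The table follows one rule: shrink batch_size and grow grad_accum so their product (the effective batch) stays constant. A small sketch of that relationship (the helper name batch_plan is ours):

```python
def batch_plan(batch_size, target_effective=8):
    """Given the per-step batch a GPU can fit, pick grad_accum to reach the target effective batch."""
    grad_accum = max(1, target_effective // batch_size)
    return grad_accum, batch_size * grad_accum

# Mirrors the table above:
# batch_plan(2)  -> (4, 8)    # 8 GB
# batch_plan(4)  -> (2, 8)    # 16 GB
# batch_plan(8)  -> (1, 8)    # 24 GB
# batch_plan(24) -> (1, 24)   # 80 GB: headroom lets the effective batch grow instead
```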

Step 6: Preprocess Dataset

Preprocessing encodes every audio clip into discrete speech tokens (S3 codes) and saves them as .pt files. This only needs to be run once — subsequent training runs skip it.

# First run — preprocessing is ON by default (preprocess=True in config)
python train.py

The preprocessor will:

  1. Load each WAV file from wav_dir
  2. Resample to 24000 Hz if needed
  3. Extract a 3-second voice conditioning prompt from the start
  4. Encode audio → S3 speech tokens using S3Gen
  5. Tokenize text → token IDs using the extended tokenizer
  6. Save each sample as a .pt file in preprocessed_dir

Expected output:

Preprocessing sample 1/5000: bn_001 ...
Preprocessing sample 2/5000: bn_002 ...
...
Preprocessing complete. 4987 samples saved to ./MyTTSDataset/preprocess/

After preprocessing completes, set preprocess = False in src/config.py to skip it on future runs.
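After the run, a quick count check catches clips that were silently skipped (paths per the config above; this assumes the .pt files land directly in preprocessed_dir, and the helper name is ours):

```python
import glob
import os

def count_check(metadata_csv, preprocessed_dir):
    """Compare metadata rows against saved .pt files; a small gap (skipped clips) is normal."""
    with open(metadata_csv, encoding="utf-8") as f:
        n_rows = sum(1 for line in f if line.strip())
    n_pt = len(glob.glob(os.path.join(preprocessed_dir, "*.pt")))
    return n_rows, n_pt

# rows, saved = count_check("MyTTSDataset/metadata.csv", "MyTTSDataset/preprocess")
# print(f"{saved}/{rows} samples preprocessed")
```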


Step 7: Train the Model

# Make sure preprocess=False if you've already preprocessed
python train.py

What train.py does internally:

# 1. Load original Chatterbox T3 weights
tts = ChatterboxTTS.from_local(cfg.model_dir, device="cpu")

# 2. Create a new T3 with the extended Bangla vocab size
t3_cfg = tts.t3.hp
t3_cfg.text_tokens_dict_size = cfg.new_vocab_size   # e.g. 4240
new_t3 = T3(hp=t3_cfg)

# 3. Transfer all original weights; randomly init only the new embedding rows
new_t3 = resize_and_load_t3_weights(new_t3, tts.t3.state_dict())

# 4. Freeze S3Gen and VoiceEncoder — only T3 trains
for param in tts.s3gen.parameters(): param.requires_grad = False
for param in tts.ve.parameters():    param.requires_grad = False
for param in new_t3.parameters():    param.requires_grad = True

# 5. HuggingFace Trainer with cosine LR, weight decay, gradient checkpointing
trainer = Trainer(
    model=ChatterboxTrainerWrapper(new_t3),
    args=TrainingArguments(
        learning_rate=5e-6,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        bf16=True,
        gradient_checkpointing=True,
        ...
    ),
)

# 6. Auto-resume from latest checkpoint if one exists
trainer.train(resume_from_checkpoint=last_ckpt)

Monitoring training with TensorBoard:

tensorboard --logdir ./chatterbox_output

Training auto-resumes from the latest checkpoint in ./chatterbox_output/ if you stop and restart.
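Auto-resume relies on HF Trainer's checkpoint-<step> directory naming. Locating the latest one looks roughly like this (a stdlib sketch of what transformers.trainer_utils.get_last_checkpoint does):

```python
import os
import re

def last_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best
```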


Step 8: Train on Modal (Cloud — Recommended)

For H100 training (~10× faster than a 24GB GPU):

Install Modal:

pip install modal
modal setup   # opens browser for authentication

Create Modal volumes (one-time setup):

modal volume create xtts-finetune-data
modal volume create chatterbox-v2-output

Upload your dataset to Modal volume:

# Upload metadata CSV
modal volume put xtts-finetune-data ./MyTTSDataset/metadata.csv dataset/metadata_train.csv

# Upload WAV files (use a loop for large datasets)
modal volume put xtts-finetune-data ./MyTTSDataset/wavs/ dataset/wavs/

Configure train_modal.py:

vol     = modal.Volume.from_name("xtts-finetune-data")
vol_out = modal.Volume.from_name("chatterbox-v2-output")

CSV_PATH = "/data/dataset/metadata_train.csv"
WAV_DIR  = "/data/dataset/wavs"

Run training:

# Launch (detached — runs in background)
python -m modal run --detach train_modal.py

# Monitor logs
modal app logs <app-id>

Download a checkpoint after training:

# List available checkpoints
python -m modal volume ls chatterbox-v2-output

# Download a specific checkpoint's weights
python -m modal volume get chatterbox-v2-output \
    checkpoint-456000/model.safetensors \
    ./chatterbox_output/checkpoint-456000_model.safetensors

H100 training speed reference:

Dataset size       Batch   Steps/epoch   Time to 500k steps
5h (~8k clips)     24      ~333          ~12 hours
10h (~16k clips)   24      ~667          ~24 hours
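The Steps/epoch column is just clips divided by batch size:

```python
def steps_per_epoch(num_clips, batch_size):
    """Optimizer steps per epoch when each step consumes one full batch (drop_last)."""
    return num_clips // batch_size

# Mirrors the table above (clip counts are approximate; the table rounds 666.7 to ~667):
# steps_per_epoch(8000, 24)  -> 333
# steps_per_epoch(16000, 24) -> 666
```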

Step 9: Export Checkpoint for Inference

HuggingFace Trainer saves full checkpoints as checkpoint-XXXXXX/model.safetensors inside output_dir. These are the T3 weights wrapped with a t3. key prefix.

The inference script handles this automatically:

state_dict = load_file(weights_path, device="cpu")
# Strip HF Trainer wrapper prefix
if any(k.startswith("t3.") for k in state_dict):
    state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
new_t3.load_state_dict(state_dict, strict=True)

You can also flatten the checkpoint for distribution:

from safetensors.torch import load_file, save_file

state_dict = load_file("./chatterbox_output/checkpoint-456000/model.safetensors")
# Strip prefix
state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
save_file(state_dict, "./t3_bangla_456k_clean.safetensors")

Hyperparameter Reference

Training

Parameter                Value used   Notes
learning_rate            5e-6         Lower than typical LLM fine-tuning — T3 is sensitive
lr_scheduler_type        cosine       Smooth decay, better than constant LR
warmup_ratio             0.05         5% of total steps as warmup
weight_decay             0.01         L2 regularization against overfitting
bf16                     True         Faster on A100/H100; use fp16=True on older GPUs
gradient_checkpointing   True         Saves ~40% VRAM at ~20% speed cost
batch_size               24 (H100)    Scale down for smaller GPUs

Inference

Parameter            Recommended   Notes
temperature          0.3           Lower = more stable Bangla; higher = more expressive
exaggeration         0.5           Voice style intensity (0 = neutral, 1 = strong)
cfg_weight           0.5           Classifier-free guidance strength
repetition_penalty   1.2           Reduces token repetition loops
min_new_tokens       150           Prevents early truncation of speech

Troubleshooting

Garbage audio / no speech from a later checkpoint:

  • This is overfitting. Quality typically peaks around 400k–500k steps for a 5h dataset. Beyond that the model degrades.
  • Use an earlier checkpoint (checkpoint-456000 recommended over checkpoint-888000 for this training run).
  • To prevent this: add load_best_model_at_end=True with a validation split in TrainingArguments.

KeyError or size mismatch when loading weights:

  • Ensure new_vocab_size in src/config.py exactly matches the number printed by add_bangla_tokens.py.
  • If using HF Trainer checkpoint, make sure the t3. prefix stripping code is applied.

Out of memory (OOM) during training:

  • Reduce batch_size by half and double grad_accum to keep effective batch size the same.
  • Enable gradient_checkpointing=True (already on by default).

Preprocessing is very slow:

  • Normal — encoding audio to S3 codes falls back to CPU when no GPU is available.
  • On GPU it takes ~1–2 hours for 10h of audio; on CPU expect 4–8 hours.
  • Set preprocess=False after the first run to skip it.

Reference audio not matching voice:

  • Reference audio must be clean and at least 3 seconds long.
  • The speaker in the reference should ideally match the training speaker.
  • Try recording a new reference with the same mic/conditions as your training data.

Training Repository

Full training code, scripts, and guides: github.com/EMTIAZZ/chatterbox-bangla-tts


License

This fine-tuned model follows the same license as the base Chatterbox TTS model (MIT). Please refer to Resemble AI's terms for commercial use.

