Chatterbox TTS — Bangla Fine-tuned
A fine-tuned version of Resemble AI's Chatterbox TTS adapted for Bangla (Bengali) speech synthesis with zero-shot voice cloning.
|  |  |
|---|---|
| Base model | ResembleAI/chatterbox |
| Fine-tuned component | T3 (text-to-token transformer) |
| Language | Bangla (বাংলা) |
| Training steps | ~888,000 |
| Training hardware | H100 80GB via Modal |
| Best quality checkpoint | ~456,000 steps |
Model Architecture
Chatterbox TTS has three components. Only T3 is fine-tuned — the rest are frozen:
Text ──► T3 (fine-tuned) ──► Speech Tokens ──► S3Gen (frozen) ──► Waveform
              ▲
      VoiceEncoder (frozen)
              ▲
       Reference Audio
| Component | Role | Fine-tuned? |
|---|---|---|
| T3 | LLM-style text → speech token prediction | ✅ Yes |
| S3Gen | Speech tokens → mel spectrogram → waveform | ❌ Frozen |
| VoiceEncoder (VE) | Encodes reference speaker audio | ❌ Frozen |
This approach is fast, stable, and avoids catastrophic forgetting of the speech decoder.
Quick Inference
Installation
pip install chatterbox-tts==0.1.2 safetensors soundfile silero-vad==6.2.0 huggingface_hub
Download model files
from huggingface_hub import hf_hub_download
weights_path = hf_hub_download(
    repo_id="EMTIAZZ/chatterbox-bangla-tts",
    filename="t3_bangla_888k.safetensors",
)
tokenizer_path = hf_hub_download(
    repo_id="EMTIAZZ/chatterbox-bangla-tts",
    filename="tokenizer.json",
)
Download base Chatterbox pretrained files
The base model files (S3Gen, VoiceEncoder, etc.) are from ResembleAI and must be downloaded separately:
import os, requests
from tqdm import tqdm
DEST_DIR = "./pretrained_models"
os.makedirs(DEST_DIR, exist_ok=True)
BASE_FILES = {
    "ve.safetensors": "https://huggingface.co/ResembleAI/chatterbox/resolve/main/ve.safetensors?download=true",
    "t3_cfg.safetensors": "https://huggingface.co/ResembleAI/chatterbox/resolve/main/t3_cfg.safetensors?download=true",
    "s3gen.safetensors": "https://huggingface.co/ResembleAI/chatterbox/resolve/main/s3gen.safetensors?download=true",
    "conds.pt": "https://huggingface.co/ResembleAI/chatterbox/resolve/main/conds.pt?download=true",
}
for fname, url in BASE_FILES.items():
    dest = os.path.join(DEST_DIR, fname)
    if not os.path.exists(dest):
        r = requests.get(url, stream=True)
        r.raise_for_status()  # fail loudly on a bad download instead of saving an error page
        with open(dest, "wb") as f:
            for chunk in r.iter_content(1024 * 1024):
                f.write(chunk)
        print(f"Downloaded {fname}")
# Copy the Bangla tokenizer into pretrained_models
import shutil
shutil.copy(tokenizer_path, os.path.join(DEST_DIR, "tokenizer.json"))
print("Tokenizer ready")
Run inference
import torch
import soundfile as sf
import numpy as np
from safetensors.torch import load_file
from chatterbox.tts import ChatterboxTTS
from chatterbox.models.t3.t3 import T3
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BASE_MODEL_DIR = "./pretrained_models"
FINETUNED_WEIGHTS = weights_path # from hf_hub_download
AUDIO_PROMPT = "./your_reference.wav" # 3–6 sec clean Bangla speech
NEW_VOCAB_SIZE = 4240
# Load base engine
tts = ChatterboxTTS.from_local(BASE_MODEL_DIR, device="cpu")
# Rebuild T3 with extended Bangla vocab
t3_cfg = tts.t3.hp
t3_cfg.text_tokens_dict_size = NEW_VOCAB_SIZE
new_t3 = T3(hp=t3_cfg)
# Load fine-tuned weights (strip HF Trainer wrapper prefix if present)
state_dict = load_file(FINETUNED_WEIGHTS, device="cpu")
if any(k.startswith("t3.") for k in state_dict):
    state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
new_t3.load_state_dict(state_dict, strict=True)
# Swap T3 into engine and move to device
tts.t3 = new_t3
tts.t3.to(DEVICE).eval()
tts.s3gen.to(DEVICE).eval()
tts.ve.to(DEVICE).eval()
tts.device = DEVICE
# Generate speech
text = "আমাদের গ্রাহক সেবায় আপনাকে স্বাগতম। আপনার যেকোনো সমস্যায় আমরা সাহায্য করতে প্রস্তুত।"
wav = tts.generate(
    text=text,
    audio_prompt_path=AUDIO_PROMPT,
    temperature=0.3,
    exaggeration=0.5,
    cfg_weight=0.5,
    repetition_penalty=1.2,
    min_new_tokens=150,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), tts.sr)
print("Saved to output.wav")
Splitting long text into sentences (better quality)
import re, numpy as np, soundfile as sf

def synthesize_long_text(tts, text, audio_prompt, output_path, **params):
    # Split on sentence boundaries, including the Bengali danda (।)
    sentences = re.split(r'(?<=[.?!।])\s*', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    chunks = []
    sr = tts.sr
    for i, sent in enumerate(sentences):
        print(f"[{i+1}/{len(sentences)}] {sent}")
        wav = tts.generate(text=sent, audio_prompt_path=audio_prompt, **params)
        chunks.append(wav.squeeze().cpu().numpy())
        chunks.append(np.zeros(int(sr * 0.25), dtype=np.float32))  # 250 ms pause between sentences
    final = np.concatenate(chunks)
    sf.write(output_path, final, sr)
    print(f"Saved to {output_path}")

synthesize_long_text(
    tts,
    text="আজকের আবহাওয়া বেশ সুন্দর। আমি বাজার থেকে তাজা সবজি কিনে এনেছি। রাতের খাবারে ভাত আর মাছের তরকারি রান্না হবে।",
    audio_prompt=AUDIO_PROMPT,
    output_path="output_long.wav",
    temperature=0.3,
    exaggeration=0.5,
    cfg_weight=0.5,
    repetition_penalty=1.2,
    min_new_tokens=150,
)
Reference audio tips:
- Use a 3–6 second clean Bangla speech clip from the target speaker
- Mono WAV, no background noise or music
- The model clones the voice style from this reference — quality depends heavily on it
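The tips above are easy to script. This is a minimal sketch using Python's standard `wave` module (`check_reference` is not part of the repo, and the sketch assumes an uncompressed PCM WAV; use `soundfile` for other encodings):

```python
import wave

def check_reference(path, min_s=3.0, max_s=6.0):
    """Return a list of problems with a voice-cloning reference clip."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        duration = w.getnframes() / w.getframerate()
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s range")
    return problems

# An empty list means the clip passes the basic checks:
# print(check_reference("./your_reference.wav"))
```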
Full Training Pipeline
This section covers how to reproduce this fine-tuned model from scratch, or adapt it further for your own Bangla dataset.
Step 1: Clone & Install
git clone https://github.com/EMTIAZZ/chatterbox-bangla-tts
cd chatterbox-bangla-tts
pip install -r requirements.txt
requirements.txt:
peft==0.17.1
torch==2.6.0
torchaudio==2.6.0
chatterbox-tts==0.1.2
silero-vad==6.2.0
librosa==0.11.0
soundfile==0.13.1
num2words
pandas
safetensors
tensorboard
omegaconf
pyloudnorm
huggingface_hub
hf_transfer
Step 2: Download Base Pretrained Models
python setup.py
This downloads the following files from ResembleAI into ./pretrained_models/:
| File | Description |
|---|---|
| `ve.safetensors` | VoiceEncoder weights |
| `t3_cfg.safetensors` | T3 model weights (base) |
| `s3gen.safetensors` | S3Gen decoder weights |
| `conds.pt` | Conditioning tensors |
| `tokenizer.json` | Base grapheme tokenizer |
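After `setup.py` finishes, it is worth confirming that every file actually landed. A small sketch (the `missing_files` helper is not part of the repo; the filename list mirrors the table above):

```python
import os

REQUIRED = ["ve.safetensors", "t3_cfg.safetensors", "s3gen.safetensors",
            "conds.pt", "tokenizer.json"]

def missing_files(model_dir="./pretrained_models"):
    """Return the base-model files that are not yet present in model_dir."""
    return [f for f in REQUIRED if not os.path.exists(os.path.join(model_dir, f))]

# An empty list means setup is complete:
# print(missing_files())
```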
Step 3: Prepare Your Dataset
Create an LJSpeech-format dataset:
MyTTSDataset/
├── metadata.csv
└── wavs/
├── bn_001.wav
├── bn_002.wav
└── ...
metadata.csv — pipe-separated, no header:
bn_001|আমি বাংলায় কথা বলছি।|আমি বাংলায় কথা বলছি।
bn_002|আজকের আবহাওয়া সুন্দর।|আজকের আবহাওয়া সুন্দর।
bn_003|আপনাকে স্বাগতম।|আপনাকে স্বাগতম।
Format: ID|RawText|NormalizedText
Audio requirements:
- Format: WAV, mono
- Sample rate: 22050 Hz or 24000 Hz (auto-resampled during preprocessing)
- Duration per clip: 2–12 seconds (shorter clips train better)
- Clean speech, no background noise or music
- Recommended dataset size: 2–10 hours
Recommended: normalize your audio first:
# Normalize loudness to -23 LUFS using ffmpeg
for f in MyTTSDataset/wavs/*.wav; do
ffmpeg -i "$f" -af loudnorm=I=-23:LRA=7:TP=-2 "${f%.wav}_norm.wav" -y
done
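Before moving on, a quick validation pass over the CSV and wav folder catches most format mistakes early. A minimal sketch (`validate_metadata` is not part of the repo; the three-field check matches the `ID|RawText|NormalizedText` format above):

```python
import os

def validate_metadata(csv_path, wav_dir):
    """Check the ID|RawText|NormalizedText format and that each referenced WAV exists."""
    errors = []
    with open(csv_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            parts = line.rstrip("\n").split("|")
            if len(parts) != 3:
                errors.append(f"line {lineno}: expected 3 fields, got {len(parts)}")
                continue
            wav = os.path.join(wav_dir, parts[0] + ".wav")
            if not os.path.exists(wav):
                errors.append(f"line {lineno}: missing {wav}")
    return errors

# for e in validate_metadata("MyTTSDataset/metadata.csv", "MyTTSDataset/wavs"):
#     print(e)
```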
Step 4: Bangla Tokenizer Adaptation
The base Chatterbox tokenizer does not contain Bangla Unicode characters. This step extends it.
You need an XTTS vocab.json that already contains Bangla tokens. You can get one from a pre-existing XTTS Bangla model, or use the one in this repo.
# Place your XTTS vocab.json at ./xtts_vocab.json, then run:
python add_bangla_tokens.py
What this script does:
- Loads the existing Chatterbox
tokenizer.json - Extracts all Bangla Unicode characters (U+0980–U+09FF) and BPE subwords from
xtts_vocab.json - Appends them to the Chatterbox tokenizer with new sequential IDs
- Adds Bengali dari
।as a punctuation token - Adds BPE merge rules for Bengali subwords
- Saves the extended tokenizer back to
pretrained_models/tokenizer.json
Output:
SUCCESS! Added 1240 new tokens
New vocab size: 4240
*** IMPORTANT: Update new_vocab_size in src/config.py to: 4240 ***
Update src/config.py with the printed vocab size:
new_vocab_size: int = 4240 # <- update this to match the printed value
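If you want to double-check the printed number later, the vocab size can be read back from the saved file. This sketch assumes the standard HF `tokenizers` JSON layout (`model.vocab` plus an optional `added_tokens` list); `tokenizer_vocab_size` is a hypothetical helper, not a repo function:

```python
import json

def tokenizer_vocab_size(path="./pretrained_models/tokenizer.json"):
    """Read a tokenizers-format tokenizer.json and return its total vocab size."""
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)
    ids = set(tok["model"]["vocab"].values())
    ids |= {t["id"] for t in tok.get("added_tokens", [])}
    return max(ids) + 1 if ids else 0

# Should match new_vocab_size in src/config.py (4240 for this run)
# print(tokenizer_vocab_size())
```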
Step 5: Configure Training
Edit src/config.py:
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # --- Paths ---
    model_dir: str = "./pretrained_models"
    csv_path: str = "./MyTTSDataset/metadata.csv"
    wav_dir: str = "./MyTTSDataset/wavs"
    preprocessed_dir: str = "./MyTTSDataset/preprocess"
    output_dir: str = "./chatterbox_output"

    # --- Mode ---
    ljspeech: bool = True        # True = LJSpeech CSV format
    json_format: bool = False    # True = JSON format
    preprocess: bool = True      # Set False after first run
    is_turbo: bool = False       # False = normal Chatterbox, True = Turbo

    # --- Vocab (must match add_bangla_tokens.py output) ---
    new_vocab_size: int = 4240

    # --- Hyperparameters ---
    batch_size: int = 4          # adjust for your GPU VRAM
    grad_accum: int = 2          # effective batch = batch_size × grad_accum
    learning_rate: float = 5e-6  # keep low — T3 is sensitive
    num_epochs: int = 50
    save_steps: int = 2000
    save_total_limit: int = 3

    # --- Constraints ---
    max_text_len: int = 256
    max_speech_len: int = 850    # truncates clips longer than ~8s
    prompt_duration: float = 3.0 # reference audio duration (seconds)
Batch size guide by VRAM:
| VRAM | batch_size | grad_accum | Effective batch |
|---|---|---|---|
| 8 GB | 2 | 4 | 8 |
| 16 GB | 4 | 2 | 8 |
| 24 GB | 8 | 1 | 8 |
| 40 GB | 16 | 1 | 16 |
| 80 GB (H100) | 24 | 1 | 24 |
Step 6: Preprocess Dataset
Preprocessing encodes every audio clip into discrete speech tokens (S3 codes) and saves them as .pt files. This only needs to be run once — subsequent training runs skip it.
# First run — preprocessing is ON by default (preprocess=True in config)
python train.py
The preprocessor will:
- Load each WAV file from `wav_dir`
- Resample to 24000 Hz if needed
- Extract a 3-second voice conditioning prompt from the start
- Encode audio → S3 speech tokens using S3Gen
- Tokenize text → token IDs using the extended tokenizer
- Save each sample as a `.pt` file in `preprocessed_dir`
Expected output:
Preprocessing sample 1/5000: bn_001 ...
Preprocessing sample 2/5000: bn_002 ...
...
Preprocessing complete. 4987 samples saved to ./MyTTSDataset/preprocess/
After preprocessing completes, set preprocess = False in src/config.py to skip it on future runs.
Step 7: Train the Model
# Make sure preprocess=False if you've already preprocessed
python train.py
What train.py does internally:
# 1. Load original Chatterbox T3 weights
tts = ChatterboxTTS.from_local(cfg.model_dir, device="cpu")
# 2. Create a new T3 with the extended Bangla vocab size
t3_cfg = tts.t3.hp
t3_cfg.text_tokens_dict_size = cfg.new_vocab_size # e.g. 4240
new_t3 = T3(hp=t3_cfg)
# 3. Transfer all original weights; randomly init only the new embedding rows
new_t3 = resize_and_load_t3_weights(new_t3, tts.t3.state_dict())
# 4. Freeze S3Gen and VoiceEncoder — only T3 trains
for param in tts.s3gen.parameters(): param.requires_grad = False
for param in tts.ve.parameters(): param.requires_grad = False
for param in new_t3.parameters(): param.requires_grad = True
# 5. HuggingFace Trainer with cosine LR, weight decay, gradient checkpointing
trainer = Trainer(
    model=ChatterboxTrainerWrapper(new_t3),
    args=TrainingArguments(
        learning_rate=5e-6,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        bf16=True,
        gradient_checkpointing=True,
        # ...
    ),
)
# 6. Auto-resume from latest checkpoint if one exists
trainer.train(resume_from_checkpoint=last_ckpt)
Monitoring training with TensorBoard:
tensorboard --logdir ./chatterbox_output
Training auto-resumes from the latest checkpoint in ./chatterbox_output/ if you stop and restart.
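The resume logic amounts to picking the highest-numbered `checkpoint-N` directory in the output folder. A stand-alone sketch of that idea (`latest_checkpoint` is illustrative; HF's `transformers.trainer_utils.get_last_checkpoint` does roughly the same thing):

```python
import os
import re

def latest_checkpoint(output_dir="./chatterbox_output"):
    """Return the path of the highest-step checkpoint-N directory, or None."""
    ckpts = []
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            ckpts.append((int(m.group(1)), name))
    if not ckpts:
        return None
    # Compare numerically, so checkpoint-10000 beats checkpoint-2000
    return os.path.join(output_dir, max(ckpts)[1])
```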
Step 8: Train on Modal (Cloud — Recommended)
For H100 training (~10× faster than a 24GB GPU):
Install Modal:
pip install modal
modal setup # opens browser for authentication
Create Modal volumes (one-time setup):
modal volume create xtts-finetune-data
modal volume create chatterbox-v2-output
Upload your dataset to Modal volume:
# Upload metadata CSV
modal volume put xtts-finetune-data ./MyTTSDataset/metadata.csv dataset/metadata_train.csv
# Upload WAV files (use a loop for large datasets)
modal volume put xtts-finetune-data ./MyTTSDataset/wavs/ dataset/wavs/
Configure train_modal.py:
vol = modal.Volume.from_name("xtts-finetune-data")
vol_out = modal.Volume.from_name("chatterbox-v2-output")
CSV_PATH = "/data/dataset/metadata_train.csv"
WAV_DIR = "/data/dataset/wavs"
Run training:
# Launch (detached — runs in background)
python -m modal run --detach train_modal.py
# Monitor logs
modal app logs <app-id>
Download a checkpoint after training:
# List available checkpoints
python -m modal volume ls chatterbox-v2-output
# Download a specific checkpoint's weights
python -m modal volume get chatterbox-v2-output \
checkpoint-456000/model.safetensors \
./chatterbox_output/checkpoint-456000_model.safetensors
H100 training speed reference:
| Dataset size | Batch | Steps/epoch | Time to 500k steps |
|---|---|---|---|
| 5h (~8k clips) | 24 | ~333 | ~12 hours |
| 10h (~16k clips) | 24 | ~667 | ~24 hours |
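The steps/epoch column is just the clip count divided by the effective batch size. A tiny helper to estimate it for your own dataset (an assumption here: the incomplete last batch is dropped, which is why the numbers are approximate):

```python
def steps_per_epoch(num_clips, batch_size, grad_accum=1):
    """Optimizer steps per epoch, assuming drop-last batching."""
    return num_clips // (batch_size * grad_accum)

# e.g. steps_per_epoch(8000, 24) gives 333, matching the table above
```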
Step 9: Export Checkpoint for Inference
HuggingFace Trainer saves full checkpoints as checkpoint-XXXXXX/model.safetensors inside output_dir. These are the T3 weights wrapped with a t3. key prefix.
The inference script handles this automatically:
state_dict = load_file(weights_path, device="cpu")
# Strip HF Trainer wrapper prefix
if any(k.startswith("t3.") for k in state_dict):
    state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
new_t3.load_state_dict(state_dict, strict=True)
You can also flatten the checkpoint for distribution:
from safetensors.torch import load_file, save_file
state_dict = load_file("./chatterbox_output/checkpoint-456000/model.safetensors")
# Strip prefix
state_dict = {k[len("t3."):]: v for k, v in state_dict.items() if k.startswith("t3.")}
save_file(state_dict, "./t3_bangla_456k_clean.safetensors")
Hyperparameter Reference
Training
| Parameter | Value used | Notes |
|---|---|---|
| `learning_rate` | `5e-6` | Lower than typical LLM fine-tuning — T3 is sensitive |
| `lr_scheduler_type` | `cosine` | Smooth decay, better than constant LR |
| `warmup_ratio` | `0.05` | 5% of total steps as warmup |
| `weight_decay` | `0.01` | L2 regularization against overfitting |
| `bf16` | `True` | Faster on A100/H100; use `fp16=True` on older GPUs |
| `gradient_checkpointing` | `True` | Saves ~40% VRAM at ~20% speed cost |
| `batch_size` | `24` (H100) | Scale down for smaller GPUs |
Inference
| Parameter | Recommended | Notes |
|---|---|---|
| `temperature` | `0.3` | Lower = more stable Bangla; higher = more expressive |
| `exaggeration` | `0.5` | Voice style intensity (0 = neutral, 1 = strong) |
| `cfg_weight` | `0.5` | Classifier-free guidance strength |
| `repetition_penalty` | `1.2` | Reduces token repetition loops |
| `min_new_tokens` | `150` | Prevents early truncation of speech |
Troubleshooting
Garbage audio / no speech from a later checkpoint:
- This is overfitting. Quality typically peaks around 400k–500k steps for a 5h dataset. Beyond that the model degrades.
- Use an earlier checkpoint (`checkpoint-456000` is recommended over `checkpoint-888000` for this training run).
- To prevent this, add `load_best_model_at_end=True` with a validation split in `TrainingArguments`.
KeyError or size mismatch when loading weights:
- Ensure `new_vocab_size` in `src/config.py` exactly matches the number printed by `add_bangla_tokens.py`.
- If you are loading an HF Trainer checkpoint, make sure the `t3.` prefix-stripping code is applied.
Out of memory (OOM) during training:
- Reduce `batch_size` by half and double `grad_accum` to keep the effective batch size the same.
- Enable `gradient_checkpointing=True` (already on by default).
Preprocessing is very slow:
- Normal — encoding audio to S3 codes runs on CPU by default if no GPU is available.
- On GPU it takes ~1–2 hours for 10h of audio; on CPU expect 4–8 hours.
- Set `preprocess=False` after the first run to skip it.
Reference audio not matching voice:
- Reference audio must be clean and at least 3 seconds long.
- The speaker in the reference should ideally match the training speaker.
- Try recording a new reference with the same mic/conditions as your training data.
Training Repository
Full training code, scripts, and guides: github.com/EMTIAZZ/chatterbox-bangla-tts
License
This fine-tuned model follows the same license as the base Chatterbox TTS model (MIT). Please refer to Resemble AI's terms for commercial use.
Credits
- Resemble AI — Chatterbox TTS — base model and architecture
- Bangla adaptation and fine-tuning by @EMTIAZZ
- Trained on H100 via Modal