Chatterbox Turbo — Hindi/Hinglish Finetuned

Finetuned Chatterbox Turbo (350M, GPT-2 backbone) for Hindi (romanized) and English text-to-speech with voice cloning.

Key Features

  • Bilingual: Speaks both Hindi (romanized Latin script) and English
  • Hinglish: Handles code-mixed Hindi-English seamlessly
  • Voice Cloning: Provide any 5-10s reference audio to clone the voice
  • Fast: Single-step decoder, ~6x faster than real-time on GPU

How It Works

Hindi text is written in romanized form (Latin script), not Devanagari. This allows the GPT-2 BPE tokenizer to handle it natively without any vocabulary extension.

Example: "bharat ke kisan bahut mehnat karte hai" instead of "भारत के किसान बहुत मेहनत करते हैं"
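
Why this works at the tokenizer level: GPT-2's BPE is byte-level, so Devanagari survives only as raw UTF-8 byte fallbacks (three bytes per character), while romanized Latin text maps onto ordinary, well-trained merges. A minimal sketch of the size difference (pure Python, no tokenizer dependency):

```python
# Devanagari characters cost 3 UTF-8 bytes each, so GPT-2's byte-level
# BPE must fall back to byte-piece tokens; romanized ASCII does not.
devanagari = "भारत के किसान"
romanized = "bharat ke kisan"

print(len(devanagari), len(devanagari.encode("utf-8")))  # 13 chars, 35 bytes
print(len(romanized), len(romanized.encode("utf-8")))    # 15 chars, 15 bytes
```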

Usage

Prerequisites

pip install chatterbox-tts safetensors torch torchaudio soundfile

Quick Inference

import torch
import soundfile as sf
from safetensors.torch import load_file
from chatterbox.tts_turbo import ChatterboxTurboTTS
from chatterbox.models.t3.t3 import T3

# Load base Chatterbox Turbo
engine = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Load finetuned T3 weights
t3_config = engine.t3.hp
t3_config.text_tokens_dict_size = 50276
new_t3 = T3(hp=t3_config)
# The finetuned checkpoint does not include the backbone's word-token
# embedding, so drop it before loading with strict=True
if hasattr(new_t3.tfmr, "wte"):
    del new_t3.tfmr.wte

state_dict = load_file("t3_turbo_finetuned.safetensors", device="cpu")
new_t3.load_state_dict(state_dict, strict=True)

engine.t3 = new_t3
engine.t3.to("cuda").eval()

# Generate speech
wav = engine.generate(
    text="yeh ek bahut acchi baat hai ki hum sab milkar kaam kar rahe hai.",
    audio_prompt_path="reference.wav",  # 5-10s reference clip of target voice
    temperature=0.5,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 24000)

Text Format

  • Hindi: Use romanized text (Latin script). Example: "namaste, mera naam Ketav hai"
  • English: Use as-is. Example: "Hello, my name is Ketav"
  • Hinglish: Mix freely. Example: "mujhe lagta hai ki yeh project bahut successful hoga"

Romanization Guide

Common Hindi romanization patterns used in training:

  • है → hai
  • में → mein
  • यह → yeh
  • वो → voh
  • नहीं → nahi
  • बहुत → bahut
  • क्योंकि → kyonki
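
These mappings amount to a lookup table. An illustrative sketch (the real pipeline uses IndicXlit plus a loanword dictionary, not a hand-written table; `romanize` is a hypothetical helper):

```python
# Toy word-level transliteration using the patterns above; purely illustrative.
ROMAN = {"है": "hai", "में": "mein", "यह": "yeh", "वो": "voh",
         "नहीं": "nahi", "बहुत": "bahut", "क्योंकि": "kyonki"}

def romanize(sentence: str) -> str:
    # Unknown words pass through unchanged.
    return " ".join(ROMAN.get(word, word) for word in sentence.split())

print(romanize("यह बहुत है"))  # yeh bahut hai
```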

Inference Tips

  • Temperature 0.5 recommended (lower = more precise pronunciation)
  • Reference audio must be >5 seconds
  • Clean reference audio with minimal background noise works best
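
The ">5 seconds" constraint is easy to enforce before calling the model. A small validation sketch using only the standard library (`reference_duration` is a hypothetical helper, not part of chatterbox):

```python
import wave

def reference_duration(path: str, min_seconds: float = 5.0) -> float:
    """Return the clip length in seconds, rejecting clips under min_seconds."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / float(w.getframerate())
    if seconds < min_seconds:
        raise ValueError(f"reference is {seconds:.1f}s; need >= {min_seconds}s")
    return seconds

# Demo: synthesize 6 s of 24 kHz mono silence and validate it.
with wave.open("ref_check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000 * 6)

print(reference_duration("ref_check.wav"))  # 6.0
```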

Training Details

Data

  • 14,085 samples (~20.4 hours) from a single male Hindi/English speaker
  • 7,320 Hindi samples (romanized via IndicXlit + loanword dictionary)
  • 6,765 English samples (original text)
  • Duration filtered to 1-15 seconds per clip
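
The duration filter in the last bullet amounts to a range check over clip metadata; a sketch (names are illustrative, not the actual training script):

```python
# Keep only clips between 1 and 15 seconds, as in the data prep above.
def filter_by_duration(rows, lo=1.0, hi=15.0):
    """rows: iterable of (clip_path, duration_seconds) pairs."""
    return [(path, dur) for path, dur in rows if lo <= dur <= hi]

rows = [("a.wav", 0.5), ("b.wav", 4.2), ("c.wav", 15.0), ("d.wav", 20.0)]
print(filter_by_duration(rows))  # [('b.wav', 4.2), ('c.wav', 15.0)]
```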

Text Processing Pipeline

  1. Indic Normalize (DevanagariNormalizer)
  2. English loanword replacement (23,019 entry dictionary)
  3. IndicXlit transliteration cache (235,973 entries)
  4. Lowercase + standardize romanization (62 rules)
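
Step 4 can be pictured as lowercasing followed by a cascade of regex substitutions. The rules below are illustrative guesses, not the actual 62 rules used in training:

```python
import re

# Hypothetical subset of the romanization-standardization rules (step 4).
RULES = [
    (r"\bhain\b", "hai"),    # collapse spelling variants of the copula
    (r"\bnahin\b", "nahi"),
    (r"\bkyunki\b", "kyonki"),
]

def standardize(text: str) -> str:
    text = text.lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(standardize("Bharat ke kisan bahut mehnat karte hain"))
# bharat ke kisan bahut mehnat karte hai
```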

Hyperparameters

  • Base model: ResembleAI/chatterbox-turbo
  • Vocab size: 50,276 (original GPT-2, no extension)
  • Batch size: 16, gradient accumulation: 2 (effective 32)
  • Learning rate: 5e-5
  • Epochs: 100
  • Best checkpoint: step 38,000, loss 0.6685
  • GPU: NVIDIA RTX 3090 (24GB)
  • Training time: ~14 hours

Loss Curve

  • Epoch 1: 7.204
  • Epoch 10: 3.672
  • Epoch 20: 2.162
  • Epoch 30: 1.519
  • Epoch 50: 0.938
  • Epoch 80: 0.669

Files

  • t3_turbo_finetuned.safetensors: Finetuned T3 model weights (1.6 GB)
  • inference.py: Inference script with test sentences
  • reference.wav: Sample reference audio for voice cloning
  • config.py: Training configuration used
  • TRAINING_NOTES.md: Detailed training documentation

Limitations

  • Only handles romanized Hindi text, not Devanagari script
  • Voice quality depends on reference audio quality
  • May merge words at high temperature (use 0.5)
  • Trained on a single male speaker; voice cloning still works with any reference voice, but the learned Hindi pronunciation patterns come from that one speaker
