# Chatterbox Turbo — Hindi/Hinglish Finetuned
Finetuned Chatterbox Turbo (350M, GPT-2 backbone) for Hindi (romanized) and English text-to-speech with voice cloning.
## Key Features
- Bilingual: Speaks both Hindi (romanized Latin script) and English
- Hinglish: Handles code-mixed Hindi-English seamlessly
- Voice Cloning: Provide any 5-10s reference audio to clone the voice
- Fast: Single-step decoder, ~6x faster than real-time on GPU
## How It Works
Hindi text is written in romanized form (Latin script), not Devanagari. This allows the GPT-2 BPE tokenizer to handle it natively without any vocabulary extension.
Example: "bharat ke kisan bahut mehnat karte hai" instead of "भारत के किसान बहुत मेहनत करते हैं"
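The effect can be sketched with plain byte counts: GPT-2's BPE operates on UTF-8 bytes, each Devanagari character occupies three bytes, and GPT-2's merges were learned almost entirely on English text, so the Devanagari form hands the tokenizer far more raw units than the romanized form. A minimal illustration (byte counts, not actual token counts):

```python
# GPT-2's byte-level BPE starts from UTF-8 bytes; Devanagari characters
# are 3 bytes each, Latin characters 1 byte each.
devanagari = "भारत के किसान बहुत मेहनत करते हैं"
romanized = "bharat ke kisan bahut mehnat karte hai"

dev_bytes = len(devanagari.encode("utf-8"))
rom_bytes = len(romanized.encode("utf-8"))
print(dev_bytes, rom_bytes)  # the Devanagari form is more than twice as long in bytes
```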
## Usage

### Prerequisites

```bash
pip install chatterbox-tts safetensors torch torchaudio soundfile
```
### Quick Inference

```python
import torch
import soundfile as sf
from safetensors.torch import load_file
from chatterbox.tts_turbo import ChatterboxTurboTTS
from chatterbox.models.t3.t3 import T3

# Load base Chatterbox Turbo
engine = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Rebuild T3 with the finetuned vocab size, then load the finetuned weights
t3_config = engine.t3.hp
t3_config.text_tokens_dict_size = 50276
new_t3 = T3(hp=t3_config)

# Drop the backbone's input embedding table so strict loading matches
# the checkpoint, which does not include it
if hasattr(new_t3.tfmr, "wte"):
    del new_t3.tfmr.wte

state_dict = load_file("t3_turbo_finetuned.safetensors", device="cpu")
new_t3.load_state_dict(state_dict, strict=True)
engine.t3 = new_t3
engine.t3.to("cuda").eval()

# Generate speech
wav = engine.generate(
    text="yeh ek bahut acchi baat hai ki hum sab milkar kaam kar rahe hai.",
    audio_prompt_path="reference.wav",  # 5-10s reference clip of target voice
    temperature=0.5,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 24000)
```
## Text Format

- Hindi: Use romanized text (Latin script). Example: "namaste, mera naam Ketav hai"
- English: Use as-is. Example: "Hello, my name is Ketav"
- Hinglish: Mix freely. Example: "mujhe lagta hai ki yeh project bahut successful hoga"
## Romanization Guide
Common Hindi romanization patterns used in training:
| Hindi | Romanized |
|---|---|
| है | hai |
| में | mein |
| यह | yeh |
| वो | voh |
| नहीं | nahi |
| बहुत | bahut |
| क्योंकि | kyonki |
## Inference Tips
- Temperature 0.5 recommended (lower = more precise pronunciation)
- Reference audio must be >5 seconds
- Clean reference audio with minimal background noise works best
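The duration requirement can be checked cheaply before synthesis. A minimal sketch using Python's standard-library `wave` module (assumes the reference is a plain PCM WAV file; `wav_duration_seconds` is a helper introduced here, not part of the Chatterbox API):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return clip length in seconds by reading only the WAV header."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

# Example gate before calling engine.generate:
# if wav_duration_seconds("reference.wav") < 5.0:
#     raise ValueError("reference clip must be longer than 5 seconds")
```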
## Training Details

### Data
- 14,085 samples (~20.4 hours) from a single male Hindi/English speaker
- 7,320 Hindi samples (romanized via IndicXlit + loanword dictionary)
- 6,765 English samples (original text)
- Duration filtered to 1-15 seconds per clip
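As a sanity check, these totals imply an average clip length of about five seconds, comfortably inside the 1-15 second filter:

```python
total_hours = 20.4
num_samples = 14_085

avg_seconds = total_hours * 3600 / num_samples
print(f"average clip length: {avg_seconds:.2f}s")  # about 5.2 seconds
```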
### Text Processing Pipeline

1. Devanagari normalization (Indic NLP Library's `DevanagariNormalizer`)
2. English loanword replacement (23,019-entry dictionary)
3. IndicXlit transliteration via a precomputed cache (235,973 entries)
4. Lowercasing + romanization standardization (62 rules)
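The final lowercase-and-standardize step can be sketched as a rule table plus regex substitution. The rules below are hypothetical stand-ins (the actual 62 rules are not reproduced here), chosen only to match the spellings in the Romanization Guide:

```python
import re

# Hypothetical standardization rules for illustration; the finetune used
# 62 such substitutions, which are not published in this card.
RULES = [
    (r"\bnahin\b", "nahi"),
    (r"\byah\b", "yeh"),
    (r"\bvo\b", "voh"),
]

def standardize(text: str) -> str:
    """Lowercase, apply spelling rules, and collapse whitespace."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()

print(standardize("Vo nahin aayega"))  # -> "voh nahi aayega"
```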
### Hyperparameters

- Base model: ResembleAI/chatterbox-turbo
- Vocab size: 50,276 (original GPT-2 tokenizer, no extension)
- Batch size: 16, gradient accumulation: 2 (effective 32)
- Learning rate: 5e-5
- Epochs: 100
- Best checkpoint: step 38,000, loss 0.6685
- GPU: NVIDIA RTX 3090 (24GB)
- Training time: ~14 hours
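A back-of-envelope check (assuming one optimizer step per effective batch of 32 and no partial batches) places the best checkpoint near epoch 86, which lines up with the loss plateauing around 0.67 by epoch 80:

```python
num_samples = 14_085
effective_batch = 16 * 2          # batch size x gradient accumulation

steps_per_epoch = num_samples // effective_batch
best_epoch = 38_000 / steps_per_epoch
print(steps_per_epoch, round(best_epoch, 1))  # 440 steps/epoch, ~epoch 86.4
```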
### Loss Curve
| Epoch | Loss |
|---|---|
| 1 | 7.204 |
| 10 | 3.672 |
| 20 | 2.162 |
| 30 | 1.519 |
| 50 | 0.938 |
| 80 | 0.669 |
## Files

| File | Description |
|---|---|
| `t3_turbo_finetuned.safetensors` | Finetuned T3 model weights (1.6 GB) |
| `inference.py` | Inference script with test sentences |
| `reference.wav` | Sample reference audio for voice cloning |
| `config.py` | Training configuration used |
| `TRAINING_NOTES.md` | Detailed training documentation |
## Limitations
- Only handles romanized Hindi text, not Devanagari script
- Voice quality depends on reference audio quality
- May merge words at high temperature (use 0.5)
- Trained on a single male speaker; voice cloning works with any reference voice, but Hindi pronunciation patterns come from that one speaker
## Acknowledgments
- Resemble AI for Chatterbox Turbo
- gokhaneraslan/chatterbox-finetuning for the finetuning toolkit
- AI4Bharat IndicXlit for transliteration