Chatterbox Turbo — Hindi/Hinglish Finetuned

Finetuned Chatterbox Turbo (350M, GPT-2 backbone) for Hindi (romanized) and English text-to-speech with voice cloning.

Key Features

  • Bilingual: Speaks both Hindi (romanized Latin script) and English
  • Hinglish: Handles code-mixed Hindi-English seamlessly
  • Voice Cloning: Provide any 5-10s reference audio to clone the voice
  • Fast: Single-step decoder, ~6x faster than real-time on GPU

How It Works

Hindi text is written in romanized form (Latin script), not Devanagari. This allows the GPT-2 BPE tokenizer to handle it natively without any vocabulary extension.

Example: "bharat ke kisan bahut mehnat karte hai" instead of "भारत के किसान बहुत मेहनत करते हैं"
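
Why this works at the tokenizer level: GPT-2's BPE is byte-level, so Devanagari survives only as raw UTF-8 byte fallbacks (three bytes per character), while romanized Latin text maps onto ordinary, well-trained merges. A minimal sketch of the size difference (pure Python, no tokenizer dependency):

```python
# Devanagari characters cost 3 UTF-8 bytes each, so GPT-2's byte-level
# BPE must fall back to byte-piece tokens; romanized ASCII does not.
devanagari = "भारत के किसान"
romanized = "bharat ke kisan"

print(len(devanagari), len(devanagari.encode("utf-8")))  # 13 chars, 35 bytes
print(len(romanized), len(romanized.encode("utf-8")))    # 15 chars, 15 bytes
```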

Usage

Prerequisites

pip install chatterbox-tts safetensors torch torchaudio soundfile

Quick Inference

import torch
import soundfile as sf
from safetensors.torch import load_file
from chatterbox.tts_turbo import ChatterboxTurboTTS
from chatterbox.models.t3.t3 import T3

# Load base Chatterbox Turbo
engine = ChatterboxTurboTTS.from_pretrained(device="cuda")

# Load finetuned T3 weights
t3_config = engine.t3.hp
t3_config.text_tokens_dict_size = 50276
new_t3 = T3(hp=t3_config)
# The finetuned checkpoint does not include the backbone's word-token
# embedding, so drop it before loading with strict=True
if hasattr(new_t3.tfmr, "wte"):
    del new_t3.tfmr.wte

state_dict = load_file("t3_turbo_finetuned.safetensors", device="cpu")
new_t3.load_state_dict(state_dict, strict=True)

engine.t3 = new_t3
engine.t3.to("cuda").eval()

# Generate speech
wav = engine.generate(
    text="yeh ek bahut acchi baat hai ki hum sab milkar kaam kar rahe hai.",
    audio_prompt_path="reference.wav",  # 5-10s reference clip of target voice
    temperature=0.5,
)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 24000)

Text Format

  • Hindi: Use romanized text (Latin script). Example: "namaste, mera naam Ketav hai"
  • English: Use as-is. Example: "Hello, my name is Ketav"
  • Hinglish: Mix freely. Example: "mujhe lagta hai ki yeh project bahut successful hoga"

Romanization Guide

Common Hindi romanization patterns used in training:

  • है → hai
  • में → mein
  • यह → yeh
  • वो → voh
  • नहीं → nahi
  • बहुत → bahut
  • क्योंकि → kyonki
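
These mappings amount to a lookup table. An illustrative sketch (the real pipeline uses IndicXlit plus a loanword dictionary, not a hand-written table; `romanize` is a hypothetical helper):

```python
# Toy word-level transliteration using the patterns above; purely illustrative.
ROMAN = {"है": "hai", "में": "mein", "यह": "yeh", "वो": "voh",
         "नहीं": "nahi", "बहुत": "bahut", "क्योंकि": "kyonki"}

def romanize(sentence: str) -> str:
    # Unknown words pass through unchanged.
    return " ".join(ROMAN.get(word, word) for word in sentence.split())

print(romanize("यह बहुत है"))  # yeh bahut hai
```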

Inference Tips

  • Temperature 0.5 recommended (lower = more precise pronunciation)
  • Reference audio must be >5 seconds
  • Clean reference audio with minimal background noise works best
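
The ">5 seconds" constraint is easy to enforce before calling the model. A small validation sketch using only the standard library (`reference_duration` is a hypothetical helper, not part of chatterbox):

```python
import wave

def reference_duration(path: str, min_seconds: float = 5.0) -> float:
    """Return the clip length in seconds, rejecting clips under min_seconds."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / float(w.getframerate())
    if seconds < min_seconds:
        raise ValueError(f"reference is {seconds:.1f}s; need >= {min_seconds}s")
    return seconds

# Demo: synthesize 6 s of 24 kHz mono silence and validate it.
with wave.open("ref_check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit PCM
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000 * 6)

print(reference_duration("ref_check.wav"))  # 6.0
```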

Training Details

Data

  • 14,085 samples (~20.4 hours) from a single male Hindi/English speaker
  • 7,320 Hindi samples (romanized via IndicXlit + loanword dictionary)
  • 6,765 English samples (original text)
  • Duration filtered to 1-15 seconds per clip
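
The duration filter in the last bullet amounts to a range check over clip metadata; a sketch (names are illustrative, not the actual training script):

```python
# Keep only clips between 1 and 15 seconds, as in the data prep above.
def filter_by_duration(rows, lo=1.0, hi=15.0):
    """rows: iterable of (clip_path, duration_seconds) pairs."""
    return [(path, dur) for path, dur in rows if lo <= dur <= hi]

rows = [("a.wav", 0.5), ("b.wav", 4.2), ("c.wav", 15.0), ("d.wav", 20.0)]
print(filter_by_duration(rows))  # [('b.wav', 4.2), ('c.wav', 15.0)]
```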

Text Processing Pipeline

  1. Indic Normalize (DevanagariNormalizer)
  2. English loanword replacement (23,019 entry dictionary)
  3. IndicXlit transliteration cache (235,973 entries)
  4. Lowercase + standardize romanization (62 rules)
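
Step 4 can be pictured as lowercasing followed by a cascade of regex substitutions. The rules below are illustrative guesses, not the actual 62 rules used in training:

```python
import re

# Hypothetical subset of the romanization-standardization rules (step 4).
RULES = [
    (r"\bhain\b", "hai"),    # collapse spelling variants of the copula
    (r"\bnahin\b", "nahi"),
    (r"\bkyunki\b", "kyonki"),
]

def standardize(text: str) -> str:
    text = text.lower()
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(standardize("Bharat ke kisan bahut mehnat karte hain"))
# bharat ke kisan bahut mehnat karte hai
```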

Hyperparameters

  • Base model: ResembleAI/chatterbox-turbo
  • Vocab size: 50,276 (original GPT-2, no extension)
  • Batch size: 16, gradient accumulation: 2 (effective 32)
  • Learning rate: 5e-5
  • Epochs: 100
  • Best checkpoint: step 38,000, loss 0.6685
  • GPU: NVIDIA RTX 3090 (24GB)
  • Training time: ~14 hours

Loss Curve

  • Epoch 1: 7.204
  • Epoch 10: 3.672
  • Epoch 20: 2.162
  • Epoch 30: 1.519
  • Epoch 50: 0.938
  • Epoch 80: 0.669

Files

  • t3_turbo_finetuned.safetensors: Finetuned T3 model weights (1.6 GB)
  • inference.py: Inference script with test sentences
  • reference.wav: Sample reference audio for voice cloning
  • config.py: Training configuration used
  • TRAINING_NOTES.md: Detailed training documentation

Limitations

  • Only handles romanized Hindi text, not Devanagari script
  • Voice quality depends on reference audio quality
  • May merge words at high temperature (use 0.5)
  • Trained on a single male speaker; voice cloning still works with any reference voice, but the learned Hindi pronunciation patterns come from that one speaker
