Hinglish TTS — sub-100M (89.96M)

An 89.96M-parameter fixed-voice Hindi+English (Hinglish) code-switch TTS model, compressed from a 443M XTTS-v2 fine-tune to under 100M parameters while holding quality. On a held-out set sized for statistical power (n=225), it is statistically at parity with its own 265M teacher on code-switch accent, passes the naturalness (UTMOS) and voice-fidelity (SECS) checks, and has a lower runaway-generation rate.

It speaks 4 fixed voices (aadya, arjun, kaustubh, maya). It is not a zero-shot cloning model — dropping general speaker capacity is what makes <100M reachable for code-switch speech.

Lineage

| Model | Params | Repo |
|---|---|---|
| 443M original Hinglish fine-tune | 443M | harrrshall/xtts-v2-hinglish-synthetic |
| 265M distilled + RL | 265M | harrrshall/xtts-hinglish-265m |
| 90M staged-prune + RFT (this model) | 89.96M | this repo |

Model comparison (same held-out Hinglish set, n=225, same decode + scorer)

| Model | Params | Accent ↑ | SECS ↑ | Tail ↓ |
|---|---|---|---|---|
| XTTS-Hinglish-443M | 443M | 0.861 | 0.855 | 4.9% |
| XTTS-Hinglish-265M | 265M | 0.831 | 0.860 | 6.7% |
| XTTS-Hinglish-90M (this) | 89.96M | 0.820 | 0.851 | 4.4% |
| Kokoro-82M | 82M | 0.886* | n/a** | n/a |

* Kokoro's accent score (English-word recall) is favoured by its English-primary design.
** Kokoro uses its own single voice (no target-voice cloning) and is not code-switch tuned, so SECS does not apply.

The point of this row: a generic 82M TTS does not deliver fixed-voice Hindi-English code-switch; this 90M model does, at the 265M teacher's quality.

Certification (held-out n=225, paired vs the 265M teacher, bootstrap 95% CI + TOST)

| Axis | This 90M | 265M teacher | Delta | 95% CI |
|---|---|---|---|---|
| code-switch accent | 0.820 | 0.831 | -0.011 | [-0.038, +0.016] |
| voice fidelity (SECS) | 0.851 | 0.860 | -0.009 | [-0.014, -0.003] |
| runaway-tail rate | 4.4% | 6.7% | | |

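Accent is statistically on par with the teacher (delta -0.011, with a 95% CI that spans zero). SECS passes non-inferiority despite a small measurable drop (delta -0.009, with a CI that excludes zero), and the runaway-tail rate is lower than the teacher's (4.4% vs 6.7%). The result is a roughly 3x smaller model at the same code-switch quality.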
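The paired comparison above can be illustrated with a minimal sketch, assuming per-utterance scores for the student and teacher on the same 225 items. The function names, the percentile-bootstrap variant, and the equivalence margin are assumptions for illustration; the card does not state which bootstrap variant or which margin was used.

```python
import numpy as np

def paired_bootstrap_ci(student, teacher, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (student - teacher)."""
    diffs = np.asarray(student, dtype=float) - np.asarray(teacher, dtype=float)
    rng = np.random.default_rng(seed)
    n = diffs.size
    # Resample utterances with replacement, keeping each student/teacher pair together.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)

def tost_equivalent(student, teacher, margin, alpha=0.05, n_boot=10_000, seed=0):
    """Bootstrap TOST: equivalent at level alpha if the (1 - 2*alpha) CI
    lies strictly inside (-margin, +margin). The margin is a hypothetical input."""
    _, lo, hi = paired_bootstrap_ci(student, teacher, n_boot, 2 * alpha, seed)
    return bool(lo > -margin and hi < margin)

# Synthetic stand-in scores; these are NOT the model's real evaluation data.
rng = np.random.default_rng(1)
teacher_scores = rng.uniform(0.6, 1.0, size=225)
student_scores = teacher_scores - 0.01 + rng.normal(0.0, 0.05, size=225)

delta, lo, hi = paired_bootstrap_ci(student_scores, teacher_scores)
print(f"delta={delta:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
print("equivalent within +/-0.05:", tost_equivalent(student_scores, teacher_scores, margin=0.05))
```

A non-inferiority check, as used for SECS, only needs the lower bound to stay above the negative margin, which is why a CI that excludes zero can still pass.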

How it was built

  1. Structured width-prune, staged: d=1024 -> d=768 -> d=640, with a distillation-recovery pass between each cut. A one-shot d=1024 -> 640 cut failed (the model stopped following text); the staged route, with each student initialized from the recovered intermediate, reached parity. Heads 16 -> 10 (head_dim 64 kept), FFN 4096 -> 2560, 16 layers.
  2. Fixed-voice specialization: the speaker encoder + perceiver are dropped; 4 voices are baked (32x640 conditioning latents + a 512-d vocoder d-vector each). A learned 640 -> 1024 adapter feeds the frozen base XTTS HiFi-GAN.
  3. Multi-signal distillation from the 265M teacher (code CE + logit-KL + latent MSE/cos), then RFT on the model's own best rollouts to suppress the runaway tail and lock code-switch faithfulness.
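The staged width cut in step 1 can be sketched as follows, using the stated dimensions (head_dim 64, 16 layers, d=1024 -> 768 -> 640, FFN 4096 -> 2560). The L2-norm scoring used to pick heads and channels, and the function names, are assumptions for illustration; the card does not say which selection criterion student640.py uses, and the distillation-recovery pass between stages is omitted here.

```python
import numpy as np

HEAD_DIM = 64   # stated as kept through pruning
N_LAYERS = 16   # stated layer count

def block_weight_count(d_model, d_ffn):
    """Approximate weights in one transformer block: four d x d attention
    projections plus two d x d_ffn FFN matrices (biases and norms ignored)."""
    return 4 * d_model * d_model + 2 * d_model * d_ffn

def prune_projection(w, d_keep, head_dim=HEAD_DIM):
    """Structured slice of a (d_out, d_in) attention projection to (d_keep, d_keep).

    Output rows are removed one whole head at a time; input columns (residual
    channels) are removed individually. Both use an assumed L2-norm score.
    In a full model the kept residual channels must be the same in every layer.
    """
    n_heads = w.shape[0] // head_dim
    n_keep = d_keep // head_dim
    head_scores = np.linalg.norm(w.reshape(n_heads, head_dim, -1), axis=(1, 2))
    heads = np.sort(np.argsort(head_scores)[-n_keep:])
    rows = (heads[:, None] * head_dim + np.arange(head_dim)).ravel()
    cols = np.sort(np.argsort(np.linalg.norm(w, axis=0))[-d_keep:])
    return w[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024))   # random stand-in for one teacher projection
for d in (768, 640):                # staged cuts instead of a one-shot 1024 -> 640
    w = prune_projection(w, d)
    print(d, w.shape, w.shape[0] // HEAD_DIM, "heads")

blocks_1024 = N_LAYERS * block_weight_count(1024, 4096)
blocks_640 = N_LAYERS * block_weight_count(640, 2560)
print(f"block weights: {blocks_1024 / 1e6:.1f}M -> {blocks_640 / 1e6:.1f}M")
```

Under this rough count the 16 blocks shrink from about 201M to about 79M weights, which is in line with the overall 265M to 89.96M reduction; the remaining parameters are not broken down in the card.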

Usage

pip install coqui-tts soundfile
python inference.py --voice maya --text "आज office में एक important meeting है तो मैं busy रहूँगा" --out out.wav

The frozen HiFi-GAN vocoder, tokenizer, and DVAE come from the public base XTTS-v2 (auto-downloaded by coqui-tts on first run). Only the 90M GPT + adapter + baked voices are in student640b_rft.pt.

Notes

  • Write Hindi in Devanagari and English in Latin script, and use the language tag "hi". Spell numbers as words.
  • Split input longer than ~150 characters into chunks and synthesize each chunk separately.
  • Greedy decoding with repetition_penalty≈1.3 is the most faithful.
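One way to follow the chunking note is a small helper that splits on sentence boundaries, including the Devanagari danda (।), and falls back to word boundaries for over-long sentences. This is a sketch: `chunk_text` is a hypothetical helper, not part of inference.py, and the 150-character default simply mirrors the guideline above.

```python
import re

def chunk_text(text, max_chars=150):
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    A single word longer than max_chars is kept whole, so it can exceed the limit.
    """
    # Split after ., !, ? or the Devanagari danda when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-level pieces when one sentence alone is too long.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "आज office में एक important meeting है तो मैं busy रहूँगा। " * 6
for i, chunk in enumerate(chunk_text(sample)):
    print(i, len(chunk), chunk)
```

Each chunk can then be passed to inference.py through the --text flag shown under Usage, writing one file per chunk with --out; joining the resulting audio is left to the caller.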

Files

  • student640b_rft.pt — the 90M GPT + 640->1024 adapter + 4 baked voices
  • student640.py — the model definition (structured slicing, Student640, build_student_gpt)
  • inference.py — self-contained inference

UTMOS is English-MOS-trained and used here only as a relative not-degraded-vs-teacher signal, not an absolute Hinglish naturalness score.
