Chiluka TTS

Chiluka (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on StyleTTS2.

It supports style transfer from reference audio: give it a voice sample and it will speak in that voice.

Available Models

| Model | Name | Languages | Speakers | Description |
|---|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Installation

pip install chiluka

Or from GitHub:

pip install git+https://github.com/PurviewVoiceBot/chiluka.git

System dependency (required for phonemization):

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
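
Before running any synthesis, it can be worth confirming the binary is actually visible. A minimal check, assuming only that `espeak-ng` should be on your `PATH` (this verifies presence, not phonemizer behavior):

```python
import shutil

# Chiluka relies on the espeak-ng binary for phonemization; a missing
# binary typically surfaces as a cryptic phonemizer error at synthesis time.
espeak_found = shutil.which("espeak-ng") is not None
print("espeak-ng found:", espeak_found)
```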

Quick Start

from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)

# Save output
tts.save_wav(wav, "output.wav")

Choose a Model

from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")

Hindi Example

tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")

Telugu Example

tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")

PyTorch Hub

import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")

Synthesis Parameters

| Parameter | Default | Description |
|---|---|---|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps give better quality at slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
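
The `alpha` and `beta` knobs linearly interpolate between the style extracted from the reference audio and the style predicted by the diffusion model. A minimal sketch of that interpolation, assuming styles are plain 128-dim vectors (the real model splits one style vector into acoustic and prosodic halves; the function name here is ours):

```python
import numpy as np

def mix_style(predicted, reference, weight):
    """Linear interpolation: weight=0 -> reference style, weight=1 -> predicted style."""
    return weight * predicted + (1.0 - weight) * reference

rng = np.random.default_rng(0)
acoustic_ref, acoustic_pred = rng.standard_normal((2, 128))
prosody_ref, prosody_pred = rng.standard_normal((2, 128))

acoustic = mix_style(acoustic_pred, acoustic_ref, weight=0.3)  # alpha
prosody = mix_style(prosody_pred, prosody_ref, weight=0.7)     # beta
```

In practice this means low `alpha`/`beta` stays close to the reference speaker, while high values let the text-conditioned diffusion sample dominate.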

How It Works

Chiluka uses a StyleTTS2-based pipeline:

  1. Text is converted to phonemes using espeak-ng
  2. PL-BERT encodes text into contextual embeddings
  3. Reference audio is processed to extract a style vector
  4. Diffusion model samples a style conditioned on text
  5. Prosody predictor generates duration, pitch (F0), and energy
  6. HiFi-GAN decoder synthesizes the final waveform at 24kHz
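
The six steps above can be sketched as a dataflow. Every function below is a hypothetical stub standing in for a real sub-model (shapes and wiring only, not the Chiluka API); the 300x upsampling matches the decoder's upsample rates (10 * 5 * 3 * 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def phonemize(text):                 # 1. espeak-ng in the real pipeline
    return list(text)                #    toy version: one "phoneme" per character

def plbert_encode(phonemes):         # 2. contextual text embeddings
    return rng.standard_normal((len(phonemes), 768))

def extract_style(reference_audio):  # 3. style vector from reference audio
    return rng.standard_normal(256)

def sample_style(text_emb, ref_style, steps=5):  # 4. diffusion-sampled style
    return ref_style + 0.1 * rng.standard_normal(ref_style.shape)

def predict_prosody(text_emb, style):  # 5. duration, F0, energy per phoneme
    durations = rng.integers(1, 10, size=len(text_emb))  # frames per phoneme
    f0 = rng.standard_normal(durations.sum())            # one value per frame
    energy = rng.standard_normal(durations.sum())
    return durations, f0, energy

def decode(frames, style):           # 6. HiFi-GAN: 300x upsampling to 24 kHz audio
    return rng.standard_normal(frames * 300)

phonemes = phonemize("hello")
text_emb = plbert_encode(phonemes)
style = sample_style(text_emb, extract_style("reference.wav"))
durations, f0, energy = predict_prosody(text_emb, style)
wav = decode(int(durations.sum()), style)  # 1-D waveform at 24 kHz
```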

Model Architecture

  • Text Encoder: Token embedding + CNN + BiLSTM
  • Style Encoder: Conv2D + Residual blocks (style_dim=128)
  • Prosody Predictor: LSTM-based with AdaIN normalization
  • Diffusion Model: Transformer-based denoiser with ADPM2 sampler
  • Decoder: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
  • Pretrained sub-models: PL-BERT (text), ASR (alignment), JDC (pitch)
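
The decoder's upsample rates determine the model's frame rate: their product is the hop size in samples, so at 24 kHz each mel frame covers 300 samples (12.5 ms). A quick sanity check:

```python
import math

upsample_rates = [10, 5, 3, 2]
hop = math.prod(upsample_rates)        # samples generated per mel frame
sample_rate = 24_000
frames_per_second = sample_rate / hop

print(hop, frames_per_second)  # 300 80.0
```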

File Structure

├── configs/
│   ├── config_ft.yml                 # Telugu model config
│   └── config_hindi_english.yml      # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth           # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth           # Hindi-English checkpoint (~2GB)
├── pretrained/                       # Shared pretrained sub-models
│   ├── ASR/                          # Text-to-mel alignment
│   ├── JDC/                          # Pitch extraction (F0)
│   └── PLBERT/                       # Text encoder
├── models/                           # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                      # Main API
├── hub.py                            # HuggingFace Hub utilities
└── text_utils.py                     # Phoneme tokenization

Requirements

  • Python >= 3.8
  • PyTorch >= 1.13.0
  • CUDA recommended (works on CPU too)
  • espeak-ng system package

Limitations

  • Requires a reference audio file for style/voice transfer
  • Quality depends on the reference audio quality
  • Best results with 3-15 second reference clips
  • Hindi-English model trained on 5 speakers
  • Telugu model trained on 1 speaker
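
Since results are best with 3-15 second reference clips, it can be worth validating clip length before synthesis. A small helper using only the standard library (the bounds come from the note above; the helper name is ours, and it handles PCM WAV only):

```python
import wave

def reference_duration_ok(path, min_s=3.0, max_s=15.0):
    """Return (duration_seconds, ok) for a PCM WAV reference clip."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return duration, min_s <= duration <= max_s
```

Compressed references (mp3, flac) would need a library such as `soundfile` or `librosa` to measure instead.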

Citation

Based on StyleTTS2:

@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mishra, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}

License

MIT License
