Chiluka TTS

Chiluka (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on StyleTTS2.

It supports style transfer from reference audio: give it a voice sample and it will speak in that voice.

Available Models

| Model | Name | Languages | Speakers | Description |
|---|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Installation

pip install chiluka

Or from GitHub:

pip install git+https://github.com/PurviewVoiceBot/chiluka.git

System dependency (required for phonemization):

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
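
Before running any synthesis, it can be worth confirming the binary is actually visible. A minimal check, assuming only that `espeak-ng` should be on your `PATH` (this verifies presence, not phonemizer behavior):

```python
import shutil

# Chiluka relies on the espeak-ng binary for phonemization; a missing
# binary typically surfaces as a cryptic phonemizer error at synthesis time.
espeak_found = shutil.which("espeak-ng") is not None
print("espeak-ng found:", espeak_found)
```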

Quick Start

from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)

# Save output
tts.save_wav(wav, "output.wav")

Choose a Model

from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")

Hindi Example

tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")

Telugu Example

tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")

PyTorch Hub

import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")

Synthesis Parameters

| Parameter | Default | Description |
|---|---|---|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps give better quality at slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
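
The `alpha` and `beta` knobs linearly interpolate between the style extracted from the reference audio and the style predicted by the diffusion model. A minimal sketch of that interpolation, assuming styles are plain 128-dim vectors (the real model splits one style vector into acoustic and prosodic halves; the function name here is ours):

```python
import numpy as np

def mix_style(predicted, reference, weight):
    """Linear interpolation: weight=0 -> reference style, weight=1 -> predicted style."""
    return weight * predicted + (1.0 - weight) * reference

rng = np.random.default_rng(0)
acoustic_ref, acoustic_pred = rng.standard_normal((2, 128))
prosody_ref, prosody_pred = rng.standard_normal((2, 128))

acoustic = mix_style(acoustic_pred, acoustic_ref, weight=0.3)  # alpha
prosody = mix_style(prosody_pred, prosody_ref, weight=0.7)     # beta
```

In practice this means low `alpha`/`beta` stays close to the reference speaker, while high values let the text-conditioned diffusion sample dominate.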

How It Works

Chiluka uses a StyleTTS2-based pipeline:

  1. Text is converted to phonemes using espeak-ng
  2. PL-BERT encodes text into contextual embeddings
  3. Reference audio is processed to extract a style vector
  4. Diffusion model samples a style conditioned on text
  5. Prosody predictor generates duration, pitch (F0), and energy
  6. HiFi-GAN decoder synthesizes the final waveform at 24kHz
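
The six steps above can be sketched as a dataflow. Every function below is a hypothetical stub standing in for a real sub-model (shapes and wiring only, not the Chiluka API); the 300x upsampling matches the decoder's upsample rates (10 * 5 * 3 * 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def phonemize(text):                 # 1. espeak-ng in the real pipeline
    return list(text)                #    toy version: one "phoneme" per character

def plbert_encode(phonemes):         # 2. contextual text embeddings
    return rng.standard_normal((len(phonemes), 768))

def extract_style(reference_audio):  # 3. style vector from reference audio
    return rng.standard_normal(256)

def sample_style(text_emb, ref_style, steps=5):  # 4. diffusion-sampled style
    return ref_style + 0.1 * rng.standard_normal(ref_style.shape)

def predict_prosody(text_emb, style):  # 5. duration, F0, energy per phoneme
    durations = rng.integers(1, 10, size=len(text_emb))  # frames per phoneme
    f0 = rng.standard_normal(durations.sum())            # one value per frame
    energy = rng.standard_normal(durations.sum())
    return durations, f0, energy

def decode(frames, style):           # 6. HiFi-GAN: 300x upsampling to 24 kHz audio
    return rng.standard_normal(frames * 300)

phonemes = phonemize("hello")
text_emb = plbert_encode(phonemes)
style = sample_style(text_emb, extract_style("reference.wav"))
durations, f0, energy = predict_prosody(text_emb, style)
wav = decode(int(durations.sum()), style)  # 1-D waveform at 24 kHz
```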

Model Architecture

  • Text Encoder: Token embedding + CNN + BiLSTM
  • Style Encoder: Conv2D + Residual blocks (style_dim=128)
  • Prosody Predictor: LSTM-based with AdaIN normalization
  • Diffusion Model: Transformer-based denoiser with ADPM2 sampler
  • Decoder: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
  • Pretrained sub-models: PL-BERT (text), ASR (alignment), JDC (pitch)
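
The decoder's upsample rates determine the model's frame rate: their product is the hop size in samples, so at 24 kHz each mel frame covers 300 samples (12.5 ms). A quick sanity check:

```python
import math

upsample_rates = [10, 5, 3, 2]
hop = math.prod(upsample_rates)        # samples generated per mel frame
sample_rate = 24_000
frames_per_second = sample_rate / hop

print(hop, frames_per_second)  # 300 80.0
```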

File Structure

├── configs/
│   ├── config_ft.yml                 # Telugu model config
│   └── config_hindi_english.yml      # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth           # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth           # Hindi-English checkpoint (~2GB)
├── pretrained/                       # Shared pretrained sub-models
│   ├── ASR/                          # Text-to-mel alignment
│   ├── JDC/                          # Pitch extraction (F0)
│   └── PLBERT/                       # Text encoder
├── models/                           # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                      # Main API
├── hub.py                            # HuggingFace Hub utilities
└── text_utils.py                     # Phoneme tokenization

Requirements

  • Python >= 3.8
  • PyTorch >= 1.13.0
  • CUDA recommended (works on CPU too)
  • espeak-ng system package

Limitations

  • Requires a reference audio file for style/voice transfer
  • Quality depends on the reference audio quality
  • Best results with 3-15 second reference clips
  • Hindi-English model trained on 5 speakers
  • Telugu model trained on 1 speaker
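
Since results are best with 3-15 second reference clips, it can be worth validating clip length before synthesis. A small helper using only the standard library (the bounds come from the note above; the helper name is ours, and it handles PCM WAV only):

```python
import wave

def reference_duration_ok(path, min_s=3.0, max_s=15.0):
    """Return (duration_seconds, ok) for a PCM WAV reference clip."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    return duration, min_s <= duration <= max_s
```

Compressed references (mp3, flac) would need a library such as `soundfile` or `librosa` to measure instead.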

Citation

Based on StyleTTS2:

@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mishra, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}

License

MIT License
