# Chiluka TTS

Chiluka (చిలుక - Telugu for "parrot") is a lightweight, self-contained text-to-speech (TTS) inference package based on StyleTTS2. It supports style transfer from reference audio: give it a voice sample and it will speak in that style.
## Available Models

| Model | Name | Languages | Speakers | Description |
|---|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
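When the target language is known at runtime, the table above can be captured in a small lookup helper. This is an illustrative convenience only; `pick_model` and `MODEL_FOR_LANGUAGE` are not part of the chiluka API:

```python
# Illustrative mapping from language code to the model name in the table
# above. English appears in both models, so it maps to the default here.
MODEL_FOR_LANGUAGE = {
    "en": "hindi_english",
    "hi": "hindi_english",
    "te": "telugu",
}

def pick_model(language: str) -> str:
    """Return the model name for a language code, falling back to the default."""
    return MODEL_FOR_LANGUAGE.get(language, "hindi_english")

print(pick_model("te"))  # → telugu
```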
## Installation

```bash
pip install chiluka
```

Or from GitHub:

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

A system dependency is required for phonemization:

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
## Quick Start

```python
from chiluka import Chiluka

# Load the model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en",
)

# Save the output
tts.save_wav(wav, "output.wav")
```
## Choose a Model

```python
from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```
### Hindi Example

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",  # "Hello, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="hi",
)
tts.save_wav(wav, "hindi_output.wav")
```
### Telugu Example

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",  # "Hello, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="te",
)
tts.save_wav(wav, "telugu_output.wav")
```
## PyTorch Hub

```python
import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")
```
## Synthesis Parameters

| Parameter | Default | Description |
|---|---|---|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion steps; more steps = better quality, slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
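Conceptually, `alpha` and `beta` act as linear interpolation weights between the style extracted from the reference audio and the style predicted from the text, as in StyleTTS2. A pure-Python sketch of that mixing, with toy vectors standing in for real style embeddings:

```python
# Illustrative sketch of the alpha/beta style mixing: a weight of 0 keeps
# the reference style untouched, a weight of 1 uses only the predicted style.
def mix_styles(reference, predicted, weight):
    """Linearly interpolate two style vectors element-wise."""
    return [(1 - weight) * r + weight * p for r, p in zip(reference, predicted)]

ref_style = [1.0, 0.0, 0.5]   # toy "reference audio" style vector
pred_style = [0.0, 1.0, 0.5]  # toy "predicted from text" style vector

acoustic = mix_styles(ref_style, pred_style, weight=0.3)  # alpha default
prosodic = mix_styles(ref_style, pred_style, weight=0.7)  # beta default
print(acoustic)  # roughly [0.7, 0.3, 0.5]: still close to the reference
```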
## How It Works

Chiluka uses a StyleTTS2-based pipeline:

1. Text is converted to phonemes using espeak-ng
2. PL-BERT encodes the phonemes into contextual embeddings
3. Reference audio is processed to extract a style vector
4. A diffusion model samples a style conditioned on the text
5. The prosody predictor generates duration, pitch (F0), and energy
6. The HiFi-GAN decoder synthesizes the final waveform at 24 kHz
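The data flow above can be sketched with stand-in stages. Every function below is a stub introduced purely for illustration; the real implementations are the StyleTTS2 modules under `models/`:

```python
# Illustrative data-flow skeleton for the pipeline above. Each stage is a
# stub that records its name; real stages are espeak-ng, PL-BERT, the style
# encoder, the diffusion sampler, the prosody predictor, and HiFi-GAN.
trace = []

def stage(name):
    def run(*inputs):
        trace.append(name)
        return name  # stand-in for the stage's real output
    return run

phonemize       = stage("phonemes")        # espeak-ng
encode_text     = stage("text_embedding")  # PL-BERT
extract_style   = stage("ref_style")       # style encoder on reference audio
sample_style    = stage("style")           # diffusion sampler
predict_prosody = stage("prosody")         # duration / F0 / energy
vocode          = stage("waveform")        # HiFi-GAN decoder, 24 kHz

def synthesize(text, reference_audio):
    ph = phonemize(text)
    emb = encode_text(ph)
    ref = extract_style(reference_audio)
    sty = sample_style(emb, ref)
    pro = predict_prosody(emb, sty)
    return vocode(pro, sty)

synthesize("Hello!", "reference.wav")
print(trace)  # stages fire in the order listed above
```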
## Model Architecture

- **Text Encoder**: token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + residual blocks (`style_dim=128`)
- **Prosody Predictor**: LSTM-based, with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with an ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
File Structure
├── configs/
│ ├── config_ft.yml # Telugu model config
│ └── config_hindi_english.yml # Hindi-English model config
├── checkpoints/
│ ├── epoch_2nd_00017.pth # Telugu checkpoint (~2GB)
│ └── epoch_2nd_00029.pth # Hindi-English checkpoint (~2GB)
├── pretrained/ # Shared pretrained sub-models
│ ├── ASR/ # Text-to-mel alignment
│ ├── JDC/ # Pitch extraction (F0)
│ └── PLBERT/ # Text encoder
├── models/ # Model architecture code
│ ├── core.py
│ ├── hifigan.py
│ └── diffusion/
├── inference.py # Main API
├── hub.py # HuggingFace Hub utilities
└── text_utils.py # Phoneme tokenization
## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended (works on CPU too)
- espeak-ng system package
## Limitations

- Requires a reference audio file for style/voice transfer
- Output quality depends on the quality of the reference audio
- Best results with 3-15 second reference clips
- The Hindi-English model was trained on 5 speakers
- The Telugu model was trained on a single speaker
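Since output quality depends heavily on the reference clip, it can be worth checking the clip's duration before synthesis. A stdlib-only sketch, using the 3-15 second window from the guidance above (`check_reference` is a hypothetical helper, not part of the package, and assumes an uncompressed WAV file):

```python
import wave

def reference_duration(path):
    """Duration of a WAV file in seconds, using only the stdlib wave module."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def check_reference(path, min_s=3.0, max_s=15.0):
    """Warn if the reference clip falls outside the recommended window."""
    duration = reference_duration(path)
    if not (min_s <= duration <= max_s):
        print(f"warning: {path} is {duration:.1f}s; "
              f"best results with {min_s:.0f}-{max_s:.0f}s clips")
    return duration
```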
## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```
## License

MIT License

## Links

- GitHub: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- PyPI: [chiluka](https://pypi.org/project/chiluka/)