---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---
# Chiluka TTS
**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).
It supports **style transfer from reference audio**: give it a voice sample and it will speak in that style.
## Available Models
| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
## Installation
```bash
pip install chiluka
```
Or from GitHub:
```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```
**System dependency** (required for phonemization):
```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
```
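If synthesis fails with a phonemization error, it is worth confirming that the `espeak-ng` binary is actually on your `PATH`. A small stdlib-only sanity check (`has_espeak` is a hypothetical helper for illustration, not part of the chiluka API):

```python
import shutil
import subprocess

def has_espeak() -> bool:
    """Return True if the espeak-ng binary is available on PATH."""
    return shutil.which("espeak-ng") is not None

if has_espeak():
    # Print the installed version as a quick smoke test
    result = subprocess.run(["espeak-ng", "--version"],
                            capture_output=True, text=True)
    print(result.stdout.strip())
else:
    print("espeak-ng not found; install it with apt-get or brew")
```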
## Quick Start
```python
from chiluka import Chiluka
# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()
# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)
# Save output
tts.save_wav(wav, "output.wav")
```
## Choose a Model
```python
from chiluka import Chiluka
# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")
# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```
## Hindi Example
```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```
## Telugu Example
```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```
## PyTorch Hub
```python
import torch
# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')
wav = tts.synthesize("Hello!", "reference.wav", language="en")
```
## Synthesis Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | More steps = better quality, slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
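`alpha` and `beta` act like interpolation weights between the style extracted from the reference audio and the style sampled by the diffusion model. A minimal sketch of that mixing convention (an illustration of the table above, not chiluka's actual internals):

```python
def mix_style(reference, predicted, weight):
    """Linearly interpolate two style vectors.

    weight = 0.0 -> pure reference style, weight = 1.0 -> pure predicted style,
    matching the convention of `alpha` (acoustic) and `beta` (prosody) above.
    """
    return [(1.0 - weight) * r + weight * p for r, p in zip(reference, predicted)]

# With the default alpha=0.3, the acoustic style leans toward the reference:
ref = [1.0, 0.0]
pred = [0.0, 1.0]
print([round(x, 3) for x in mix_style(ref, pred, 0.3)])  # [0.7, 0.3]
```

With the defaults (`alpha=0.3`, `beta=0.7`), the acoustic style stays close to the reference voice while prosody is driven mostly by the diffusion model's prediction for the text.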
## How It Works
Chiluka uses a StyleTTS2-based pipeline:
1. **Text** is converted to phonemes using espeak-ng
2. **PL-BERT** encodes text into contextual embeddings
3. **Reference audio** is processed to extract a style vector
4. **Diffusion model** samples a style conditioned on text
5. **Prosody predictor** generates duration, pitch (F0), and energy
6. **HiFi-GAN decoder** synthesizes the final waveform at 24kHz
## Model Architecture
- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
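In a standard HiFi-GAN configuration the upsample rates multiply to the number of audio samples produced per mel frame, which ties the 24 kHz output rate to the model's frame rate. A quick check of the numbers above (assuming that standard relationship holds here):

```python
import math

upsample_rates = [10, 5, 3, 2]
hop_length = math.prod(upsample_rates)  # audio samples generated per mel frame
sample_rate = 24_000

print(hop_length)               # 300
print(sample_rate / hop_length) # 80.0 mel frames per second of audio
```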
## File Structure
```
├── configs/
│   ├── config_ft.yml              # Telugu model config
│   └── config_hindi_english.yml   # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth        # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth        # Hindi-English checkpoint (~2GB)
├── pretrained/                    # Shared pretrained sub-models
│   ├── ASR/                       # Text-to-mel alignment
│   ├── JDC/                       # Pitch extraction (F0)
│   └── PLBERT/                    # Text encoder
├── models/                        # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                   # Main API
├── hub.py                         # HuggingFace Hub utilities
└── text_utils.py                  # Phoneme tokenization
```
## Requirements
- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA GPU recommended (CPU inference is supported but slower)
- espeak-ng system package
## Limitations
- Requires a reference audio file for style/voice transfer
- Quality depends on the reference audio quality
- Best results with 3-15 second reference clips
- Hindi-English model trained on 5 speakers
- Telugu model trained on 1 speaker
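Since output quality depends heavily on the reference clip, it can help to validate its length before synthesis. A stdlib-only sketch using the 3-15 second guidance above (`check_reference` is a hypothetical helper, not part of the chiluka API):

```python
import wave

def reference_duration(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_reference(path: str, lo: float = 3.0, hi: float = 15.0) -> bool:
    """True if the reference clip falls in the recommended length window."""
    return lo <= reference_duration(path) <= hi

# Usage (assuming reference.wav exists):
# if not check_reference("reference.wav"):
#     print("Reference clip is outside the recommended 3-15 s range")
```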
## Citation
Based on StyleTTS2:
```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
```
## License
MIT License
## Links
- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)