---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---
# Chiluka TTS
**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).
It supports **style transfer from reference audio** - give it a voice sample and it will speak in that style.
## Available Models
| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
## Installation
```bash
pip install chiluka
```
Or from GitHub:
```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```
**System dependency** (required for phonemization):
```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
```
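Before loading a model, you can verify that the espeak-ng binary is actually on your `PATH`. A minimal stdlib check (the `espeak_available` helper is illustrative, not part of chiluka):

```python
import shutil

def espeak_available() -> bool:
    """Return True if the espeak-ng binary is found on the PATH."""
    return shutil.which("espeak-ng") is not None

if not espeak_available():
    print("espeak-ng not found; install it before using Chiluka.")
```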
## Quick Start
```python
from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en",
)

# Save output
tts.save_wav(wav, "output.wav")
```
## Choose a Model
```python
from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```
## Hindi Example
```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi",
)
tts.save_wav(wav, "hindi_output.wav")
```
## Telugu Example
```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te",
)
tts.save_wav(wav, "telugu_output.wav")
```
## PyTorch Hub
```python
import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")
```
## Synthesis Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = fully from reference audio, 1 = fully text-predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = fully from reference audio, 1 = fully text-predicted) |
| `diffusion_steps` | `5` | More steps = better quality, slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
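As a rough illustration of how these knobs trade speed for quality, the presets below bundle the documented parameters into keyword dictionaries. The preset names are our own, and the commented `synthesize` call mirrors the Quick Start:

```python
# Hypothetical presets over the documented synthesize() parameters.
# Fewer diffusion steps is faster; more steps tends to sound cleaner.
FAST = dict(alpha=0.3, beta=0.7, diffusion_steps=3, embedding_scale=1.0)
QUALITY = dict(alpha=0.3, beta=0.7, diffusion_steps=10, embedding_scale=1.0)

# Usage (requires chiluka to be installed):
# from chiluka import Chiluka
# tts = Chiluka.from_pretrained()
# wav = tts.synthesize(text="Hello!", reference_audio="ref.wav",
#                      language="en", **QUALITY)
```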
## How It Works
Chiluka uses a StyleTTS2-based pipeline:
1. **Text** is converted to phonemes using espeak-ng
2. **PL-BERT** encodes text into contextual embeddings
3. **Reference audio** is processed to extract a style vector
4. **Diffusion model** samples a style conditioned on text
5. **Prosody predictor** generates duration, pitch (F0), and energy
6. **HiFi-GAN decoder** synthesizes the final waveform at 24kHz
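The six steps above can be sketched as a single function. Every helper below is a trivial stand-in, not the real Chiluka internals; the sketch only shows how data flows between stages:

```python
# Stand-in stage functions; names, shapes and values are illustrative only.
def phonemize(text):                  # 1. espeak-ng grapheme-to-phoneme
    return list(text)

def plbert_encode(phonemes):          # 2. PL-BERT contextual embeddings
    return [0.0] * len(phonemes)

def style_encode(ref_audio_path):     # 3. style vector from reference audio
    return [0.1] * 128

def diffuse_style(emb, ref_style, steps=5):   # 4. diffusion-sampled style
    return ref_style

def predict_prosody(emb, style):      # 5. duration, F0 and energy
    return {"duration": len(emb), "f0": 0.0, "energy": 0.0}

def hifigan_decode(emb, style, prosody, sr=24000):  # 6. waveform at 24 kHz
    return [0.0] * sr  # one second of silent placeholder audio

def synthesize_sketch(text, reference_audio):
    emb = plbert_encode(phonemize(text))
    style = diffuse_style(emb, style_encode(reference_audio))
    return hifigan_decode(emb, style, predict_prosody(emb, style))
```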
## Model Architecture
- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
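One concrete consequence of the decoder's upsample rates: their product is the number of waveform samples produced per mel frame, i.e. the hop size. A small sanity-check sketch, assuming the 24 kHz output rate stated above:

```python
import math

upsample_rates = [10, 5, 3, 2]
hop = math.prod(upsample_rates)   # audio samples generated per mel frame
frame_ms = hop / 24000 * 1000     # frame duration at 24 kHz
print(hop, frame_ms)              # 300 samples per frame, 12.5 ms
```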
## File Structure
```
├── configs/
│   ├── config_ft.yml              # Telugu model config
│   └── config_hindi_english.yml   # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth        # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth        # Hindi-English checkpoint (~2GB)
├── pretrained/                    # Shared pretrained sub-models
│   ├── ASR/                       # Text-to-mel alignment
│   ├── JDC/                       # Pitch extraction (F0)
│   └── PLBERT/                    # Text encoder
├── models/                        # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                   # Main API
├── hub.py                         # HuggingFace Hub utilities
└── text_utils.py                  # Phoneme tokenization
```
## Requirements
- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended (works on CPU too)
- espeak-ng system package
## Limitations
- Requires a reference audio file for style/voice transfer
- Output quality depends heavily on the quality of the reference recording
- Best results with clean 3-15 second reference clips
- Hindi-English model trained on 5 speakers
- Telugu model trained on 1 speaker
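Since results are best with 3-15 second references, a quick stdlib check of a WAV clip's length before synthesis can save a bad run. The helper name is ours, not part of chiluka:

```python
import wave

def reference_duration_ok(path, lo=3.0, hi=15.0):
    """Return True if the WAV clip at `path` is between lo and hi seconds."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / float(w.getframerate())
    return lo <= seconds <= hi
```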
## Citation
Based on StyleTTS2:
```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mesgarani, Nima},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
```
## License
MIT License
## Links
- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)