---
language:
  - en
  - hi
  - te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - styletts2
  - voice-cloning
  - multi-language
  - hindi
  - english
  - telugu
  - multi-speaker
  - style-transfer
---

# Chiluka TTS

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).

It supports **style transfer from reference audio**: give it a voice sample, and it will speak in that style.

## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

## Installation

```bash
pip install chiluka
```

Or from GitHub:

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

**System dependency** (required for phonemization):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
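
A missing espeak-ng usually only surfaces later, at phonemization time. A quick way to verify it is installed before running synthesis (a generic PATH check, not part of the chiluka API):

```python
import shutil
import subprocess

def espeak_ng_available() -> bool:
    """Return True if the espeak-ng binary is on PATH."""
    return shutil.which("espeak-ng") is not None

if espeak_ng_available():
    # Print the installed version string
    print(subprocess.run(["espeak-ng", "--version"],
                         capture_output=True, text=True).stdout.strip())
else:
    print("espeak-ng not found; install it before running synthesis")
```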

## Quick Start

```python
from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)

# Save output
tts.save_wav(wav, "output.wav")
```

## Choose a Model

```python
from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```

## Hindi Example

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

## Telugu Example

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## PyTorch Hub

```python
import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")
```

## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps improve quality at the cost of speed |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
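
`alpha` and `beta` blend the style extracted from the reference audio with the style predicted from the text. A toy sketch of that linear interpolation (the actual mixing happens inside `synthesize`; the vectors and function here are illustrative, not chiluka internals):

```python
def mix_styles(reference, predicted, weight):
    """Linearly interpolate two style vectors.
    weight = 0.0 -> pure reference, weight = 1.0 -> pure predicted."""
    return [(1.0 - weight) * r + weight * p
            for r, p in zip(reference, predicted)]

ref_style = [1.0, 0.0, 0.5]   # toy style vector from the reference audio
pred_style = [0.0, 1.0, 0.5]  # toy style vector predicted from text

# The default alpha=0.3 keeps the acoustic mix close to the reference voice
acoustic = mix_styles(ref_style, pred_style, weight=0.3)
print(acoustic)
```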

## How It Works

Chiluka uses a StyleTTS2-based pipeline:

1. **Text** is converted to phonemes using espeak-ng
2. **PL-BERT** encodes text into contextual embeddings
3. **Reference audio** is processed to extract a style vector
4. **Diffusion model** samples a style conditioned on text
5. **Prosody predictor** generates duration, pitch (F0), and energy
6. **HiFi-GAN decoder** synthesizes the final waveform at 24 kHz
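
The data flow above can be sketched with stub stages. Every function, shape, and return value here is an illustrative placeholder, not the chiluka implementation:

```python
def text_to_phonemes(text):
    # Stage 1: espeak-ng would produce a phoneme string (stub: characters)
    return list(text.lower())

def encode_text(phonemes):
    # Stage 2: PL-BERT contextual embeddings (stub: one vector per phoneme)
    return [[0.0] * 4 for _ in phonemes]

def extract_style(reference_audio_path):
    # Stage 3: style encoder output for the reference clip (stubbed)
    return [0.0] * 8

def sample_style(embeddings, ref_style, steps=5):
    # Stage 4: diffusion sampler conditioned on text, seeded by the reference
    return ref_style

def predict_prosody(embeddings, style):
    # Stage 5: duration / F0 / energy per phoneme (stub: one frame each)
    return {"durations": [1] * len(embeddings)}

def decode(embeddings, prosody, style):
    # Stage 6: HiFi-GAN would emit a waveform; stub returns silence,
    # 300 samples per frame (the product of the vocoder upsample rates)
    return [0.0] * sum(prosody["durations"]) * 300

phonemes = text_to_phonemes("Hello")
emb = encode_text(phonemes)
style = sample_style(emb, extract_style("reference.wav"))
wav = decode(emb, predict_prosody(emb, style), style)
print(len(wav))  # 5 phonemes * 1 frame * 300 samples = 1500
```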

## Model Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
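
A quick sanity check on the vocoder numbers, assuming (as is standard for HiFi-GAN) that the upsample rates multiply out to the mel hop length:

```python
upsample_rates = [10, 5, 3, 2]
sample_rate = 24_000  # output sample rate in Hz

# Total upsampling factor = samples generated per mel frame
hop_length = 1
for rate in upsample_rates:
    hop_length *= rate

print(hop_length)                # 300 samples per mel frame
print(sample_rate / hop_length)  # 80.0 mel frames per second
```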

## File Structure

```
├── configs/
│   ├── config_ft.yml                 # Telugu model config
│   └── config_hindi_english.yml      # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth           # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth           # Hindi-English checkpoint (~2GB)
├── pretrained/                       # Shared pretrained sub-models
│   ├── ASR/                          # Text-to-mel alignment
│   ├── JDC/                          # Pitch extraction (F0)
│   └── PLBERT/                       # Text encoder
├── models/                           # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                      # Main API
├── hub.py                            # HuggingFace Hub utilities
└── text_utils.py                     # Phoneme tokenization
```

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA GPU recommended (CPU inference also works)
- espeak-ng system package

## Limitations

- Requires a reference audio file for style/voice transfer
- Quality depends on the reference audio quality
- Best results with 3-15 second reference clips
- Hindi-English model trained on 5 speakers
- Telugu model trained on 1 speaker
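
Since quality degrades outside the 3-15 second range, it can be worth validating the reference clip before synthesis. A sketch using only the standard library (assumes the reference is an uncompressed WAV file; the helper names are hypothetical, not part of the chiluka API):

```python
import wave

def reference_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_reference(path, min_s=3.0, max_s=15.0):
    """Raise if the reference clip is outside the recommended length range."""
    duration = reference_duration_seconds(path)
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"reference is {duration:.1f}s; best results need "
            f"{min_s:.0f}-{max_s:.0f}s clips")
    return duration
```

For example, `check_reference("reference.wav")` would raise on a 30-second clip but pass a 5-second one.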

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2024styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghuman, Vinay S and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```

## License

MIT License

## Links

- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)