---
language:
  - en
  - hi
  - te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - styletts2
  - voice-cloning
  - multi-language
  - hindi
  - english
  - telugu
  - multi-speaker
  - style-transfer
---

# Chiluka TTS

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.

## Available Models

| Model | Name | Languages | Speakers |
|-------|------|-----------|----------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 |
| **Telugu** | `telugu` | Telugu, English | 1 |

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git

# Required system dependency
sudo apt-get install espeak-ng    # Ubuntu/Debian
```
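Because a missing espeak-ng typically surfaces only at synthesis time, it can help to verify the binary up front. This is a convenience check, not part of the chiluka API:

```python
import shutil

def has_espeak() -> bool:
    """True if the espeak-ng binary is discoverable on PATH."""
    return shutil.which("espeak-ng") is not None

if not has_espeak():
    print("warning: espeak-ng not found; install it (e.g. `sudo apt-get install espeak-ng`)")
```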

## Usage

Model weights download automatically on first use.

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Or Telugu model
# tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)
tts.save_wav(wav, "output.wav")
```

### Hindi

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For WebRTC, WebSocket, or HTTP streaming:

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# Get audio as bytes (no disk write)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")    # requires pydub + ffmpeg
wav_bytes = tts.to_audio_bytes(wav, format="wav")
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")    # raw 16-bit PCM

# Stream chunked audio
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)  # PCM chunks by default

# Stream as MP3 chunks
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)
```
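Raw PCM chunks carry no header, so if you collect streamed PCM on the receiving side you need to wrap it in a container before most players will accept it. A minimal sketch using only the standard library, assuming 16-bit mono audio at 24 kHz (check your model's actual sample rate):

```python
import io
import wave

def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a playable WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono
        w.setsampwidth(2)           # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# e.g. on the client, after collecting streamed chunks:
# wav_bytes = pcm_to_wav_bytes(b"".join(chunks))
```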

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en-us"` | espeak-ng language code (see below) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps improve quality at the cost of speed |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |

## Language Codes

| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |

## Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
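The HiFi-GAN upsample rates compose multiplicatively, giving the decoder's frame-to-sample ratio. A back-of-the-envelope check, assuming the rates listed above:

```python
from math import prod

upsample_rates = [10, 5, 3, 2]
# Each input frame expands to this many audio samples:
samples_per_frame = prod(upsample_rates)
print(samples_per_frame)  # 300
```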

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}
```

## License

MIT License

## Links

- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **HuggingFace**: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)