# Chiluka

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight text-to-speech (TTS) inference package based on StyleTTS2. It transfers speaking style from a reference audio clip to the synthesized speech.

## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Model weights are hosted on [Hugging Face](https://huggingface.co/Seemanth/chiluka) and are downloaded automatically on first use.

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

Chiluka also requires the `espeak-ng` system package:

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```

## Quick Start

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)

# Save to file
tts.save_wav(wav, "output.wav")
```

### Load a Specific Model

```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```

## Examples

### Hindi

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

### English

```python
# Reuses the Hindi-English `tts` instance loaded in the Hindi example
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us"
)
tts.save_wav(wav, "english_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.

### Get Audio Bytes

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")

# MP3 bytes (requires pydub, plus ffmpeg installed on the system)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")

# Raw PCM bytes (16-bit signed int, for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")

# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```
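
Raw PCM carries no header, so a receiver has to know the sample rate and sample width out of band. If a downstream consumer expects a WAV container instead, the PCM bytes can be wrapped in memory with the standard library. The sketch below assumes mono, little-endian, 16-bit PCM at 24 kHz; `pcm_to_wav_bytes` is a hypothetical helper and not part of Chiluka, and the silent buffer only stands in for the output of `tts.to_audio_bytes(wav, format="pcm")`.

```python
import io
import struct
import wave


def pcm_to_wav_bytes(pcm_bytes: bytes, sr: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit signed PCM in a WAV container, entirely in memory.

    Hypothetical helper: assumes little-endian 16-bit samples.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sr)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# Stand-in for tts.to_audio_bytes(wav, format="pcm"): 2400 silent samples.
pcm_bytes = struct.pack("<2400h", *([0] * 2400))
wav_bytes = pcm_to_wav_bytes(pcm_bytes)
```

In practice `format="wav"` already covers this case; the helper is mainly useful when only PCM chunks are available, for example after streaming.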

### Stream Audio Chunks

```python
# Stream PCM chunks over WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200ms at 24kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```
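
When choosing a `chunk_size`, it helps to convert samples into playback time and payload size. The following is a small arithmetic sketch, not part of the Chiluka API: both function names are hypothetical, and the byte count assumes mono 16-bit PCM.

```python
def chunk_duration_ms(chunk_size: int, sr: int = 24000) -> float:
    """Playback time of one chunk, in milliseconds."""
    return 1000.0 * chunk_size / sr


def pcm_chunk_bytes(chunk_size: int, bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of one raw PCM chunk in bytes (assumes 16-bit mono by default)."""
    return chunk_size * bytes_per_sample * channels


default_ms = chunk_duration_ms(4800)    # default chunk size
half_ms = chunk_duration_ms(2400)       # the custom size used above
default_bytes = pcm_chunk_bytes(4800)
```

At 24 kHz the default of 4800 samples is 200 ms of audio, and halving it to 2400 samples gives 100 ms chunks, which lowers per-chunk latency at the cost of more messages.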

## API Reference

### Chiluka.from_pretrained()

```python
tts = Chiluka.from_pretrained(
    model="hindi_english",      # "hindi_english" or "telugu"
    device="cuda",              # "cuda" or "cpu" (auto-detects if None)
    force_download=False,       # Re-download even if cached
)
```
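
Device auto-detection in PyTorch projects is commonly a check of `torch.cuda.is_available()`. The sketch below shows that pattern under the assumption that Chiluka does something similar when `device` is `None`; `resolve_device` is a hypothetical name, and the real implementation may differ.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the requested device, or pick one when `device` is None.

    Hypothetical helper illustrating a common pattern, not Chiluka's code.
    """
    if device is not None:
        return device
    try:
        import torch  # imported lazily so the sketch runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


selected = resolve_device()
```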

### synthesize()

```python
wav = tts.synthesize(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    alpha=0.3,                    # Acoustic style mixing (0-1)
    beta=0.7,                     # Prosodic style mixing (0-1)
    diffusion_steps=5,            # Quality vs speed tradeoff
    embedding_scale=1.0,          # Classifier-free guidance
    sr=24000                      # Sample rate
)
```

### to_audio_bytes()

```python
audio_bytes = tts.to_audio_bytes(
    wav,                          # Numpy array from synthesize()
    format="mp3",                 # "wav", "mp3", "ogg", "flac", "pcm"
    sr=24000,                     # Sample rate
    bitrate="128k"                # Bitrate for mp3/ogg
)
```

### synthesize_stream()

```python
for chunk in tts.synthesize_stream(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    format="pcm",                 # "pcm", "wav", "mp3", "ogg"
    chunk_size=4800,              # Samples per chunk (200ms at 24kHz)
    sr=24000,                     # Sample rate
):
    process(chunk)
```

### Other Methods

```python
tts.save_wav(wav, "output.wav")                 # Save to WAV file
tts.play(wav)                                   # Play via speakers (requires pyaudio)
style = tts.compute_style("reference.wav")      # Get style embedding
```

## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `alpha` | 0.3 | Acoustic style mixing (0=reference only, 1=predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0=reference only, 1=predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |
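
One way to read `alpha` and `beta` is as linear interpolation weights between the style taken from the reference audio and the style predicted from the text. The sketch below illustrates that interpretation on plain lists; it is consistent with the endpoints in the table (0 = reference only, 1 = predicted only), but `mix_style` is a hypothetical function and the exact mixing done inside the model is an assumption here.

```python
from typing import List


def mix_style(reference: List[float], predicted: List[float], weight: float) -> List[float]:
    """Blend two style vectors: weight=0 keeps the reference, weight=1 keeps the prediction."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(reference) != len(predicted):
        raise ValueError("style vectors must have the same length")
    return [weight * p + (1.0 - weight) * r for r, p in zip(reference, predicted)]


reference_style = [1.0, 2.0]
predicted_style = [3.0, 4.0]
halfway = mix_style(reference_style, predicted_style, 0.5)
```

Under this reading, the defaults (`alpha=0.3`, `beta=0.7`) keep the acoustic style closer to the reference while letting the text drive more of the prosody.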

## Language Codes

These are espeak-ng language codes passed to the `language` parameter:

| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
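
Because each model supports a different set of codes, it can be useful to check the pairing before calling `synthesize()`. The lookup below simply encodes the table above; `SUPPORTED_LANGUAGES` and `is_language_supported` are hypothetical names, not part of the package.

```python
# Language codes per model, copied from the table above.
SUPPORTED_LANGUAGES = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}


def is_language_supported(model: str, language: str) -> bool:
    """Return True if the table lists `language` for `model`."""
    return language in SUPPORTED_LANGUAGES.get(model, set())


telugu_ok = is_language_supported("telugu", "te")
hindi_on_telugu = is_language_supported("telugu", "hi")
```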

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG output)
- pyaudio (only for `tts.play()`)

## Credits

Based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) by Yinghao Aaron Li et al.

## License

MIT License