---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---
# Chiluka TTS
**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.
## Available Models
| Model | Name | Languages | Speakers |
|-------|------|-----------|----------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 |
| **Telugu** | `telugu` | Telugu, English | 1 |
## Installation
```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
# Required system dependency
sudo apt-get install espeak-ng # Ubuntu/Debian
```
## Usage
Model weights download automatically on first use.
```python
from chiluka import Chiluka
# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()
# Or Telugu model
# tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)
tts.save_wav(wav, "output.wav")
```
### Hindi
```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```
### Telugu
```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```
## Streaming Audio
For WebRTC, WebSocket, or HTTP streaming:
```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
# Get audio as bytes (no disk write)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3") # requires pydub + ffmpeg
wav_bytes = tts.to_audio_bytes(wav, format="wav")
pcm_bytes = tts.to_audio_bytes(wav, format="pcm") # raw 16-bit PCM
# Stream chunked audio (`websocket` stands for your application's open connection)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)  # PCM chunks by default
# Stream as MP3 chunks (`response` stands for your HTTP framework's response object)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)
```
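To make "raw 16-bit PCM" concrete, here is a minimal standard-library sketch of how float samples map to PCM bytes, how those bytes can be cut into chunks, and how they can be wrapped in a WAV container in memory. It does not use Chiluka's internals: `float_to_pcm16`, `chunk_bytes`, and `pcm16_to_wav_bytes` are hypothetical helpers, and the 24 kHz mono little-endian layout is an assumption (it is the rate StyleTTS2 commonly uses), so check it against your model before relying on it.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumption: not stated in this README


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_bytes(data, chunk_size=4096):
    """Yield consecutive slices of at most chunk_size bytes.

    Keep chunk_size even so a 2-byte sample is never split across chunks.
    """
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def pcm16_to_wav_bytes(pcm, sample_rate=SAMPLE_RATE):
    """Wrap raw mono 16-bit PCM in a WAV container without touching disk."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


# Stand-in for synthesized audio: 0.1 s of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2400)]
pcm = float_to_pcm16(tone)
chunks = list(chunk_bytes(pcm))
wav_bytes = pcm16_to_wav_bytes(pcm)
```

A receiver of raw PCM chunks has to know the sample rate, sample width, and channel count out of band, which is why the WAV wrapper is convenient when the consumer is a generic audio player.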
## Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en-us"` | espeak-ng language code (see below) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps improve quality but slow synthesis |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
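
The `alpha` and `beta` rows describe a blend between the style taken from the reference audio and the style predicted from the text. In the StyleTTS2 reference inference code this blend is a linear interpolation; the sketch below assumes Chiluka does the same. `mix_style` is a hypothetical helper written for illustration, not part of the `chiluka` API, and the two-element vectors stand in for real style embeddings.

```python
def mix_style(reference, predicted, weight):
    """Linearly blend two style vectors.

    weight = 0 returns the reference style, weight = 1 the predicted style.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    if len(reference) != len(predicted):
        raise ValueError("style vectors must have the same length")
    return [weight * p + (1.0 - weight) * r for r, p in zip(reference, predicted)]


# Toy vectors standing in for real style embeddings.
reference_style = [1.0, 0.0]
predicted_style = [0.0, 1.0]

acoustic = mix_style(reference_style, predicted_style, 0.3)  # alpha default
prosodic = mix_style(reference_style, predicted_style, 0.7)  # beta default
```

With the defaults, the acoustic style stays closer to the reference voice (`alpha = 0.3`) while the prosody leans toward what the model predicts from the text (`beta = 0.7`).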
## Language Codes
| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
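
Because each model accepts only some language codes, it can help to validate the pair before synthesizing. The sketch below transcribes the table above into a dictionary; `SUPPORTED_LANGUAGES` and `check_language` are hypothetical names introduced here, and the library may perform its own validation.

```python
# Transcribed from the Language Codes table; keys are model names.
SUPPORTED_LANGUAGES = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}


def check_language(model, language):
    """Return `language` if `model` supports it, otherwise raise ValueError."""
    try:
        supported = SUPPORTED_LANGUAGES[model]
    except KeyError:
        raise ValueError(f"unknown model {model!r}") from None
    if language not in supported:
        raise ValueError(
            f"{language!r} is not available in {model!r}; "
            f"choose one of {sorted(supported)}"
        )
    return language


language = check_language("telugu", "te")
```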
## Architecture
- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
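
The vocoder's upsample rates multiply together to give the number of audio samples produced per frame entering that stage. The arithmetic below is a small sketch of that relationship; the 24 kHz sample rate is an assumption (this README does not state it), and other stages of the decoder may change the frame rate before the vocoder.

```python
from math import prod

UPSAMPLE_RATES = (10, 5, 3, 2)  # from the Decoder entry above
SAMPLE_RATE = 24000  # assumption: not stated in this README

# Each frame entering the vocoder's upsampling stack becomes this many samples.
samples_per_frame = prod(UPSAMPLE_RATES)

# Frame rate implied by the assumed sample rate.
frames_per_second = SAMPLE_RATE / samples_per_frame


def frames_to_seconds(num_frames, sample_rate=SAMPLE_RATE):
    """Audio duration produced by `num_frames` vocoder input frames."""
    return num_frames * samples_per_frame / sample_rate
```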
## Requirements
- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)
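
A quick way to check these requirements before the first run is a small script like the one below. It is a sketch using only the standard library; `check_environment` is a hypothetical helper, and it only checks that each dependency is present, not that PyTorch meets the minimum version or that CUDA is usable.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which requirements from the list above appear to be present."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "espeak-ng": shutil.which("espeak-ng") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        # The next two are only needed for MP3/OGG output.
        "pydub (optional)": importlib.util.find_spec("pydub") is not None,
        "ffmpeg (optional)": shutil.which("ffmpeg") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name:20} {'ok' if ok else 'MISSING'}")
```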
## Citation
Based on StyleTTS2:
```bibtex
@inproceedings{li2024styletts,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
booktitle={NeurIPS},
year={2024}
}
```
## License
MIT License
## Links
- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **HuggingFace**: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)