---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---

# Chiluka TTS

Chiluka (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on StyleTTS2 with style transfer from reference audio.

## Available Models

| Model | Name | Languages | Speakers |
|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 |
| Telugu | `telugu` | Telugu, English | 1 |

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git

# Required system dependency
sudo apt-get install espeak-ng  # Ubuntu/Debian
```

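Since `espeak-ng` is a system dependency that pip cannot install, it is worth confirming the binary is on your PATH before loading the model. A quick check using only the standard library (the apt package name above is the Ubuntu/Debian one; other distros may differ):

```python
import shutil

# Chiluka's language codes come from espeak-ng, so the binary must be on PATH
espeak = shutil.which("espeak-ng")
if espeak is None:
    raise RuntimeError("espeak-ng not found; install it first (e.g. sudo apt-get install espeak-ng)")
print(f"espeak-ng found at {espeak}")
```
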
## Usage

Model weights download automatically on first use.

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()
# Or Telugu model
# tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)

tts.save_wav(wav, "output.wav")
```

### Hindi

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)

tts.save_wav(wav, "hindi_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)

tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For WebRTC, WebSocket, or HTTP streaming:

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# Get audio as bytes (no disk write)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")  # requires pydub + ffmpeg
wav_bytes = tts.to_audio_bytes(wav, format="wav")
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")  # raw 16-bit PCM

# Stream chunked audio
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)  # PCM chunks by default

# Stream as MP3 chunks
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)
```

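As a slightly fuller sketch, the snippet below streams MP3 chunks over HTTP. Flask, the route, and the query parameter are illustrative assumptions and not Chiluka dependencies; only `Chiluka.from_pretrained` and `synthesize_stream` come from the API shown above:

```python
# Minimal HTTP streaming sketch; Flask is an assumption, not a Chiluka dependency.
from flask import Flask, Response, request
from chiluka import Chiluka

app = Flask(__name__)
tts = Chiluka.from_pretrained()  # load the model once at startup

@app.route("/tts")
def stream_tts():
    text = request.args.get("text", "Hello!")
    # synthesize_stream yields MP3 chunks lazily, so the response starts early
    chunks = tts.synthesize_stream(text, "reference.wav", format="mp3")
    return Response(chunks, mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(port=8000)
```
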
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for the voice style |
| `language` | `"en-us"` | espeak-ng language code (see below) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps improve quality but slow synthesis |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |

## Language Codes

| Language | Code | Available In |
|---|---|---|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |

## Architecture

- Text Encoder: Token embedding + CNN + BiLSTM
- Style Encoder: Conv2D + Residual blocks (style_dim=128)
- Prosody Predictor: LSTM-based with AdaIN normalization
- Diffusion Model: Transformer-based denoiser with ADPM2 sampler
- Decoder: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- Pretrained sub-models: PL-BERT (text), ASR (alignment), JDC (pitch)
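
The sketch below shows how these pieces combine in the usual StyleTTS2 way at inference time. It is conceptual only; the helper names are hypothetical and are not the package's internal API. The point is how `alpha` and `beta` mix the reference-derived style with the diffusion-predicted one:

```python
# Conceptual inference flow only; these helper names are hypothetical.
def synthesize_sketch(text, reference_wav, alpha=0.3, beta=0.7,
                      diffusion_steps=5, embedding_scale=1.0):
    phonemes = espeak_phonemize(text)            # espeak-ng front end
    text_feats = text_encoder(phonemes)          # embedding + CNN + BiLSTM

    # Style transfer: acoustic and prosodic style vectors from the reference clip
    ref_acoustic, ref_prosodic = style_encoder(mel_spectrogram(reference_wav))

    # Diffusion model predicts style vectors conditioned on the text
    pred_acoustic, pred_prosodic = style_diffusion(
        text_feats, steps=diffusion_steps, guidance=embedding_scale)

    # alpha/beta interpolate between reference (0) and predicted (1) styles
    acoustic = alpha * pred_acoustic + (1 - alpha) * ref_acoustic
    prosodic = beta * pred_prosodic + (1 - beta) * ref_prosodic

    # AdaIN-conditioned prosody predictor gives durations, pitch, and energy
    durations, f0, energy = prosody_predictor(text_feats, prosodic)

    # HiFi-GAN decoder renders the final waveform
    return decoder(align(text_feats, durations), f0, energy, acoustic)
```
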
## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```

## License

MIT License

## Links

- GitHub: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- HuggingFace: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)