# Chiluka

Chiluka (చిలుక, Telugu for "parrot") is a lightweight text-to-speech (TTS) inference package based on StyleTTS2, with style transfer from reference audio.
## Available Models

| Model | Name | Languages | Speakers | Description |
|---|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
Model weights are hosted on Hugging Face and downloaded automatically on first use.
## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

System dependency (required):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
## Quick Start

```python
from chiluka import Chiluka

# Load the Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us",
)

# Save to file
tts.save_wav(wav, "output.wav")
```
### Load a Specific Model

```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```
## Examples

### Hindi

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",  # "Hello, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="hi",
)
tts.save_wav(wav, "hindi_output.wav")
```

### English

```python
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us",
)
tts.save_wav(wav, "english_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",  # "Greetings, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="te",
)
tts.save_wav(wav, "telugu_output.wav")
```
## Streaming Audio

For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as raw bytes or chunked streams without writing to disk.

### Get Audio Bytes

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")

# MP3 bytes (requires `pip install pydub` and an ffmpeg installation)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")

# Raw PCM bytes (16-bit signed integers, e.g. for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")

# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```
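For context on the `"pcm"` format: 16-bit signed PCM is just the float waveform scaled to the int16 range and serialized. A minimal sketch of that conversion, assuming (as is common for TTS output) that `synthesize()` returns a float NumPy array in [-1.0, 1.0]; the helper below is illustrative, not part of the package:

```python
import numpy as np

def float_to_pcm16(wav: np.ndarray) -> bytes:
    """Convert a float waveform in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    clipped = np.clip(wav, -1.0, 1.0)          # guard against out-of-range samples
    return (clipped * 32767).astype("<i2").tobytes()

# One second of audio at 24 kHz -> 48000 bytes (2 bytes per sample)
silence = np.zeros(24000, dtype=np.float32)
assert len(float_to_pcm16(silence)) == 48000
```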
### Stream Audio Chunks

```python
# Stream PCM chunks over a WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for an HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200 ms at 24 kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```
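Since `chunk_size` is measured in samples, chunk duration and per-chunk PCM payload follow directly from the sample rate. A quick sanity check of the defaults quoted above:

```python
SR = 24000          # samples per second (package default)
CHUNK_SIZE = 4800   # samples per chunk (synthesize_stream default)

chunk_ms = CHUNK_SIZE / SR * 1000     # duration of one chunk in milliseconds
pcm_bytes_per_chunk = CHUNK_SIZE * 2  # 16-bit PCM = 2 bytes per sample

print(chunk_ms)              # 200.0
print(pcm_bytes_per_chunk)   # 9600
```

So halving `chunk_size` to 2400 (as in the last example) yields 100 ms chunks, trading a little overhead for lower streaming latency.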
## API Reference

### Chiluka.from_pretrained()

```python
tts = Chiluka.from_pretrained(
    model="hindi_english",   # "hindi_english" or "telugu"
    device="cuda",           # "cuda" or "cpu" (auto-detected if None)
    force_download=False,    # Re-download weights even if cached
)
```

### synthesize()

```python
wav = tts.synthesize(
    text="Hello world",          # Text to synthesize
    reference_audio="ref.wav",   # Reference audio for style transfer
    language="en-us",            # espeak-ng language code
    alpha=0.3,                   # Acoustic style mixing (0-1)
    beta=0.7,                    # Prosodic style mixing (0-1)
    diffusion_steps=5,           # Quality vs. speed tradeoff
    embedding_scale=1.0,         # Classifier-free guidance scale
    sr=24000,                    # Sample rate in Hz
)
```
### to_audio_bytes()

```python
audio_bytes = tts.to_audio_bytes(
    wav,              # NumPy array returned by synthesize()
    format="mp3",     # "wav", "mp3", "ogg", "flac", or "pcm"
    sr=24000,         # Sample rate in Hz
    bitrate="128k",   # Bitrate for mp3/ogg
)
```

### synthesize_stream()

```python
for chunk in tts.synthesize_stream(
    text="Hello world",          # Text to synthesize
    reference_audio="ref.wav",   # Reference audio for style transfer
    language="en-us",            # espeak-ng language code
    format="pcm",                # "pcm", "wav", "mp3", or "ogg"
    chunk_size=4800,             # Samples per chunk (200 ms at 24 kHz)
    sr=24000,                    # Sample rate in Hz
):
    process(chunk)
```
### Other Methods

```python
tts.save_wav(wav, "output.wav")              # Save to a WAV file
tts.play(wav)                                # Play through speakers (requires pyaudio)
style = tts.compute_style("reference.wav")   # Compute a style embedding
```
## Synthesis Parameters

| Parameter | Default | Description |
|---|---|---|
| `alpha` | 0.3 | Acoustic style mixing (0 = reference only, 1 = predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0 = reference only, 1 = predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |
## Language Codes

These are espeak-ng language codes passed to the `language` parameter:

| Language | Code | Available In |
|---|---|---|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
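Because each model only supports a subset of these codes, it can be useful to fail fast on a bad model/language pairing before running synthesis. A hypothetical helper (not part of chiluka) encoding the table above:

```python
# Supported language codes per model, per the table above.
SUPPORTED_LANGUAGES = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}

def check_language(model: str, language: str) -> None:
    """Raise ValueError if `language` is not supported by `model`."""
    if language not in SUPPORTED_LANGUAGES[model]:
        raise ValueError(f"{language!r} is not supported by the {model!r} model")

check_language("telugu", "te")    # OK
# check_language("telugu", "hi")  # would raise ValueError
```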
## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG output)
## Credits

Based on StyleTTS2 by Yinghao Aaron Li et al.

## License

MIT License