
Chiluka

Chiluka (చిలుక, Telugu for "parrot") is a lightweight text-to-speech (TTS) inference package based on StyleTTS2, with style transfer from reference audio.

Available Models

| Model | Name | Languages | Speakers | Description |
|---|---|---|---|---|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Model weights are hosted on HuggingFace and downloaded automatically on first use.

Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

System dependency (required):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```

Quick Start

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)

# Save to file
tts.save_wav(wav, "output.wav")
```

Load a Specific Model

```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```

Examples

Hindi

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

English

```python
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us"
)
tts.save_wav(wav, "english_output.wav")
```

Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

Streaming Audio

For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.

Get Audio Bytes

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")

# MP3 bytes (requires `pip install pydub` and a system ffmpeg)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")

# Raw PCM bytes (16-bit signed int, for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")

# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```
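
For a sense of what the `pcm` format contains: `synthesize()` returns a NumPy float array, and mapping it to 16-bit PCM amounts to scaling and casting, roughly as in this sketch (`float_to_pcm16` is an illustrative helper, not part of Chiluka's API):

```python
import numpy as np

def float_to_pcm16(wav: np.ndarray) -> bytes:
    """Convert float samples in [-1, 1] to 16-bit signed little-endian PCM."""
    clipped = np.clip(wav, -1.0, 1.0)      # guard against out-of-range samples
    return (clipped * 32767).astype("<i2").tobytes()

wav = np.array([0.0, 0.5, -0.5, 1.0], dtype=np.float32)
pcm = float_to_pcm16(wav)                  # 4 samples * 2 bytes = 8 bytes
```

Little-endian `int16` is the sample layout WebRTC and most WAV tooling expect for 16-bit audio.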

Stream Audio Chunks

```python
# Stream PCM chunks over WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200ms at 24kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```
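
The chunk-size arithmetic is simple: each chunk lasts `chunk_size / sr` seconds, so 4800 samples at 24 kHz is 200 ms. The kind of fixed-size slicing involved can be sketched like this (`iter_chunks` is a hypothetical helper, not Chiluka's implementation):

```python
import numpy as np

def iter_chunks(wav: np.ndarray, chunk_size: int = 4800):
    """Yield successive fixed-size slices; the last chunk may be shorter."""
    for start in range(0, len(wav), chunk_size):
        yield wav[start:start + chunk_size]

sr = 24000
wav = np.zeros(24000, dtype=np.float32)            # 1 second of audio at 24 kHz
chunks = list(iter_chunks(wav, chunk_size=4800))   # 24000 / 4800 = 5 chunks of 200 ms
```

Halving `chunk_size` to 2400 halves per-chunk latency to 100 ms at the cost of twice as many sends.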

API Reference

Chiluka.from_pretrained()

```python
tts = Chiluka.from_pretrained(
    model="hindi_english",      # "hindi_english" or "telugu"
    device="cuda",              # "cuda" or "cpu" (auto-detects if None)
    force_download=False,       # Re-download even if cached
)
```

synthesize()

```python
wav = tts.synthesize(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    alpha=0.3,                    # Acoustic style mixing (0-1)
    beta=0.7,                     # Prosodic style mixing (0-1)
    diffusion_steps=5,            # Quality vs speed tradeoff
    embedding_scale=1.0,          # Classifier-free guidance
    sr=24000                      # Sample rate
)
```

to_audio_bytes()

```python
audio_bytes = tts.to_audio_bytes(
    wav,                          # NumPy array from synthesize()
    format="mp3",                 # "wav", "mp3", "ogg", "flac", "pcm"
    sr=24000,                     # Sample rate
    bitrate="128k"                # Bitrate for mp3/ogg
)
```

synthesize_stream()

```python
for chunk in tts.synthesize_stream(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    format="pcm",                 # "pcm", "wav", "mp3", "ogg"
    chunk_size=4800,              # Samples per chunk (200ms at 24kHz)
    sr=24000,                     # Sample rate
):
    process(chunk)
```

Other Methods

```python
tts.save_wav(wav, "output.wav")                 # Save to WAV file
tts.play(wav)                                   # Play via speakers (requires pyaudio)
style = tts.compute_style("reference.wav")      # Get style embedding
```
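
For reference, writing a 16-bit mono WAV at 24 kHz needs only the standard library; a rough sketch of what saving involves (`save_wav_sketch` is illustrative, not Chiluka's actual implementation):

```python
import wave
import numpy as np

def save_wav_sketch(wav: np.ndarray, path: str, sr: int = 24000) -> None:
    """Write mono float audio in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(wav, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sr)      # 24 kHz
        f.writeframes(pcm)
```

This can be handy when post-processing `wav` with other tools before saving.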

Synthesis Parameters

| Parameter | Default | Description |
|---|---|---|
| `alpha` | 0.3 | Acoustic style mixing (0 = reference only, 1 = predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0 = reference only, 1 = predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |

Language Codes

These are espeak-ng language codes passed to the language parameter:

| Language | Code | Available In |
|---|---|---|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |

Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG output)
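
Since a missing espeak-ng binary would otherwise only surface as an error at synthesis time, it can be worth checking for it up front. A small standard-library sketch (`has_espeak_ng` is a hypothetical helper, not part of Chiluka):

```python
import shutil

def has_espeak_ng() -> bool:
    """Return True if the espeak-ng binary is on PATH."""
    return shutil.which("espeak-ng") is not None

if not has_espeak_ng():
    print("espeak-ng not found; install it before using Chiluka")
```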

Credits

Based on StyleTTS2 by Yinghao Aaron Li et al.

License

MIT License