# Chiluka

**Chiluka** (చిలుక, Telugu for "parrot") is a lightweight TTS (text-to-speech) inference package based on StyleTTS2, with style transfer from reference audio.
## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Model weights are hosted on [HuggingFace](https://huggingface.co/Seemanth/chiluka) and downloaded automatically on first use.
## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

The `espeak-ng` system package is also required:

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
## Quick Start

```python
from chiluka import Chiluka

# Load the Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us",
)

# Save to file
tts.save_wav(wav, "output.wav")
```

### Load a Specific Model

```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```
## Examples

### Hindi

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",  # "Hello, this is Chiluka speaking"
    reference_audio="reference.wav",
    language="hi",
)
tts.save_wav(wav, "hindi_output.wav")
```

### English

```python
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us",
)
tts.save_wav(wav, "english_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",  # "Greetings, this is Chiluka speaking"
    reference_audio="reference.wav",
    language="te",
)
tts.save_wav(wav, "telugu_output.wav")
```
## Streaming Audio

For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.

### Get Audio Bytes

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")

# MP3 bytes (requires: pip install pydub, plus ffmpeg installed)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")

# Raw PCM bytes (16-bit signed int, for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")

# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```
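The raw PCM format is plain 16-bit signed samples. As a rough sketch of what that conversion amounts to (not Chiluka's actual implementation), turning float samples in [-1, 1] into PCM bytes looks like:

```python
import numpy as np

def float_to_pcm16(wav: np.ndarray) -> bytes:
    # Clip to [-1, 1], scale to the 16-bit signed range, and serialize.
    clipped = np.clip(wav, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16).tobytes()

pcm = float_to_pcm16(np.array([0.0, 0.5, -1.0], dtype=np.float32))
```

Each input sample becomes two bytes, which is why 24 kHz PCM streams at 48 kB per second of audio.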
### Stream Audio Chunks

```python
# Stream PCM chunks over a WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for an HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200 ms at 24 kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```
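Chunk size is expressed in samples, so per-chunk latency is `chunk_size / sr` seconds. A small helper (hypothetical, not part of the Chiluka API) converts a target chunk duration into a `chunk_size`:

```python
def chunk_samples(duration_ms: int, sr: int = 24000) -> int:
    # samples per chunk = sample rate * chunk duration in seconds
    return sr * duration_ms // 1000

default_chunk = chunk_samples(200)      # 4800, the default (200 ms at 24 kHz)
low_latency_chunk = chunk_samples(100)  # 2400, halving per-chunk latency
```

Smaller chunks reduce latency but increase per-message overhead on the transport.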
## API Reference

### Chiluka.from_pretrained()

```python
tts = Chiluka.from_pretrained(
    model="hindi_english",   # "hindi_english" or "telugu"
    device="cuda",           # "cuda" or "cpu" (auto-detects if None)
    force_download=False,    # Re-download even if cached
)
```

### synthesize()

```python
wav = tts.synthesize(
    text="Hello world",          # Text to synthesize
    reference_audio="ref.wav",   # Reference audio for style
    language="en-us",            # Language code
    alpha=0.3,                   # Acoustic style mixing (0-1)
    beta=0.7,                    # Prosodic style mixing (0-1)
    diffusion_steps=5,           # Quality vs. speed tradeoff
    embedding_scale=1.0,         # Classifier-free guidance
    sr=24000,                    # Sample rate
)
```

### to_audio_bytes()

```python
audio_bytes = tts.to_audio_bytes(
    wav,              # NumPy array from synthesize()
    format="mp3",     # "wav", "mp3", "ogg", "flac", "pcm"
    sr=24000,         # Sample rate
    bitrate="128k",   # Bitrate for mp3/ogg
)
```

### synthesize_stream()

```python
for chunk in tts.synthesize_stream(
    text="Hello world",          # Text to synthesize
    reference_audio="ref.wav",   # Reference audio for style
    language="en-us",            # Language code
    format="pcm",                # "pcm", "wav", "mp3", "ogg"
    chunk_size=4800,             # Samples per chunk (200 ms at 24 kHz)
    sr=24000,                    # Sample rate
):
    process(chunk)
```

### Other Methods

```python
tts.save_wav(wav, "output.wav")             # Save to a WAV file
tts.play(wav)                               # Play through speakers (requires pyaudio)
style = tts.compute_style("reference.wav")  # Get the style embedding
```
## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `alpha` | 0.3 | Acoustic style mixing (0 = reference only, 1 = predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0 = reference only, 1 = predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |
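To make the mixing direction concrete, here are two illustrative presets (the names and values are examples, not shipped with Chiluka): lower `alpha`/`beta` stays closer to the reference speaker, while higher values lean on the model's own predictions.

```python
# Illustrative presets for the parameters above; pass to synthesize() via **preset.
faithful = dict(alpha=0.1, beta=0.3, diffusion_steps=10)     # hug the reference style
expressive = dict(alpha=0.7, beta=0.9, embedding_scale=1.5)  # lean on model predictions
```

For example: `tts.synthesize(text, reference_audio="ref.wav", language="en-us", **faithful)`.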
## Language Codes

These are espeak-ng language codes passed to the `language` parameter:

| Language | Code | Available In |
|----------|------|--------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
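The table above can be encoded as a lookup to validate requests before synthesis. This helper is a sketch derived from the table, not part of the Chiluka API:

```python
# Supported language codes per model, per the table above.
SUPPORTED = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}

def supports(model: str, language: str) -> bool:
    # True if the given model accepts the given espeak-ng language code.
    return language in SUPPORTED.get(model, set())
```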
## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG/FLAC output)

## Credits

Based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) by Yinghao Aaron Li et al.

## License

MIT License