# Chiluka

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight TTS (Text-to-Speech) inference package based on StyleTTS2 with style transfer from reference audio.

## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

Model weights are hosted on [HuggingFace](https://huggingface.co/Seemanth/chiluka) and downloaded automatically on first use.

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

System dependency (required):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```

## Quick Start

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)

# Save to file
tts.save_wav(wav, "output.wav")
```

### Load a Specific Model

```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```

## Examples

### Hindi

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

### English

```python
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us"
)
tts.save_wav(wav, "english_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.

### Get Audio Bytes

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")

# MP3 bytes (requires: pip install pydub, and ffmpeg installed)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")

# Raw PCM bytes (16-bit signed int, for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")

# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```

### Stream Audio Chunks

```python
# Stream PCM chunks over WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200ms at 24kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```

## API Reference

### Chiluka.from_pretrained()

```python
tts = Chiluka.from_pretrained(
    model="hindi_english",   # "hindi_english" or "telugu"
    device="cuda",           # "cuda" or "cpu" (auto-detects if None)
    force_download=False,    # Re-download even if cached
)
```

### synthesize()

```python
wav = tts.synthesize(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    alpha=0.3,                    # Acoustic style mixing (0-1)
    beta=0.7,                     # Prosodic style mixing (0-1)
    diffusion_steps=5,            # Quality vs speed tradeoff
    embedding_scale=1.0,          # Classifier-free guidance
    sr=24000                      # Sample rate
)
```

### to_audio_bytes()

```python
audio_bytes = tts.to_audio_bytes(
    wav,               # Numpy array from synthesize()
    format="mp3",      # "wav", "mp3", "ogg", "flac", "pcm"
    sr=24000,          # Sample rate
    bitrate="128k"     # Bitrate for mp3/ogg
)
```

### synthesize_stream()

```python
for chunk in tts.synthesize_stream(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    format="pcm",                 # "pcm", "wav", "mp3", "ogg"
    chunk_size=4800,              # Samples per chunk (200ms at 24kHz)
    sr=24000,                     # Sample rate
):
    process(chunk)
```

### Other Methods

```python
tts.save_wav(wav, "output.wav")              # Save to WAV file
tts.play(wav)                                # Play via speakers (requires pyaudio)
style = tts.compute_style("reference.wav")   # Get style embedding
```

## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `alpha` | 0.3 | Acoustic style mixing (0=reference only, 1=predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0=reference only, 1=predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |

## Language Codes

These are espeak-ng language codes passed to the `language` parameter:

| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)

## Credits

Based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) by Yinghao Aaron Li et al.

## License

MIT License
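
## Appendix: Streaming Chunk Sizes

The streaming section states that the default `chunk_size` of 4800 samples corresponds to 200ms at 24kHz, and that the `pcm` format is 16-bit signed. The sketch below works through that arithmetic so you can pick a chunk size for a latency target. The helper names `chunk_stats` and `num_chunks` are hypothetical and not part of Chiluka. The byte counts assume mono audio, and the chunk count assumes the final chunk may be shorter than the rest; neither assumption is confirmed by the package itself.

```python
def chunk_stats(chunk_size=4800, sr=24000, bytes_per_sample=2):
    """Return (duration in ms, size in bytes) of one PCM chunk.

    Assumes mono audio; bytes_per_sample=2 matches 16-bit signed PCM.
    """
    duration_ms = 1000.0 * chunk_size / sr
    size_bytes = chunk_size * bytes_per_sample
    return duration_ms, size_bytes


def num_chunks(total_samples, chunk_size=4800):
    """Number of chunks needed to cover total_samples.

    Uses ceiling division, assuming a shorter final chunk is allowed.
    """
    return -(-total_samples // chunk_size)


print(chunk_stats())           # default: (200.0, 9600)
print(chunk_stats(2400))       # half the default: (100.0, 4800)
print(num_chunks(24000 * 3))   # 3 seconds of audio at 24kHz: 15
```

Smaller chunks lower the per-chunk latency but increase the number of messages sent per second of audio, so a value such as 2400 may suit interactive use while the default suits general streaming.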