---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---

# Chiluka TTS

**Chiluka** (చిలుక, Telugu for "parrot") is a lightweight, self-contained text-to-speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2). It supports **style transfer from reference audio**: give it a voice sample and it will speak in that style.

## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

## Installation

```bash
pip install chiluka
```

Or from GitHub:

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

**System dependency** (required for phonemization):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```

## Quick Start

```python
from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)

# Save output
tts.save_wav(wav, "output.wav")
```

## Choose a Model

```python
from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```

## Hindi Example

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

## Telugu Example

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## PyTorch Hub

```python
import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")
```

## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | More steps = better quality, slower inference |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |

## How It Works

Chiluka uses a StyleTTS2-based pipeline:

1. **Text** is converted to phonemes using espeak-ng
2. **PL-BERT** encodes the text into contextual embeddings
3. **Reference audio** is processed to extract a style vector
4. **Diffusion model** samples a style conditioned on the text
5. **Prosody predictor** generates duration, pitch (F0), and energy
6. **HiFi-GAN decoder** synthesizes the final waveform at 24 kHz

## Model Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)

## File Structure

```
├── configs/
│   ├── config_ft.yml              # Telugu model config
│   └── config_hindi_english.yml   # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth        # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth        # Hindi-English checkpoint (~2GB)
├── pretrained/                    # Shared pretrained sub-models
│   ├── ASR/                       # Text-to-mel alignment
│   ├── JDC/                       # Pitch extraction (F0)
│   └── PLBERT/                    # Text encoder
├── models/                        # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                   # Main API
├── hub.py                         # HuggingFace Hub utilities
└── text_utils.py                  # Phoneme tokenization
```

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended (works on CPU too)
- espeak-ng system package

## Limitations

- Requires a reference audio file for style/voice transfer
- Output quality depends on the quality of the reference audio
- Best results with 3-15 second reference clips
- The Hindi-English model was trained on 5 speakers; the Telugu model on 1 speaker

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2024styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raber, Vinay S and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2024}
}
```

## License

MIT License

## Links

- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)
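## Appendix: What `alpha` and `beta` Control

The `alpha` and `beta` parameters in the synthesis-parameter table blend the style extracted from the reference audio with the style sampled by the diffusion model. As a conceptual illustration only (a minimal sketch in plain Python — `mix_styles` is a hypothetical helper for this document, not part of the chiluka API, and real style vectors are 128-dimensional tensors):

```python
def mix_styles(reference, predicted, weight):
    """Linear interpolation between two style vectors.

    weight=0.0 -> pure reference style, weight=1.0 -> pure predicted style.
    """
    return [(1 - weight) * r + weight * p for r, p in zip(reference, predicted)]

# Toy 3-dimensional style vectors for illustration
ref = [1.0, 0.0, 2.0]   # style extracted from the reference audio
pred = [0.0, 4.0, 2.0]  # style sampled by the diffusion model

# alpha's default of 0.3 keeps the result close to the reference voice
print(mix_styles(ref, pred, 0.3))  # → [0.7, 1.2, 2.0]
```

Lower values keep the output closer to the reference (hence the defaults: `alpha=0.3` preserves the reference's acoustic identity, while `beta=0.7` leans toward predicted prosody for more natural phrasing).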