|
|
--- |
|
|
language: |
|
|
- en |
|
|
- hi |
|
|
- te |
|
|
license: mit |
|
|
library_name: chiluka |
|
|
pipeline_tag: text-to-speech |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- styletts2 |
|
|
- voice-cloning |
|
|
- multi-language |
|
|
- hindi |
|
|
- english |
|
|
- telugu |
|
|
- multi-speaker |
|
|
- style-transfer |
|
|
--- |
|
|
|
|
|
# Chiluka TTS |
|
|
|
|
|
**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2). |
|
|
|
|
|
It supports **style transfer from reference audio**: give it a voice sample, and it will speak in that style.
|
|
|
|
|
## Available Models |
|
|
|
|
|
| Model | Name | Languages | Speakers | Description | |
|
|
|-------|------|-----------|----------|-------------| |
|
|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS | |
|
|
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS | |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install chiluka |
|
|
``` |
|
|
|
|
|
Or from GitHub: |
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/PurviewVoiceBot/chiluka.git |
|
|
``` |
|
|
|
|
|
**System dependency** (required for phonemization): |
|
|
|
|
|
```bash |
|
|
# Ubuntu/Debian |
|
|
sudo apt-get install espeak-ng |
|
|
|
|
|
# macOS |
|
|
brew install espeak-ng |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from chiluka import Chiluka |
|
|
|
|
|
# Load model (weights download automatically on first use) |
|
|
tts = Chiluka.from_pretrained() |
|
|
|
|
|
# Synthesize speech |
|
|
wav = tts.synthesize( |
|
|
text="Hello, this is Chiluka speaking!", |
|
|
reference_audio="path/to/reference.wav", |
|
|
language="en" |
|
|
) |
|
|
|
|
|
# Save output |
|
|
tts.save_wav(wav, "output.wav") |
|
|
``` |
|
|
|
|
|
## Choose a Model |
|
|
|
|
|
```python |
|
|
from chiluka import Chiluka |
|
|
|
|
|
# Hindi + English (default) |
|
|
tts = Chiluka.from_pretrained(model="hindi_english") |
|
|
|
|
|
# Telugu + English |
|
|
tts = Chiluka.from_pretrained(model="telugu") |
|
|
``` |
|
|
|
|
|
## Hindi Example |
|
|
|
|
|
```python |
|
|
tts = Chiluka.from_pretrained() |
|
|
|
|
|
wav = tts.synthesize( |
|
|
text="नमस्ते, मैं चिलुका बोल रहा हूं", |
|
|
reference_audio="reference.wav", |
|
|
language="hi" |
|
|
) |
|
|
tts.save_wav(wav, "hindi_output.wav") |
|
|
``` |
|
|
|
|
|
## Telugu Example |
|
|
|
|
|
```python |
|
|
tts = Chiluka.from_pretrained(model="telugu") |
|
|
|
|
|
wav = tts.synthesize( |
|
|
text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను", |
|
|
reference_audio="reference.wav", |
|
|
language="te" |
|
|
) |
|
|
tts.save_wav(wav, "telugu_output.wav") |
|
|
``` |
|
|
|
|
|
## PyTorch Hub |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
# Hindi-English (default) |
|
|
tts = torch.hub.load('Seemanth/chiluka', 'chiluka') |
|
|
|
|
|
# Telugu |
|
|
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu') |
|
|
|
|
|
wav = tts.synthesize("Hello!", "reference.wav", language="en") |
|
|
``` |
|
|
|
|
|
## Synthesis Parameters |
|
|
|
|
|
| Parameter | Default | Description | |
|
|
|-----------|---------|-------------| |
|
|
| `text` | required | Input text to synthesize | |
|
|
| `reference_audio` | required | Path to reference audio for voice style | |
|
|
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) | |
|
|
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) | |
|
|
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) | |
|
|
| `diffusion_steps` | `5` | More steps = better quality, slower inference | |
|
|
| `embedding_scale` | `1.0` | Classifier-free guidance strength | |
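The `alpha` and `beta` knobs each blend two style sources: the style extracted from the reference audio and the style sampled by the diffusion model. Conceptually this is a per-vector linear interpolation; the sketch below is illustrative only — the actual split into acoustic and prosodic style halves is internal to StyleTTS2:

```python
def mix_styles(reference, predicted, weight):
    """Linear interpolation between two style vectors.
    weight=0 keeps only the reference style; weight=1 keeps only the
    diffusion-predicted style."""
    return [(1 - weight) * r + weight * p for r, p in zip(reference, predicted)]

ref_style, pred_style = [1.0, 0.0], [0.0, 1.0]

# alpha default 0.3: acoustic style leans toward the reference voice.
acoustic = mix_styles(ref_style, pred_style, weight=0.3)

# beta default 0.7: prosodic style leans toward the predicted prosody.
prosodic = mix_styles(ref_style, pred_style, weight=0.7)
```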
|
|
|
|
|
## How It Works |
|
|
|
|
|
Chiluka uses a StyleTTS2-based pipeline: |
|
|
|
|
|
1. **Text** is converted to phonemes using espeak-ng |
|
|
2. **PL-BERT** encodes text into contextual embeddings |
|
|
3. **Reference audio** is processed to extract a style vector |
|
|
4. **Diffusion model** samples a style conditioned on text |
|
|
5. **Prosody predictor** generates duration, pitch (F0), and energy |
|
|
6. **HiFi-GAN decoder** synthesizes the final waveform at 24 kHz
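The steps above can be sketched as a chain of stages. Every function body below is a placeholder to show the data flow, not Chiluka's real internals — the real pipeline uses espeak-ng, PL-BERT, a style encoder, a diffusion sampler, and HiFi-GAN in place of these stubs:

```python
def synthesize_sketch(text, reference_audio, diffusion_steps=5):
    # 1. Phonemize the text (espeak-ng in the real pipeline).
    phonemes = list(text)                        # placeholder phonemization
    # 2. Contextual embeddings per phoneme (PL-BERT in the real pipeline).
    embeddings = [[float(ord(p))] for p in phonemes]
    # 3. Extract a style vector from the reference clip.
    style = [0.0] * 4                            # placeholder style encoder
    # 4. Refine a text-conditioned style via iterative diffusion sampling.
    for _ in range(diffusion_steps):
        style = [s * 0.9 for s in style]         # placeholder denoising step
    # 5. Predict one (duration, F0, energy) triple per phoneme.
    prosody = [(1, 100.0, 1.0) for _ in phonemes]
    # 6. "Decode" a waveform: one 10 ms frame (240 samples at 24 kHz)
    #    per unit of predicted duration.
    n_samples = sum(d for d, _, _ in prosody) * 240
    return [0.0] * n_samples

wav = synthesize_sketch("hello", "reference.wav")
print(len(wav))  # 5 phoneme placeholders -> 1200 samples
```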
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Text Encoder**: Token embedding + CNN + BiLSTM |
|
|
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128) |
|
|
- **Prosody Predictor**: LSTM-based with AdaIN normalization |
|
|
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler |
|
|
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2) |
|
|
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch) |
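AdaIN (adaptive instance normalization), used by the prosody predictor, normalizes a feature sequence and then re-scales and re-shifts it with parameters derived from the style vector. A minimal NumPy sketch — the direct gain/bias split here stands in for the learned projection from the 128-dim style vector:

```python
import numpy as np

def adain(features, style, eps=1e-5):
    """features: (T, C) sequence; style: (2*C,) holding per-channel gain and bias."""
    c = features.shape[1]
    gain, bias = style[:c], style[c:]
    # Instance normalization: zero mean, unit variance per channel over time.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    normed = (features - mean) / (std + eps)
    # Re-style: in the real model, gain/bias come from a learned linear
    # projection of the style vector, not from the vector directly.
    return normed * gain + bias

x = np.random.randn(50, 4)
style = np.concatenate([np.full(4, 2.0), np.full(4, 0.5)])  # gain=2, bias=0.5
y = adain(x, style)
print(y.mean(axis=0).round(3))  # per-channel mean lands on the bias
```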
|
|
|
|
|
## File Structure |
|
|
|
|
|
``` |
|
|
├── configs/ |
|
|
│ ├── config_ft.yml # Telugu model config |
|
|
│ └── config_hindi_english.yml # Hindi-English model config |
|
|
├── checkpoints/ |
|
|
│ ├── epoch_2nd_00017.pth # Telugu checkpoint (~2GB) |
|
|
│ └── epoch_2nd_00029.pth # Hindi-English checkpoint (~2GB) |
|
|
├── pretrained/ # Shared pretrained sub-models |
|
|
│ ├── ASR/ # Text-to-mel alignment |
|
|
│ ├── JDC/ # Pitch extraction (F0) |
|
|
│ └── PLBERT/ # Text encoder |
|
|
├── models/ # Model architecture code |
|
|
│ ├── core.py |
|
|
│ ├── hifigan.py |
|
|
│ └── diffusion/ |
|
|
├── inference.py # Main API |
|
|
├── hub.py # HuggingFace Hub utilities |
|
|
└── text_utils.py # Phoneme tokenization |
|
|
``` |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python >= 3.8 |
|
|
- PyTorch >= 1.13.0 |
|
|
- CUDA-capable GPU recommended (CPU inference also works)
|
|
- espeak-ng system package |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Requires a reference audio file for style/voice transfer |
|
|
- Output quality depends on the quality of the reference clip
|
|
- Best results with 3–15 second reference clips
|
|
- The Hindi-English model was trained on only 5 speakers
|
|
- The Telugu model was trained on a single speaker
|
|
|
|
|
## Citation |
|
|
|
|
|
Based on StyleTTS2: |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{li2024styletts, |
|
|
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, |
|
|
author={Li, Yinghao Aaron and Han, Cong and Raber, Vinay S and Mesgarani, Nima}, |
|
|
booktitle={NeurIPS}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License |
|
|
|
|
|
## Links |
|
|
|
|
|
- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka) |
|
|
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/) |
|
|
|