---
language:
- en
- hi
- te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- styletts2
- voice-cloning
- multi-language
- hindi
- english
- telugu
- multi-speaker
- style-transfer
---

# Chiluka TTS

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.

## Available Models

| Model | Name | Languages | Speakers |
|-------|------|-----------|----------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 |
| **Telugu** | `telugu` | Telugu, English | 1 |

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git

# Required system dependency
sudo apt-get install espeak-ng  # Ubuntu/Debian
```
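
Because espeak-ng is a system package rather than a Python dependency, `pip` will not report it as missing. The following is a minimal pre-flight check, assuming the binary is installed under its usual name `espeak-ng`. The helper `has_espeak_ng` is hypothetical and not part of Chiluka.

```python
import shutil


def has_espeak_ng(binary: str = "espeak-ng") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if has_espeak_ng():
        print("espeak-ng found")
    else:
        print("espeak-ng not found; install it before loading a model")
```

Running this once after installation can catch a missing system dependency before the first synthesis call.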

## Usage

Model weights download automatically on first use.

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Or Telugu model
# tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)
tts.save_wav(wav, "output.wav")
```

### Hindi

```python
from chiluka import Chiluka

tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

### Telugu

```python
from chiluka import Chiluka

tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For WebRTC, WebSocket, or HTTP streaming:

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# Get audio as bytes (no disk write)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")  # requires pydub + ffmpeg
wav_bytes = tts.to_audio_bytes(wav, format="wav")
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")  # raw 16-bit PCM

# Stream chunked audio
# `websocket` stands in for your own connection object
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)  # PCM chunks by default

# Stream as MP3 chunks
# `response` stands in for your own HTTP response object
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)
```
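
The `pcm` format above is described as raw 16-bit PCM. As a rough sketch of what such a conversion involves, assuming the waveform is a sequence of floats in [-1, 1], the snippet below packs samples into little-endian 16-bit integers and slices the result into fixed-size chunks. The helpers `float_to_pcm16` and `iter_chunks` are hypothetical and do not describe Chiluka's actual implementation.

```python
import struct
from typing import Iterator, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Clip floats to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_chunks(data: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield fixed-size slices of a byte string; the last one may be shorter."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
```

Each sample occupies two bytes, so a chunk size that is a multiple of two keeps samples from being split across chunks.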

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en-us"` | espeak-ng language code (see below) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
| `diffusion_steps` | `5` | More steps = better quality, slower |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
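
In the reference StyleTTS2 inference code, `alpha` and `beta` act as linear interpolation weights between the style vector extracted from the reference audio and the one predicted by the diffusion model. Assuming Chiluka follows the same convention (the "0 = reference, 1 = predicted" wording in the table suggests it does), the mixing can be sketched as below. `mix_style` and the toy two-dimensional vectors are illustrative only and not part of the library.

```python
from typing import List


def mix_style(reference: List[float], predicted: List[float], weight: float) -> List[float]:
    """Linearly interpolate: weight=0 returns reference, weight=1 returns predicted."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    if len(reference) != len(predicted):
        raise ValueError("style vectors must have the same length")
    return [weight * p + (1.0 - weight) * r for r, p in zip(reference, predicted)]


# Toy style vectors; real ones would have style_dim entries.
ref_style = [1.0, 0.0]
pred_style = [0.0, 1.0]

acoustic = mix_style(ref_style, pred_style, 0.3)  # alpha default: closer to reference
prosodic = mix_style(ref_style, pred_style, 0.7)  # beta default: closer to predicted
```

With the defaults, the acoustic style stays close to the reference voice while the prosody leans toward what the model predicts.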

## Language Codes

| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
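
Since Hindi and Telugu are each tied to one model, it can help to choose the model from the language code before loading. The sketch below transcribes the table above into a lookup. `LANGUAGE_TO_MODELS` and `pick_model` are hypothetical names, and preferring the default `hindi_english` model for English is an assumption made here for illustration.

```python
# Mapping transcribed from the Language Codes table; "All models" is expanded
# to both model names, with the default model listed first.
LANGUAGE_TO_MODELS = {
    "en-us": ("hindi_english", "telugu"),
    "en-gb": ("hindi_english", "telugu"),
    "hi": ("hindi_english",),
    "te": ("telugu",),
}


def pick_model(language: str) -> str:
    """Return the first listed model that supports the given language code."""
    try:
        return LANGUAGE_TO_MODELS[language][0]
    except KeyError:
        raise ValueError(f"unsupported language code: {language!r}") from None
```

The result can then be passed on, for example `Chiluka.from_pretrained(model=pick_model("te"))`.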

## Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
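
The decoder's upsample rates multiply together to give the number of audio samples its upsampling stack produces per input frame. The arithmetic is sketched below. The 24 kHz sample rate is an assumption used only to convert that count into milliseconds, since this document does not state the model's sample rate, and any additional resampling elsewhere in the decoder would change the figure.

```python
import math

# Upsample rates as listed for the decoder above.
UPSAMPLE_RATES = (10, 5, 3, 2)

# Assumption for illustration only; the sample rate is not stated in this README.
ASSUMED_SAMPLE_RATE_HZ = 24_000


def total_upsampling(rates) -> int:
    """Multiply the per-stage rates to get samples produced per input frame."""
    return math.prod(rates)


samples_per_frame = total_upsampling(UPSAMPLE_RATES)
frame_ms = 1000.0 * samples_per_frame / ASSUMED_SAMPLE_RATE_HZ
print(samples_per_frame, frame_ms)
```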

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2023styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```

## License

MIT License

## Links

- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **HuggingFace**: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)