---
language:
  - en
  - hi
  - te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - styletts2
  - voice-cloning
  - multi-language
  - hindi
  - english
  - telugu
  - multi-speaker
  - style-transfer
---

# Chiluka TTS

**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).

It supports **style transfer from reference audio**: give it a voice sample, and it will speak in that style.

## Available Models

| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |

## Installation

```bash
pip install chiluka
```

Or from GitHub:

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```

**System dependency** (required for phonemization):

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
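
A missing espeak-ng usually only surfaces later, at phonemization time. A quick way to verify it is installed before running synthesis (a generic PATH check, not part of the chiluka API):

```python
import shutil
import subprocess

def espeak_ng_available() -> bool:
    """Return True if the espeak-ng binary is on PATH."""
    return shutil.which("espeak-ng") is not None

if espeak_ng_available():
    # Print the installed version string
    print(subprocess.run(["espeak-ng", "--version"],
                         capture_output=True, text=True).stdout.strip())
else:
    print("espeak-ng not found; install it before running synthesis")
```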

## Quick Start

```python
from chiluka import Chiluka

# Load model (weights download automatically on first use)
tts = Chiluka.from_pretrained()

# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en"
)

# Save output
tts.save_wav(wav, "output.wav")
```

## Choose a Model

```python
from chiluka import Chiluka

# Hindi + English (default)
tts = Chiluka.from_pretrained(model="hindi_english")

# Telugu + English
tts = Chiluka.from_pretrained(model="telugu")
```

## Hindi Example

```python
tts = Chiluka.from_pretrained()

wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

## Telugu Example

```python
tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## PyTorch Hub

```python
import torch

# Hindi-English (default)
tts = torch.hub.load('Seemanth/chiluka', 'chiluka')

# Telugu
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')

wav = tts.synthesize("Hello!", "reference.wav", language="en")
```

## Synthesis Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps improve quality at the cost of speed |
| `embedding_scale` | `1.0` | Classifier-free guidance strength |
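
`alpha` and `beta` blend the style extracted from the reference audio with the style predicted from the text. A toy sketch of that linear interpolation (the actual mixing happens inside `synthesize`; the vectors and function here are illustrative, not chiluka internals):

```python
def mix_styles(reference, predicted, weight):
    """Linearly interpolate two style vectors.
    weight = 0.0 -> pure reference, weight = 1.0 -> pure predicted."""
    return [(1.0 - weight) * r + weight * p
            for r, p in zip(reference, predicted)]

ref_style = [1.0, 0.0, 0.5]   # toy style vector from the reference audio
pred_style = [0.0, 1.0, 0.5]  # toy style vector predicted from text

# The default alpha=0.3 keeps the acoustic mix close to the reference voice
acoustic = mix_styles(ref_style, pred_style, weight=0.3)
print(acoustic)
```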

## How It Works

Chiluka uses a StyleTTS2-based pipeline:

1. **Text** is converted to phonemes using espeak-ng
2. **PL-BERT** encodes text into contextual embeddings
3. **Reference audio** is processed to extract a style vector
4. **Diffusion model** samples a style conditioned on text
5. **Prosody predictor** generates duration, pitch (F0), and energy
6. **HiFi-GAN decoder** synthesizes the final waveform at 24 kHz
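
The data flow above can be sketched with stub stages. Every function, shape, and return value here is an illustrative placeholder, not the chiluka implementation:

```python
def text_to_phonemes(text):
    # Stage 1: espeak-ng would produce a phoneme string (stub: characters)
    return list(text.lower())

def encode_text(phonemes):
    # Stage 2: PL-BERT contextual embeddings (stub: one vector per phoneme)
    return [[0.0] * 4 for _ in phonemes]

def extract_style(reference_audio_path):
    # Stage 3: style encoder output for the reference clip (stubbed)
    return [0.0] * 8

def sample_style(embeddings, ref_style, steps=5):
    # Stage 4: diffusion sampler conditioned on text, seeded by the reference
    return ref_style

def predict_prosody(embeddings, style):
    # Stage 5: duration / F0 / energy per phoneme (stub: one frame each)
    return {"durations": [1] * len(embeddings)}

def decode(embeddings, prosody, style):
    # Stage 6: HiFi-GAN would emit a waveform; stub returns silence,
    # 300 samples per frame (the product of the vocoder upsample rates)
    return [0.0] * sum(prosody["durations"]) * 300

phonemes = text_to_phonemes("Hello")
emb = encode_text(phonemes)
style = sample_style(emb, extract_style("reference.wav"))
wav = decode(emb, predict_prosody(emb, style), style)
print(len(wav))  # 5 phonemes * 1 frame * 300 samples = 1500
```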

## Model Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
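
A quick sanity check on the vocoder numbers, assuming (as is standard for HiFi-GAN) that the upsample rates multiply out to the mel hop length:

```python
upsample_rates = [10, 5, 3, 2]
sample_rate = 24_000  # output sample rate in Hz

# Total upsampling factor = samples generated per mel frame
hop_length = 1
for rate in upsample_rates:
    hop_length *= rate

print(hop_length)                # 300 samples per mel frame
print(sample_rate / hop_length)  # 80.0 mel frames per second
```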

## File Structure

```
├── configs/
│   ├── config_ft.yml                 # Telugu model config
│   └── config_hindi_english.yml      # Hindi-English model config
├── checkpoints/
│   ├── epoch_2nd_00017.pth           # Telugu checkpoint (~2GB)
│   └── epoch_2nd_00029.pth           # Hindi-English checkpoint (~2GB)
├── pretrained/                       # Shared pretrained sub-models
│   ├── ASR/                          # Text-to-mel alignment
│   ├── JDC/                          # Pitch extraction (F0)
│   └── PLBERT/                       # Text encoder
├── models/                           # Model architecture code
│   ├── core.py
│   ├── hifigan.py
│   └── diffusion/
├── inference.py                      # Main API
├── hub.py                            # HuggingFace Hub utilities
└── text_utils.py                     # Phoneme tokenization
```

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA GPU recommended (CPU inference also works)
- espeak-ng system package

## Limitations

- Requires a reference audio file for style/voice transfer
- Quality depends on the reference audio quality
- Best results with 3-15 second reference clips
- Hindi-English model trained on 5 speakers
- Telugu model trained on 1 speaker
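
Since quality degrades outside the 3-15 second range, it can be worth validating the reference clip before synthesis. A sketch using only the standard library (assumes the reference is an uncompressed WAV file; the helper names are hypothetical, not part of the chiluka API):

```python
import wave

def reference_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def check_reference(path, min_s=3.0, max_s=15.0):
    """Raise if the reference clip is outside the recommended length range."""
    duration = reference_duration_seconds(path)
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"reference is {duration:.1f}s; best results need "
            f"{min_s:.0f}-{max_s:.0f}s clips")
    return duration
```

For example, `check_reference("reference.wav")` would raise on a 30-second clip but pass a 5-second one.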

## Citation

Based on StyleTTS2:

```bibtex
@inproceedings{li2024styletts,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghuman, Vinay S and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```

## License

MIT License

## Links

- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)