|
|
--- |
|
|
language: |
|
|
- en |
|
|
- hi |
|
|
- te |
|
|
license: mit |
|
|
library_name: chiluka |
|
|
pipeline_tag: text-to-speech |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- styletts2 |
|
|
- voice-cloning |
|
|
- multi-language |
|
|
- hindi |
|
|
- english |
|
|
- telugu |
|
|
- multi-speaker |
|
|
- style-transfer |
|
|
--- |
|
|
|
|
|
# Chiluka TTS |
|
|
|
|
|
**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2). |
|
|
|
|
|
It supports **style transfer from reference audio**: give it a voice sample, and it will speak in that style.
|
|
|
|
|
## Available Models |
|
|
|
|
|
| Model | Name | Languages | Speakers | Description | |
|
|
|-------|------|-----------|----------|-------------| |
|
|
| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS | |
|
|
| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS | |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install chiluka |
|
|
``` |
|
|
|
|
|
Or from GitHub: |
|
|
|
|
|
```bash |
|
|
pip install git+https://github.com/PurviewVoiceBot/chiluka.git |
|
|
``` |
|
|
|
|
|
**System dependency** (required for phonemization): |
|
|
|
|
|
```bash |
|
|
# Ubuntu/Debian |
|
|
sudo apt-get install espeak-ng |
|
|
|
|
|
# macOS |
|
|
brew install espeak-ng |
|
|
``` |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```python |
|
|
from chiluka import Chiluka |
|
|
|
|
|
# Load model (weights download automatically on first use) |
|
|
tts = Chiluka.from_pretrained() |
|
|
|
|
|
# Synthesize speech |
|
|
wav = tts.synthesize( |
|
|
text="Hello, this is Chiluka speaking!", |
|
|
reference_audio="path/to/reference.wav", |
|
|
language="en" |
|
|
) |
|
|
|
|
|
# Save output |
|
|
tts.save_wav(wav, "output.wav") |
|
|
``` |
|
|
|
|
|
## Choose a Model |
|
|
|
|
|
```python |
|
|
from chiluka import Chiluka |
|
|
|
|
|
# Hindi + English (default) |
|
|
tts = Chiluka.from_pretrained(model="hindi_english") |
|
|
|
|
|
# Telugu + English |
|
|
tts = Chiluka.from_pretrained(model="telugu") |
|
|
``` |
|
|
|
|
|
## Hindi Example |
|
|
|
|
|
```python |
|
|
tts = Chiluka.from_pretrained() |
|
|
|
|
|
wav = tts.synthesize( |
|
|
text="नमस्ते, मैं चिलुका बोल रहा हूं", |
|
|
reference_audio="reference.wav", |
|
|
language="hi" |
|
|
) |
|
|
tts.save_wav(wav, "hindi_output.wav") |
|
|
``` |
|
|
|
|
|
## Telugu Example |
|
|
|
|
|
```python |
|
|
tts = Chiluka.from_pretrained(model="telugu") |
|
|
|
|
|
wav = tts.synthesize( |
|
|
text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను", |
|
|
reference_audio="reference.wav", |
|
|
language="te" |
|
|
) |
|
|
tts.save_wav(wav, "telugu_output.wav") |
|
|
``` |
|
|
|
|
|
## PyTorch Hub |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
# Hindi-English (default) |
|
|
tts = torch.hub.load('Seemanth/chiluka', 'chiluka') |
|
|
|
|
|
# Telugu |
|
|
tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu') |
|
|
|
|
|
wav = tts.synthesize("Hello!", "reference.wav", language="en") |
|
|
``` |
|
|
|
|
|
## Synthesis Parameters |
|
|
|
|
|
| Parameter | Default | Description | |
|
|
|-----------|---------|-------------| |
|
|
| `text` | required | Input text to synthesize | |
|
|
| `reference_audio` | required | Path to reference audio for voice style | |
|
|
| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) | |
|
|
| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) | |
|
|
| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) | |
|
|
| `diffusion_steps` | `5` | More steps = better quality, slower inference | |
|
|
| `embedding_scale` | `1.0` | Classifier-free guidance strength | |
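The `alpha` and `beta` knobs each blend two style sources: the style extracted from the reference audio and the style sampled by the diffusion model. Conceptually this is a per-vector linear interpolation; the sketch below is illustrative only — the actual split into acoustic and prosodic style halves is internal to StyleTTS2:

```python
def mix_styles(reference, predicted, weight):
    """Linear interpolation between two style vectors.
    weight=0 keeps only the reference style; weight=1 keeps only the
    diffusion-predicted style."""
    return [(1 - weight) * r + weight * p for r, p in zip(reference, predicted)]

ref_style, pred_style = [1.0, 0.0], [0.0, 1.0]

# alpha default 0.3: acoustic style leans toward the reference voice.
acoustic = mix_styles(ref_style, pred_style, weight=0.3)

# beta default 0.7: prosodic style leans toward the predicted prosody.
prosodic = mix_styles(ref_style, pred_style, weight=0.7)
```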
|
|
|
|
|
## How It Works |
|
|
|
|
|
Chiluka uses a StyleTTS2-based pipeline: |
|
|
|
|
|
1. **Text** is converted to phonemes using espeak-ng |
|
|
2. **PL-BERT** encodes text into contextual embeddings |
|
|
3. **Reference audio** is processed to extract a style vector |
|
|
4. **Diffusion model** samples a style conditioned on text |
|
|
5. **Prosody predictor** generates duration, pitch (F0), and energy |
|
|
6. **HiFi-GAN decoder** synthesizes the final waveform at 24 kHz
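The steps above can be sketched as a chain of stages. Every function body below is a placeholder to show the data flow, not Chiluka's real internals — the real pipeline uses espeak-ng, PL-BERT, a style encoder, a diffusion sampler, and HiFi-GAN in place of these stubs:

```python
def synthesize_sketch(text, reference_audio, diffusion_steps=5):
    # 1. Phonemize the text (espeak-ng in the real pipeline).
    phonemes = list(text)                        # placeholder phonemization
    # 2. Contextual embeddings per phoneme (PL-BERT in the real pipeline).
    embeddings = [[float(ord(p))] for p in phonemes]
    # 3. Extract a style vector from the reference clip.
    style = [0.0] * 4                            # placeholder style encoder
    # 4. Refine a text-conditioned style via iterative diffusion sampling.
    for _ in range(diffusion_steps):
        style = [s * 0.9 for s in style]         # placeholder denoising step
    # 5. Predict one (duration, F0, energy) triple per phoneme.
    prosody = [(1, 100.0, 1.0) for _ in phonemes]
    # 6. "Decode" a waveform: one 10 ms frame (240 samples at 24 kHz)
    #    per unit of predicted duration.
    n_samples = sum(d for d, _, _ in prosody) * 240
    return [0.0] * n_samples

wav = synthesize_sketch("hello", "reference.wav")
print(len(wav))  # 5 phoneme placeholders -> 1200 samples
```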
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Text Encoder**: Token embedding + CNN + BiLSTM |
|
|
- **Style Encoder**: Conv2D + Residual blocks (style_dim=128) |
|
|
- **Prosody Predictor**: LSTM-based with AdaIN normalization |
|
|
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler |
|
|
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2) |
|
|
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch) |
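AdaIN (adaptive instance normalization), used by the prosody predictor, normalizes a feature sequence and then re-scales and re-shifts it with parameters derived from the style vector. A minimal NumPy sketch — the direct gain/bias split here stands in for the learned projection from the 128-dim style vector:

```python
import numpy as np

def adain(features, style, eps=1e-5):
    """features: (T, C) sequence; style: (2*C,) holding per-channel gain and bias."""
    c = features.shape[1]
    gain, bias = style[:c], style[c:]
    # Instance normalization: zero mean, unit variance per channel over time.
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    normed = (features - mean) / (std + eps)
    # Re-style: in the real model, gain/bias come from a learned linear
    # projection of the style vector, not from the vector directly.
    return normed * gain + bias

x = np.random.randn(50, 4)
style = np.concatenate([np.full(4, 2.0), np.full(4, 0.5)])  # gain=2, bias=0.5
y = adain(x, style)
print(y.mean(axis=0).round(3))  # per-channel mean lands on the bias
```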
|
|
|
|
|
## File Structure |
|
|
|
|
|
``` |
|
|
├── configs/ |
|
|
│ ├── config_ft.yml # Telugu model config |
|
|
│ └── config_hindi_english.yml # Hindi-English model config |
|
|
├── checkpoints/ |
|
|
│ ├── epoch_2nd_00017.pth # Telugu checkpoint (~2GB) |
|
|
│ └── epoch_2nd_00029.pth # Hindi-English checkpoint (~2GB) |
|
|
├── pretrained/ # Shared pretrained sub-models |
|
|
│ ├── ASR/ # Text-to-mel alignment |
|
|
│ ├── JDC/ # Pitch extraction (F0) |
|
|
│ └── PLBERT/ # Text encoder |
|
|
├── models/ # Model architecture code |
|
|
│ ├── core.py |
|
|
│ ├── hifigan.py |
|
|
│ └── diffusion/ |
|
|
├── inference.py # Main API |
|
|
├── hub.py # HuggingFace Hub utilities |
|
|
└── text_utils.py # Phoneme tokenization |
|
|
``` |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python >= 3.8 |
|
|
- PyTorch >= 1.13.0 |
|
|
- CUDA-capable GPU recommended (CPU inference also works)
|
|
- espeak-ng system package |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Requires a reference audio file for style/voice transfer |
|
|
- Output quality depends on the quality of the reference clip
|
|
- Best results with 3–15 second reference clips
|
|
- The Hindi-English model was trained on only 5 speakers
|
|
- The Telugu model was trained on a single speaker
|
|
|
|
|
## Citation |
|
|
|
|
|
Based on StyleTTS2: |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{li2024styletts, |
|
|
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, |
|
|
author={Li, Yinghao Aaron and Han, Cong and Raber, Vinay S and Mesgarani, Nima}, |
|
|
booktitle={NeurIPS}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License |
|
|
|
|
|
## Links |
|
|
|
|
|
- **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka) |
|
|
- **PyPI**: [chiluka](https://pypi.org/project/chiluka/) |
|
|
|