	---
	language:
	- en
	- hi
	- te
	license: mit
	library_name: chiluka
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- tts
	- styletts2
	- voice-cloning
	- multi-language
	- hindi
	- english
	- telugu
	- multi-speaker
	- style-transfer
	---

	# Chiluka TTS

	Chiluka (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.

	## Available Models

	| Model | Name | Languages | Speakers |
	|-------|------|-----------|----------|
	| Hindi-English (default) | `hindi_english` | Hindi, English | 5 |
	| Telugu | `telugu` | Telugu, English | 1 |

	## Installation

	```bash
	pip install git+https://github.com/PurviewVoiceBot/chiluka.git

	# Required system dependency
	sudo apt-get install espeak-ng # Ubuntu/Debian
	```

	## Usage

	Model weights download automatically on first use.

	```python
	from chiluka import Chiluka

	# Load Hindi-English model (default)
	tts = Chiluka.from_pretrained()

	# Or Telugu model
	# tts = Chiluka.from_pretrained(model="telugu")

	wav = tts.synthesize(
	    text="Hello, this is Chiluka speaking!",
	    reference_audio="path/to/reference.wav",
	    language="en-us",
	)
	tts.save_wav(wav, "output.wav")
	```
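
When many utterances share one reference voice, the two calls shown above can be wrapped in a small loop. The following is a minimal sketch, not part of the library: `synthesize_batch` and the `utt_NNN.wav` naming scheme are hypothetical, and the helper relies only on the `synthesize()` and `save_wav()` methods used in the example above.

```python
from pathlib import Path


def synthesize_batch(tts, lines, reference_audio, out_dir, language="en-us"):
    """Synthesize each text in `lines` and save it as a numbered WAV file.

    `tts` is any object exposing synthesize() and save_wav() as used above.
    Returns the list of written file paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(lines, start=1):
        text = text.strip()
        if not text:
            continue  # skip blank lines; numbering still follows the input position
        wav = tts.synthesize(
            text=text,
            reference_audio=reference_audio,
            language=language,
        )
        path = out_dir / f"utt_{index:03d}.wav"  # hypothetical naming scheme
        tts.save_wav(wav, str(path))
        written.append(path)
    return written
```

Passing the model in as an argument means it is loaded once and reused for every line, rather than calling `from_pretrained()` per utterance.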

	### Hindi

	```python
	tts = Chiluka.from_pretrained()
	wav = tts.synthesize(
	    text="नमस्ते, मैं चिलुका बोल रहा हूं",
	    reference_audio="reference.wav",
	    language="hi",
	)
	tts.save_wav(wav, "hindi_output.wav")
	```

	### Telugu

	```python
	tts = Chiluka.from_pretrained(model="telugu")
	wav = tts.synthesize(
	    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
	    reference_audio="reference.wav",
	    language="te",
	)
	tts.save_wav(wav, "telugu_output.wav")
	```

	## Streaming Audio

	For WebRTC, WebSocket, or HTTP streaming, audio can be returned as in-memory bytes or generated chunk by chunk:

	```python
	wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

	# Get audio as bytes (no disk write)
	mp3_bytes = tts.to_audio_bytes(wav, format="mp3") # requires pydub + ffmpeg
	wav_bytes = tts.to_audio_bytes(wav, format="wav")
	pcm_bytes = tts.to_audio_bytes(wav, format="pcm") # raw 16-bit PCM

	# Stream chunked audio (`websocket` is a connection object supplied by your app)
	for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
	    websocket.send(chunk)  # PCM chunks by default

	# Stream as MP3 chunks (`response` is your HTTP framework's writable response)
	for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
	    response.write(chunk)
	```
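
Raw PCM chunks carry no header, so a receiver has to know the sample rate and channel count before it can play or store them. As a rough sketch, the standard-library `wave` module can wrap collected PCM in a WAV container on the receiving side. The helper name `pcm16_to_wav_bytes` is hypothetical, and the 24 kHz mono defaults are assumptions (24 kHz is what upstream StyleTTS2 uses); confirm the actual output rate of your model before relying on them.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM in a WAV container, entirely in memory.

    sample_rate and channels are assumed defaults, not values stated by Chiluka.
    """
    if len(pcm_bytes) % (2 * channels) != 0:
        raise ValueError("PCM length is not a whole number of 16-bit frames")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

A client could, for example, join the chunks it received with `b"".join(chunks)` and pass the result to this helper to obtain a playable file.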

	## Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `text` | required | Input text to synthesize |
	| `reference_audio` | required | Path to reference audio for voice style |
	| `language` | `"en-us"` | espeak-ng language code (see below) |
	| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
	| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
	| `diffusion_steps` | `5` | Number of diffusion sampling steps; more steps give better quality but slower synthesis |
	| `embedding_scale` | `1.0` | Classifier-free guidance strength |
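
The `alpha` and `beta` rows describe a blend between the style taken from the reference audio and the style predicted from the text. The simplest reading of "0 = reference, 1 = predicted" is a linear interpolation, sketched below on toy vectors. This is an illustration of the idea only: `mix_style` is a hypothetical helper, and whether Chiluka blends in exactly this way internally is an assumption.

```python
def mix_style(reference, predicted, weight):
    """Linearly interpolate two style vectors of equal length.

    weight = 0 returns the reference style, weight = 1 the predicted style.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(reference) != len(predicted):
        raise ValueError("style vectors must have the same length")
    return [(1.0 - weight) * r + weight * p for r, p in zip(reference, predicted)]


# Toy two-dimensional styles; real style vectors are much larger.
reference_style = [1.0, 0.0]
predicted_style = [0.0, 1.0]

acoustic = mix_style(reference_style, predicted_style, 0.3)  # alpha default
prosodic = mix_style(reference_style, predicted_style, 0.7)  # beta default
```

With the defaults, the acoustic style stays closer to the reference voice while the prosody leans toward what the model predicts from the text.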

	## Language Codes

	| Language | Code | Available In |
	|----------|------|-------------|
	| English (US) | `en-us` | All models |
	| English (UK) | `en-gb` | All models |
	| Hindi | `hi` | `hindi_english` |
	| Telugu | `te` | `telugu` |
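
Because a language code is only valid for some models, it can help to check the pairing before loading weights. The sketch below encodes the table above as a dictionary; `SUPPORTED_LANGUAGES`, `models_for_language`, and `check_language` are hypothetical names and are not provided by the `chiluka` package.

```python
# Language codes per model, transcribed from the table above.
SUPPORTED_LANGUAGES = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}


def models_for_language(code):
    """Return the model names (sorted) that accept the given language code."""
    return sorted(
        name for name, codes in SUPPORTED_LANGUAGES.items() if code in codes
    )


def check_language(model, code):
    """Raise ValueError if `code` is not listed for `model`."""
    if model not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unknown model: {model!r}")
    if code not in SUPPORTED_LANGUAGES[model]:
        candidates = models_for_language(code) or ["none"]
        raise ValueError(
            f"{code!r} is not available in {model!r}; "
            f"models that list it: {', '.join(candidates)}"
        )
```

Calling `check_language("telugu", "hi")` fails early with a message pointing at `hindi_english`, instead of surfacing as a phonemization or quality problem later.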

	## Architecture

	- Text Encoder: Token embedding + CNN + BiLSTM
	- Style Encoder: Conv2D + Residual blocks (style_dim=128)
	- Prosody Predictor: LSTM-based with AdaIN normalization
	- Diffusion Model: Transformer-based denoiser with ADPM2 sampler
	- Decoder: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
	- Pretrained sub-models: PL-BERT (text), ASR (alignment), JDC (pitch)
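
The upsample rates listed for the decoder determine how much the vocoder's upsampling stack stretches its input in time: the stages multiply together. The arithmetic below works this out. The 24 kHz sample rate is an assumption borrowed from upstream StyleTTS2 and is not stated in this card, and any resampling elsewhere in the decoder is not accounted for.

```python
from math import prod

UPSAMPLE_RATES = (10, 5, 3, 2)   # from the Decoder entry above
ASSUMED_SAMPLE_RATE = 24_000     # assumption: not stated in this card

# Each frame entering the upsampling stack becomes this many audio samples.
samples_per_frame = prod(UPSAMPLE_RATES)

# Frame rate implied at the stack's input under the assumed sample rate.
frames_per_second = ASSUMED_SAMPLE_RATE / samples_per_frame

print(samples_per_frame, frames_per_second)
```

Under that assumption, each input frame covers 300 samples, or 12.5 ms of audio.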

	## Requirements

	- Python >= 3.8
	- PyTorch >= 1.13.0
	- CUDA recommended
	- espeak-ng
	- pydub + ffmpeg (only for MP3/OGG streaming)
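
A quick preflight check can confirm most of this list before the first synthesis call. The following sketch uses only the standard library; `check_environment` is a hypothetical helper, and it only tests for presence, so it does not verify the PyTorch version or whether CUDA is usable.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict mapping each requirement to True (found) or False."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "espeak-ng": shutil.which("espeak-ng") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        # Optional: only needed for MP3/OGG output.
        "pydub (optional)": importlib.util.find_spec("pydub") is not None,
        "ffmpeg (optional)": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:20s} {'ok' if found else 'MISSING'}")
```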

	## Citation

	Based on StyleTTS2:

	```bibtex
	@inproceedings{li2024styletts,
	  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
	  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
	  booktitle={NeurIPS},
	  year={2023}
	}
	```

	## License

	MIT License

	## Links

	- GitHub: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
	- HuggingFace: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)