# Chiluka
**Chiluka** (చిలుక, Telugu for "parrot") is a lightweight text-to-speech (TTS) inference package built on StyleTTS2, with style transfer from reference audio.
## Available Models
| Model | Name | Languages | Speakers | Description |
|-------|------|-----------|----------|-------------|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
| Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
Model weights are hosted on [HuggingFace](https://huggingface.co/Seemanth/chiluka) and downloaded automatically on first use.
## Installation
```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git
```
System dependency (required):
```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
```
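Before loading a model, you can verify that espeak-ng is actually on your `PATH`. This is a standalone stdlib sketch (the `has_espeak_ng` helper is not part of the chiluka API):

```python
import shutil

def has_espeak_ng() -> bool:
    """Return True when the espeak-ng binary is discoverable on PATH."""
    return shutil.which("espeak-ng") is not None

print("espeak-ng found:", has_espeak_ng())
```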
## Quick Start
```python
from chiluka import Chiluka
# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()
# Synthesize speech
wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)
# Save to file
tts.save_wav(wav, "output.wav")
```
### Load a Specific Model
```python
# Hindi-English (default)
tts = Chiluka.from_pretrained(model="hindi_english")
# Telugu
tts = Chiluka.from_pretrained(model="telugu")
```
## Examples
### Hindi
```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",  # "Hello, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```
### English
```python
wav = tts.synthesize(
    text="Hello, I am Chiluka, a text to speech system.",
    reference_audio="reference.wav",
    language="en-us"
)
tts.save_wav(wav, "english_output.wav")
```
### Telugu
```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",  # "Hello, I am Chiluka speaking"
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```
## Streaming Audio
For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.
### Get Audio Bytes
```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
# WAV bytes
wav_bytes = tts.to_audio_bytes(wav, format="wav")
# MP3 bytes (requires pydub and an ffmpeg installation)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")
# Raw PCM bytes (16-bit signed int, for WebRTC)
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")
# OGG bytes
ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
```
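Raw PCM in the 16-bit signed format is the conventional payload for WebRTC pipelines. The exact bytes chiluka emits are defined by `to_audio_bytes`; the following standalone sketch only illustrates the usual float-to-PCM16 conversion such output is assumed to use (`float_to_pcm16` is a hypothetical helper, not part of the API):

```python
import struct

def float_to_pcm16(samples):
    """Clamp floats to [-1.0, 1.0] and pack as little-endian signed 16-bit PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)

pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])
print(len(pcm))  # 4 samples * 2 bytes each = 8 bytes
```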
### Stream Audio Chunks
```python
# Stream PCM chunks over WebSocket
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)

# Stream MP3 chunks for HTTP response
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)

# Custom chunk size (default 4800 samples = 200 ms at 24 kHz)
for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
    process(chunk)
```
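If a client needs a playable file rather than a raw stream, PCM chunks can be wrapped in a WAV container with the standard-library `wave` module. This is a self-contained sketch: the chunks below are synthetic silence standing in for `tts.synthesize_stream(...)` output, and mono 16-bit audio is assumed:

```python
import io
import wave

def pcm_chunks_to_wav(chunks, sr=24000):
    """Wrap raw mono 16-bit PCM chunks in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # assuming mono output
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sr)
        for chunk in chunks:
            wf.writeframes(chunk)
    return buf.getvalue()

# Stand-in for streamed output: three 200 ms chunks of silence
silent_chunks = [b"\x00\x00" * 4800 for _ in range(3)]
wav_bytes = pcm_chunks_to_wav(silent_chunks)
print(wav_bytes[:4])  # b'RIFF'
```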
## API Reference
### Chiluka.from_pretrained()
```python
tts = Chiluka.from_pretrained(
    model="hindi_english",   # "hindi_english" or "telugu"
    device="cuda",           # "cuda" or "cpu" (auto-detects if None)
    force_download=False,    # Re-download even if cached
)
```
### synthesize()
```python
wav = tts.synthesize(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    alpha=0.3,                    # Acoustic style mixing (0-1)
    beta=0.7,                     # Prosodic style mixing (0-1)
    diffusion_steps=5,            # Quality vs speed tradeoff
    embedding_scale=1.0,          # Classifier-free guidance
    sr=24000                      # Sample rate
)
```
### to_audio_bytes()
```python
audio_bytes = tts.to_audio_bytes(
    wav,              # NumPy array from synthesize()
    format="mp3",     # "wav", "mp3", "ogg", "flac", "pcm"
    sr=24000,         # Sample rate
    bitrate="128k"    # Bitrate for mp3/ogg
)
```
### synthesize_stream()
```python
for chunk in tts.synthesize_stream(
    text="Hello world",           # Text to synthesize
    reference_audio="ref.wav",    # Reference audio for style
    language="en-us",             # Language code
    format="pcm",                 # "pcm", "wav", "mp3", "ogg"
    chunk_size=4800,              # Samples per chunk (200 ms at 24 kHz)
    sr=24000,                     # Sample rate
):
    process(chunk)
```
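The relationship between `chunk_size`, sample rate, and per-chunk latency is plain arithmetic, which is handy when sizing buffers. A standalone sketch (the `chunk_stats` helper is hypothetical, not part of the chiluka API):

```python
import math

def chunk_stats(num_samples, chunk_size=4800, sr=24000):
    """Return (number of chunks emitted, seconds of audio per chunk)."""
    return math.ceil(num_samples / chunk_size), chunk_size / sr

# A 3-second clip at 24 kHz is 72,000 samples
n_chunks, secs = chunk_stats(3 * 24000)
print(n_chunks, secs)  # 15 chunks of 0.2 s each
```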
### Other Methods
```python
tts.save_wav(wav, "output.wav") # Save to WAV file
tts.play(wav) # Play via speakers (requires pyaudio)
style = tts.compute_style("reference.wav") # Get style embedding
```
## Synthesis Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `alpha` | 0.3 | Acoustic style mixing (0=reference only, 1=predicted only) |
| `beta` | 0.7 | Prosodic style mixing (0=reference only, 1=predicted only) |
| `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
| `embedding_scale` | 1.0 | Classifier-free guidance scale |
## Language Codes
These are espeak-ng language codes passed to the `language` parameter:
| Language | Code | Available In |
|----------|------|-------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |
## Requirements
- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA-capable GPU (recommended)
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG output)
## Credits
Based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) by Yinghao Aaron Li et al.
## License
MIT License