---
language:
  - en
  - hi
  - te
license: mit
library_name: chiluka
pipeline_tag: text-to-speech
tags:
  - text-to-speech
  - tts
  - styletts2
  - voice-cloning
  - multi-language
  - hindi
  - english
  - telugu
  - multi-speaker
  - style-transfer
---

# Chiluka TTS

Chiluka (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on StyleTTS2 with style transfer from reference audio.

## Available Models

| Model | Name | Languages | Speakers |
|-------|------|-----------|----------|
| Hindi-English (default) | `hindi_english` | Hindi, English | 5 |
| Telugu | `telugu` | Telugu, English | 1 |

## Installation

```bash
pip install git+https://github.com/PurviewVoiceBot/chiluka.git

# Required system dependency
sudo apt-get install espeak-ng    # Ubuntu/Debian
```

## Usage

Model weights download automatically on first use.

```python
from chiluka import Chiluka

# Load Hindi-English model (default)
tts = Chiluka.from_pretrained()

# Or Telugu model
# tts = Chiluka.from_pretrained(model="telugu")

wav = tts.synthesize(
    text="Hello, this is Chiluka speaking!",
    reference_audio="path/to/reference.wav",
    language="en-us"
)
tts.save_wav(wav, "output.wav")
```

### Hindi

```python
tts = Chiluka.from_pretrained()
wav = tts.synthesize(
    text="नमस्ते, मैं चिलुका बोल रहा हूं",
    reference_audio="reference.wav",
    language="hi"
)
tts.save_wav(wav, "hindi_output.wav")
```

### Telugu

```python
tts = Chiluka.from_pretrained(model="telugu")
wav = tts.synthesize(
    text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
    reference_audio="reference.wav",
    language="te"
)
tts.save_wav(wav, "telugu_output.wav")
```

## Streaming Audio

For WebRTC, WebSocket, or HTTP streaming:

```python
wav = tts.synthesize("Hello!", "reference.wav", language="en-us")

# Get audio as bytes (no disk write)
mp3_bytes = tts.to_audio_bytes(wav, format="mp3")    # requires pydub + ffmpeg
wav_bytes = tts.to_audio_bytes(wav, format="wav")
pcm_bytes = tts.to_audio_bytes(wav, format="pcm")    # raw 16-bit PCM

# Stream chunked audio
for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
    websocket.send(chunk)  # PCM chunks by default

# Stream as MP3 chunks
for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
    response.write(chunk)
```
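Raw PCM chunks are plain 16-bit signed little-endian samples with no container or header. If you need to produce or inspect such bytes yourself (e.g. to feed a custom audio sink), the conversion from float samples in [-1, 1] is simple. A minimal standalone sketch, independent of Chiluka's own `to_audio_bytes`:

```python
import struct

def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed
    little-endian PCM bytes (2 bytes per sample)."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])  # 4 samples -> 8 bytes
```

The clamp guards against values slightly outside [-1, 1], which vocoder output can occasionally produce.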

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Input text to synthesize |
| `reference_audio` | required | Path to reference audio for voice style |
| `language` | `"en-us"` | espeak-ng language code (see below) |
| `alpha` | 0.3 | Acoustic style mixing (0 = reference, 1 = predicted) |
| `beta` | 0.7 | Prosodic style mixing (0 = reference, 1 = predicted) |
| `diffusion_steps` | 5 | More steps = better quality, slower |
| `embedding_scale` | 1.0 | Classifier-free guidance strength |
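The `alpha`/`beta` values act as interpolation weights between the style extracted from the reference audio and the style predicted from text. A minimal sketch of the mixing idea (the real model interpolates style vectors inside StyleTTS2, not scalars as shown here):

```python
def mix_style(reference, predicted, weight):
    """Element-wise interpolation: weight=0 keeps only the
    reference-derived style, weight=1 only the predicted style."""
    return [(1 - weight) * r + weight * p for r, p in zip(reference, predicted)]

ref = [0.2, 0.8]    # style from reference audio
pred = [1.0, 0.0]   # style predicted from text
mixed = mix_style(ref, pred, 0.3)  # default alpha: mostly reference
```

Lower `alpha`/`beta` therefore pull the output closer to the reference speaker's timbre and prosody; higher values let the model's own prediction dominate.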

## Language Codes

| Language | Code | Available In |
|----------|------|--------------|
| English (US) | `en-us` | All models |
| English (UK) | `en-gb` | All models |
| Hindi | `hi` | `hindi_english` |
| Telugu | `te` | `telugu` |

## Architecture

- **Text Encoder**: Token embedding + CNN + BiLSTM
- **Style Encoder**: Conv2D + residual blocks (`style_dim=128`)
- **Prosody Predictor**: LSTM-based with AdaIN normalization
- **Diffusion Model**: Transformer-based denoiser with ADPM2 sampler
- **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
- **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
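AdaIN (adaptive instance normalization) is how the prosody predictor injects the style vector: features are normalized to zero mean and unit variance, then shifted and scaled by style-derived statistics. A minimal sketch of the normalization step, not the model's exact layer:

```python
import statistics

def adain(features, style_mean, style_std, eps=1e-5):
    """Normalize features, then re-scale and re-shift them with
    statistics derived from the style vector."""
    mu = statistics.fmean(features)
    sigma = statistics.pstdev(features)
    return [style_mean + style_std * (x - mu) / (sigma + eps) for x in features]

out = adain([1.0, 2.0, 3.0], style_mean=0.5, style_std=2.0)
```

Because the shift and scale come from the style vector, the same text features can render with different prosody depending on the reference audio.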

## Requirements

- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA recommended
- espeak-ng
- pydub + ffmpeg (only for MP3/OGG streaming)

## Citation

Based on StyleTTS 2:

```bibtex
@inproceedings{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  booktitle={NeurIPS},
  year={2023}
}
```

## License

MIT License

## Links