Added streaming function

Files changed (12)

.gitignore +3 -0
MODEL_CARD.md +48 -99
README.md +82 -181
README_HF.md +0 -92
chiluka/__init__.py +1 -1
chiluka/hub.py +1 -1
chiluka/inference.py +131 -4
examples/basic_synthesis.py +1 -1
examples/huggingface_example.py +2 -2
examples/pip_example.py +2 -2
examples/torchhub_example.py +2 -2
hubconf.py +2 -2

.gitignore CHANGED

@@ -80,3 +80,6 @@ test_outputs/
 # Large checkpoint files (hosted on Hugging Face: https://huggingface.co/Seemanth/chiluka)
 chiluka/checkpoints/epoch_2nd_00017.pth
 chiluka/checkpoints/epoch_2nd_00029.pth

 # Large checkpoint files (hosted on Hugging Face: https://huggingface.co/Seemanth/chiluka)
 chiluka/checkpoints/epoch_2nd_00017.pth
 chiluka/checkpoints/epoch_2nd_00029.pth
+# Deploy commands (local only)
+DEPLOY.md

MODEL_CARD.md CHANGED

@@ -21,75 +21,49 @@ tags:
 # Chiluka TTS
-**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).
-It supports **style transfer from reference audio** - give it a voice sample and it will speak in that style.
 ## Available Models
-| Model | Name | Languages | Speakers | Description |
-|-------|------|-----------|----------|-------------|
-| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
-| **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
 ## Installation
-```bash
-pip install chiluka
-```
-Or from GitHub:
 ```bash
 pip install git+https://github.com/PurviewVoiceBot/chiluka.git
-```
-**System dependency** (required for phonemization):
-```bash
-# Ubuntu/Debian
-sudo apt-get install espeak-ng
-# macOS
-brew install espeak-ng
 ```
-## Quick Start
 ```python
 from chiluka import Chiluka
-# Load model (weights download automatically on first use)
 tts = Chiluka.from_pretrained()
-# Synthesize speech
 wav = tts.synthesize(
     text="Hello, this is Chiluka speaking!",
     reference_audio="path/to/reference.wav",
-    language="en"
 )
-# Save output
 tts.save_wav(wav, "output.wav")
 ```
-## Choose a Model
-```python
-from chiluka import Chiluka
-# Hindi + English (default)
-tts = Chiluka.from_pretrained(model="hindi_english")
-# Telugu + English
-tts = Chiluka.from_pretrained(model="telugu")
-```
-## Hindi Example
 ```python
 tts = Chiluka.from_pretrained()
 wav = tts.synthesize(
     text="नमस्ते, मैं चिलुका बोल रहा हूं",
     reference_audio="reference.wav",
@@ -98,11 +72,10 @@ wav = tts.synthesize(
 tts.save_wav(wav, "hindi_output.wav")
 ```
-## Telugu Example
 ```python
 tts = Chiluka.from_pretrained(model="telugu")
 wav = tts.synthesize(
     text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
     reference_audio="reference.wav",
@@ -111,44 +84,49 @@ wav = tts.synthesize(
 tts.save_wav(wav, "telugu_output.wav")
 ```
-## PyTorch Hub
 ```python
-import torch
-# Hindi-English (default)
-tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-# Telugu
-tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')
-wav = tts.synthesize("Hello!", "reference.wav", language="en")
 ```
-## Synthesis Parameters
 | Parameter | Default | Description |
 |-----------|---------|-------------|
 | `text` | required | Input text to synthesize |
 | `reference_audio` | required | Path to reference audio for voice style |
-| `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
-| `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
-| `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
-| `diffusion_steps` | `5` | More steps = better quality, slower inference |
 | `embedding_scale` | `1.0` | Classifier-free guidance strength |
-## How It Works
-Chiluka uses a StyleTTS2-based pipeline:
-1. **Text** is converted to phonemes using espeak-ng
-2. **PL-BERT** encodes text into contextual embeddings
-3. **Reference audio** is processed to extract a style vector
-4. **Diffusion model** samples a style conditioned on text
-5. **Prosody predictor** generates duration, pitch (F0), and energy
-6. **HiFi-GAN decoder** synthesizes the final waveform at 24kHz
-## Model Architecture
 - **Text Encoder**: Token embedding + CNN + BiLSTM
 - **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
@@ -157,42 +135,13 @@ Chiluka uses a StyleTTS2-based pipeline:
 - **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
 - **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
-## File Structure
-```
-├── configs/
-│   ├── config_ft.yml                 # Telugu model config
-│   └── config_hindi_english.yml      # Hindi-English model config
-├── checkpoints/
-│   ├── epoch_2nd_00017.pth           # Telugu checkpoint (~2GB)
-│   └── epoch_2nd_00029.pth           # Hindi-English checkpoint (~2GB)
-├── pretrained/                       # Shared pretrained sub-models
-│   ├── ASR/                          # Text-to-mel alignment
-│   ├── JDC/                          # Pitch extraction (F0)
-│   └── PLBERT/                       # Text encoder
-├── models/                           # Model architecture code
-│   ├── core.py
-│   ├── hifigan.py
-│   └── diffusion/
-├── inference.py                      # Main API
-├── hub.py                            # HuggingFace Hub utilities
-└── text_utils.py                     # Phoneme tokenization
-```
 ## Requirements
 - Python >= 3.8
 - PyTorch >= 1.13.0
-- CUDA recommended (works on CPU too)
-- espeak-ng system package
-## Limitations
-- Requires a reference audio file for style/voice transfer
-- Quality depends on the reference audio quality
-- Best results with 3-15 second reference clips
-- Hindi-English model trained on 5 speakers
-- Telugu model trained on 1 speaker
 ## Citation
@@ -214,4 +163,4 @@ MIT License
 ## Links
 - **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
-- **PyPI**: [chiluka](https://pypi.org/project/chiluka/)

 # Chiluka TTS
+**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.
 ## Available Models
+| Model | Name | Languages | Speakers |
+|-------|------|-----------|----------|
+| **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 |
+| **Telugu** | `telugu` | Telugu, English | 1 |
 ## Installation
 ```bash
 pip install git+https://github.com/PurviewVoiceBot/chiluka.git
+# Required system dependency
+sudo apt-get install espeak-ng    # Ubuntu/Debian
 ```
+## Usage
+Model weights download automatically on first use.
 ```python
 from chiluka import Chiluka
+# Load Hindi-English model (default)
 tts = Chiluka.from_pretrained()
+# Or Telugu model
+# tts = Chiluka.from_pretrained(model="telugu")
 wav = tts.synthesize(
     text="Hello, this is Chiluka speaking!",
     reference_audio="path/to/reference.wav",
+    language="en-us"
 )
 tts.save_wav(wav, "output.wav")
 ```
+### Hindi
 ```python
 tts = Chiluka.from_pretrained()
 wav = tts.synthesize(
     text="नमस्ते, मैं चिलुका बोल रहा हूं",
     reference_audio="reference.wav",
 tts.save_wav(wav, "hindi_output.wav")
 ```
+### Telugu
 ```python
 tts = Chiluka.from_pretrained(model="telugu")
 wav = tts.synthesize(
     text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
     reference_audio="reference.wav",
 tts.save_wav(wav, "telugu_output.wav")
 ```
+## Streaming Audio
+For WebRTC, WebSocket, or HTTP streaming:
 ```python
+wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
+# Get audio as bytes (no disk write)
+mp3_bytes = tts.to_audio_bytes(wav, format="mp3")    # requires pydub + ffmpeg
+wav_bytes = tts.to_audio_bytes(wav, format="wav")
+pcm_bytes = tts.to_audio_bytes(wav, format="pcm")    # raw 16-bit PCM
+# Stream chunked audio
+for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
+    websocket.send(chunk)  # PCM chunks by default
+# Stream as MP3 chunks
+for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
+    response.write(chunk)
 ```
+## Parameters
 | Parameter | Default | Description |
 |-----------|---------|-------------|
 | `text` | required | Input text to synthesize |
 | `reference_audio` | required | Path to reference audio for voice style |
+| `language` | `"en-us"` | espeak-ng language code (see below) |
+| `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
+| `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
+| `diffusion_steps` | `5` | More steps = better quality, slower |
 | `embedding_scale` | `1.0` | Classifier-free guidance strength |
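The `alpha` and `beta` rows describe a mix between the style taken from the reference audio and the style predicted from the text. A minimal sketch of that reading, assuming plain linear interpolation as in the upstream StyleTTS2 inference demo. `mix_style` is a hypothetical helper, not part of the Chiluka API, and this diff does not show how Chiluka combines the two vectors internally.

```python
from typing import List


def mix_style(reference: List[float], predicted: List[float], weight: float) -> List[float]:
    """Linearly interpolate two style vectors (hypothetical helper).

    weight=0 returns the reference style and weight=1 returns the predicted
    style, matching the "0 = reference, 1 = predicted" wording in the table.
    """
    if len(reference) != len(predicted):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [weight * p + (1.0 - weight) * r for r, p in zip(reference, predicted)]


ref_style = [0.0, 1.0, 2.0]   # toy stand-in for a style vector from the reference audio
pred_style = [1.0, 1.0, 0.0]  # toy stand-in for a style predicted from the text

acoustic = mix_style(ref_style, pred_style, weight=0.3)  # alpha default from the table
prosodic = mix_style(ref_style, pred_style, weight=0.7)  # beta default from the table
```

With the defaults, the acoustic style stays closer to the reference (weight 0.3) while the prosodic style leans toward the prediction (weight 0.7).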
+## Language Codes
+| Language | Code | Available In |
+|----------|------|-------------|
+| English (US) | `en-us` | All models |
+| English (UK) | `en-gb` | All models |
+| Hindi | `hi` | `hindi_english` |
+| Telugu | `te` | `telugu` |
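The table pairs each language code with the models that support it. One way a caller might guard against a mismatched pair before synthesis is sketched below. The mapping is copied from the table; `check_language` is a hypothetical helper, and nothing in this diff indicates that Chiluka performs this check itself.

```python
# Codes per model, copied from the Language Codes table.
SUPPORTED_LANGUAGES = {
    "hindi_english": {"en-us", "en-gb", "hi"},
    "telugu": {"en-us", "en-gb", "te"},
}


def check_language(model: str, language: str) -> str:
    """Return the language code if the model lists it, otherwise raise ValueError."""
    if model not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unknown model: {model!r}")
    if language not in SUPPORTED_LANGUAGES[model]:
        allowed = ", ".join(sorted(SUPPORTED_LANGUAGES[model]))
        raise ValueError(f"{language!r} is not listed for {model!r}; expected one of: {allowed}")
    return language


language = check_language("telugu", "te")  # passes: Telugu is listed for the telugu model
```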
+## Architecture
 - **Text Encoder**: Token embedding + CNN + BiLSTM
 - **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
 - **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
 - **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
 ## Requirements
 - Python >= 3.8
 - PyTorch >= 1.13.0
+- CUDA recommended
+- espeak-ng
+- pydub + ffmpeg (only for MP3/OGG/FLAC output)
 ## Citation
 ## Links
 - **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
+- **HuggingFace**: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)

README.md CHANGED

@@ -1,14 +1,6 @@
 # Chiluka
-**Chiluka** (చిలుక - Telugu for "parrot") is a self-contained TTS (Text-to-Speech) inference package based on StyleTTS2.
-## Features
-- Simple, clean API for TTS synthesis
-- Style transfer from reference audio
-- Multi-language support via phonemizer
-- **Multiple models** - Hindi-English and Telugu
-- **Multiple ways to load** - HuggingFace Hub, PyTorch Hub, pip install
 ## Available Models
@@ -17,29 +9,15 @@
 | Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
 | Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
-## Installation
-### Option 1: pip install
-```bash
-pip install chiluka
-```
-### Option 2: Install from GitHub
 ```bash
 pip install git+https://github.com/PurviewVoiceBot/chiluka.git
 ```
-### Option 3: From Source
-```bash
-git clone https://github.com/PurviewVoiceBot/chiluka.git
-cd chiluka
-pip install -e .
-```
-### System Dependency: espeak-ng (Required)
 ```bash
 # Ubuntu/Debian
@@ -51,10 +29,6 @@ brew install espeak-ng
 ## Quick Start
-### HuggingFace Hub (Recommended)
-Model weights download automatically on first use. No cloning needed.
 ```python
 from chiluka import Chiluka
@@ -65,7 +39,7 @@ tts = Chiluka.from_pretrained()
 wav = tts.synthesize(
     text="Hello, this is Chiluka speaking!",
     reference_audio="path/to/reference.wav",
-    language="en"
 )
 # Save to file
@@ -75,8 +49,6 @@ tts.save_wav(wav, "output.wav")
 ### Load a Specific Model
 ```python
-from chiluka import Chiluka
 # Hindi-English (default)
 tts = Chiluka.from_pretrained(model="hindi_english")
@@ -84,111 +56,92 @@ tts = Chiluka.from_pretrained(model="hindi_english")
 tts = Chiluka.from_pretrained(model="telugu")
 ```
-### PyTorch Hub
-```python
-import torch
-# Hindi-English (default)
-tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-# Telugu
-tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')
-# Synthesize
-wav = tts.synthesize(
-    text="Hello from PyTorch Hub!",
-    reference_audio="reference.wav",
-    language="en"
-)
-```
-### Local Weights (if you cloned with Git LFS)
-```python
-from chiluka import Chiluka
-tts = Chiluka()  # uses bundled weights from cloned repo
-```
 ## Examples
-### Hindi Synthesis
 ```python
-from chiluka import Chiluka
-tts = Chiluka.from_pretrained(model="hindi_english")
 wav = tts.synthesize(
     text="नमस्ते, मैं चिलुका बोल रहा हूं",
-    reference_audio="hindi_reference.wav",
     language="hi"
 )
 tts.save_wav(wav, "hindi_output.wav")
 ```
-### English Synthesis
 ```python
 wav = tts.synthesize(
     text="Hello, I am Chiluka, a text to speech system.",
-    reference_audio="english_reference.wav",
-    language="en"
 )
 tts.save_wav(wav, "english_output.wav")
 ```
-### Telugu Synthesis
 ```python
-from chiluka import Chiluka
 tts = Chiluka.from_pretrained(model="telugu")
 wav = tts.synthesize(
     text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
-    reference_audio="telugu_reference.wav",
     language="te"
 )
 tts.save_wav(wav, "telugu_output.wav")
 ```
-### List Available Models
 ```python
-from chiluka import list_models
-models = list_models()
-for name, info in models.items():
-    print(f"{name}: {info['description']} ({', '.join(info['languages'])})")
 ```
 ## API Reference
-### Loading the Model
 ```python
-# Auto-download from HuggingFace (recommended)
-tts = Chiluka.from_pretrained()                              # Hindi-English (default)
-tts = Chiluka.from_pretrained(model="telugu")                # Telugu
-tts = Chiluka.from_pretrained(model="hindi_english")         # Hindi-English (explicit)
-# With options
 tts = Chiluka.from_pretrained(
-    model="hindi_english",          # Model variant
-    repo_id="Seemanth/chiluka", # HuggingFace repo
-    device="cuda",                  # or "cpu"
-    force_download=False,           # Re-download even if cached
-    token="hf_xxx"                  # For private repos
-)
-# Local weights
-tts = Chiluka(
-    config_path="path/to/config.yml",
-    checkpoint_path="path/to/model.pth",
-    pretrained_dir="path/to/pretrained/",
-    device="cuda"
 )
 ```
@@ -198,7 +151,7 @@ tts = Chiluka(
 wav = tts.synthesize(
     text="Hello world",           # Text to synthesize
     reference_audio="ref.wav",    # Reference audio for style
-    language="en",                # Language code
     alpha=0.3,                    # Acoustic style mixing (0-1)
     beta=0.7,                     # Prosodic style mixing (0-1)
     diffusion_steps=5,            # Quality vs speed tradeoff
@@ -207,17 +160,37 @@ wav = tts.synthesize(
 )
 ```
-### Other Methods
 ```python
-# Save audio to file
-tts.save_wav(wav, "output.wav", sr=24000)
-# Play audio (requires pyaudio)
-tts.play(wav, sr=24000)
-# Get style embedding from audio
-style = tts.compute_style("reference.wav", sr=24000)
 ```
 ## Synthesis Parameters
@@ -229,9 +202,9 @@ style = tts.compute_style("reference.wav", sr=24000)
 | `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
 | `embedding_scale` | 1.0 | Classifier-free guidance scale |
-## Supported Languages
-Uses [phonemizer](https://github.com/bootphon/phonemizer) with espeak-ng:
 | Language | Code | Available In |
 |----------|------|-------------|
@@ -239,86 +212,14 @@ Uses [phonemizer](https://github.com/bootphon/phonemizer) with espeak-ng:
 | English (UK) | `en-gb` | All models |
 | Hindi | `hi` | `hindi_english` |
 | Telugu | `te` | `telugu` |
-| Tamil | `ta` | With fine-tuning |
-| Kannada | `kn` | With fine-tuning |
-## Hub Utilities
-```python
-from chiluka import list_models, clear_cache, push_to_hub, get_cache_dir
-# List available models
-list_models()
-# Clear cache
-clear_cache()                           # Clear all
-clear_cache("Seemanth/chiluka")     # Clear specific repo
-# Push your own model to HuggingFace
-push_to_hub(
-    local_dir="./my-model",
-    repo_id="myusername/my-chiluka-model",
-    token="hf_your_token"
-)
-# Check cache location
-print(get_cache_dir())  # ~/.cache/chiluka
-```
-## Environment Variables
-| Variable | Description |
-|----------|-------------|
-| `CHILUKA_CACHE` | Custom cache directory (default: `~/.cache/chiluka`) |
-| `HF_TOKEN` | HuggingFace API token for private repos |
 ## Requirements
 - Python >= 3.8
 - PyTorch >= 1.13.0
-- CUDA (recommended for faster inference)
 - espeak-ng
-## Package Structure
-```
-chiluka/
-├── chiluka/
-│   ├── __init__.py
-│   ├── inference.py              # Main Chiluka API
-│   ├── hub.py                    # Hub download + model registry
-│   ├── text_utils.py
-│   ├── utils.py
-│   ├── configs/
-│   │   ├── config_ft.yml         # Telugu model config
-│   │   └── config_hindi_english.yml  # Hindi-English model config
-│   ├── checkpoints/
-│   │   ├── epoch_2nd_00017.pth   # Telugu checkpoint
-│   │   └── epoch_2nd_00029.pth   # Hindi-English checkpoint
-│   ├── pretrained/               # Shared pretrained sub-models
-│   │   ├── ASR/
-│   │   ├── JDC/
-│   │   └── PLBERT/
-│   └── models/
-├── hubconf.py                    # PyTorch Hub config
-├── examples/
-│   ├── basic_synthesis.py
-│   ├── telugu_synthesis.py
-│   ├── huggingface_example.py
-│   ├── torchhub_example.py
-│   └── pip_example.py
-├── setup.py
-└── README.md
-```
-## Training Your Own Model
-This package is for **inference only**. To train your own model, use the original [StyleTTS2](https://github.com/yl4579/StyleTTS2) repository.
-After training:
-1. Copy your checkpoint and config to a directory
-2. Push to HuggingFace Hub using `push_to_hub()`
-3. Load with `Chiluka.from_pretrained("your-repo")`
 ## Credits

 # Chiluka
+**Chiluka** (చిలుక - Telugu for "parrot") is a lightweight TTS (Text-to-Speech) inference package based on StyleTTS2 with style transfer from reference audio.
 ## Available Models
 | Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
 | Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
+Model weights are hosted on [HuggingFace](https://huggingface.co/Seemanth/chiluka) and downloaded automatically on first use.
+## Installation
 ```bash
 pip install git+https://github.com/PurviewVoiceBot/chiluka.git
 ```
+System dependency (required):
 ```bash
 # Ubuntu/Debian
 ## Quick Start
 ```python
 from chiluka import Chiluka
 wav = tts.synthesize(
     text="Hello, this is Chiluka speaking!",
     reference_audio="path/to/reference.wav",
+    language="en-us"
 )
 # Save to file
 ### Load a Specific Model
 ```python
 # Hindi-English (default)
 tts = Chiluka.from_pretrained(model="hindi_english")
 tts = Chiluka.from_pretrained(model="telugu")
 ```
 ## Examples
+### Hindi
 ```python
+tts = Chiluka.from_pretrained()
 wav = tts.synthesize(
     text="नमस्ते, मैं चिलुका बोल रहा हूं",
+    reference_audio="reference.wav",
     language="hi"
 )
 tts.save_wav(wav, "hindi_output.wav")
 ```
+### English
 ```python
 wav = tts.synthesize(
     text="Hello, I am Chiluka, a text to speech system.",
+    reference_audio="reference.wav",
+    language="en-us"
 )
 tts.save_wav(wav, "english_output.wav")
 ```
+### Telugu
 ```python
 tts = Chiluka.from_pretrained(model="telugu")
 wav = tts.synthesize(
     text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
+    reference_audio="reference.wav",
     language="te"
 )
 tts.save_wav(wav, "telugu_output.wav")
 ```
+## Streaming Audio
+For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.
+### Get Audio Bytes
 ```python
+wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
+# WAV bytes
+wav_bytes = tts.to_audio_bytes(wav, format="wav")
+# MP3 bytes (requires: pip install pydub, and ffmpeg installed)
+mp3_bytes = tts.to_audio_bytes(wav, format="mp3")
+# Raw PCM bytes (16-bit signed int, for WebRTC)
+pcm_bytes = tts.to_audio_bytes(wav, format="pcm")
+# OGG bytes
+ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
+```
+### Stream Audio Chunks
+```python
+# Stream PCM chunks over WebSocket
+for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
+    websocket.send(chunk)
+# Stream MP3 chunks for HTTP response
+for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
+    response.write(chunk)
+# Custom chunk size (default 4800 samples = 200ms at 24kHz)
+for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
+    process(chunk)
 ```
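The "4800 samples = 200ms at 24kHz" comment follows from simple arithmetic, which can be handy when picking a `chunk_size` for a target latency. A small sketch, assuming mono 16-bit PCM as described for the `pcm` format; both helper names are hypothetical.

```python
def samples_per_chunk(duration_ms: int, sample_rate: int = 24000) -> int:
    """Number of samples that cover duration_ms milliseconds of audio."""
    return sample_rate * duration_ms // 1000


def pcm_chunk_bytes(num_samples: int, sample_width: int = 2, channels: int = 1) -> int:
    """Byte size of a raw PCM chunk (2 bytes per sample for 16-bit audio)."""
    return num_samples * sample_width * channels


default_chunk = samples_per_chunk(200)            # the documented default chunk size
half_chunk = samples_per_chunk(100)               # the chunk_size=2400 example
default_bytes = pcm_chunk_bytes(default_chunk)    # bytes sent per default PCM chunk
```

Smaller chunks lower the per-chunk latency on the wire, at the cost of more send calls.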
 ## API Reference
+### Chiluka.from_pretrained()
 ```python
 tts = Chiluka.from_pretrained(
+    model="hindi_english",      # "hindi_english" or "telugu"
+    device="cuda",              # "cuda" or "cpu" (auto-detects if None)
+    force_download=False,       # Re-download even if cached
 )
 ```
 wav = tts.synthesize(
     text="Hello world",           # Text to synthesize
     reference_audio="ref.wav",    # Reference audio for style
+    language="en-us",             # Language code
     alpha=0.3,                    # Acoustic style mixing (0-1)
     beta=0.7,                     # Prosodic style mixing (0-1)
     diffusion_steps=5,            # Quality vs speed tradeoff
 )
 ```
+### to_audio_bytes()
 ```python
+audio_bytes = tts.to_audio_bytes(
+    wav,                          # Numpy array from synthesize()
+    format="mp3",                 # "wav", "mp3", "ogg", "flac", "pcm"
+    sr=24000,                     # Sample rate
+    bitrate="128k"                # Bitrate for mp3/ogg
+)
+```
+### synthesize_stream()
+```python
+for chunk in tts.synthesize_stream(
+    text="Hello world",           # Text to synthesize
+    reference_audio="ref.wav",    # Reference audio for style
+    language="en-us",             # Language code
+    format="pcm",                 # "pcm", "wav", "mp3", "ogg"
+    chunk_size=4800,              # Samples per chunk (200ms at 24kHz)
+    sr=24000,                     # Sample rate
+):
+    process(chunk)
+```
+### Other Methods
+```python
+tts.save_wav(wav, "output.wav")                 # Save to WAV file
+tts.play(wav)                                   # Play via speakers (requires pyaudio)
+style = tts.compute_style("reference.wav")      # Get style embedding
 ```
 ## Synthesis Parameters
 | `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
 | `embedding_scale` | 1.0 | Classifier-free guidance scale |
+## Language Codes
+These are espeak-ng language codes passed to the `language` parameter:
 | Language | Code | Available In |
 |----------|------|-------------|
 | English (UK) | `en-gb` | All models |
 | Hindi | `hi` | `hindi_english` |
 | Telugu | `te` | `telugu` |
 ## Requirements
 - Python >= 3.8
 - PyTorch >= 1.13.0
+- CUDA (recommended)
 - espeak-ng
+- pydub + ffmpeg (only for MP3/OGG/FLAC output)
 ## Credits

README_HF.md DELETED

@@ -1,92 +0,0 @@
----
-language:
-  - en
-  - te
-  - hi
-license: mit
-library_name: chiluka
-tags:
-  - text-to-speech
-  - tts
-  - styletts2
-  - voice-cloning
----
-# Chiluka TTS
-Chiluka (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on StyleTTS2.
-## Installation
-```bash
-pip install chiluka
-```
-Or install from source:
-```bash
-pip install git+https://github.com/Seemanth/chiluka.git
-```
-## Usage
-### Quick Start (Auto-download)
-```python
-from chiluka import Chiluka
-# Automatically downloads model weights
-tts = Chiluka.from_pretrained()
-# Generate speech
-wav = tts.synthesize(
-    text="Hello, world!",
-    reference_audio="path/to/reference.wav",
-    language="en"
-)
-# Save output
-tts.save_wav(wav, "output.wav")
-```
-### PyTorch Hub
-```python
-import torch
-tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-wav = tts.synthesize("Hello!", "reference.wav", language="en")
-```
-### HuggingFace Hub
-```python
-from chiluka import Chiluka
-tts = Chiluka.from_pretrained("Seemanth/chiluka")
-```
-## Parameters
-- `text`: Input text to synthesize
-- `reference_audio`: Path to reference audio for style transfer
-- `language`: Language code ('en', 'te', 'hi', etc.)
-- `alpha`: Acoustic style mixing (0-1, default 0.3)
-- `beta`: Prosodic style mixing (0-1, default 0.7)
-- `diffusion_steps`: Quality vs speed tradeoff (default 5)
-## Supported Languages
-Uses espeak-ng phonemizer. Common languages:
-- English: `en-us`, `en-gb`
-- Telugu: `te`
-- Hindi: `hi`
-- Tamil: `ta`
-## License
-MIT License
-## Citation
-Based on StyleTTS2 by Yinghao Aaron Li et al.

chiluka/__init__.py CHANGED

@@ -17,7 +17,7 @@ Usage:
     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="reference.wav",
-        language="en"
     )
     tts.save_wav(wav, "output.wav")
 """

     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="reference.wav",
+        language="en-us"
     )
     tts.save_wav(wav, "output.wav")
 """

chiluka/hub.py CHANGED

@@ -318,7 +318,7 @@ tts = Chiluka.from_pretrained()
 wav = tts.synthesize(
     text="Hello, world!",
     reference_audio="reference.wav",
-    language="en"
 )
 tts.save_wav(wav, "output.wav")
 ```

 wav = tts.synthesize(
     text="Hello, world!",
     reference_audio="reference.wav",
+    language="en-us"
 )
 tts.save_wav(wav, "output.wav")
 ```

chiluka/inference.py CHANGED

@@ -11,13 +11,14 @@ Example usage:
     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="path/to/reference.wav",
-        language="en"
     )
     # Save to file
     tts.save_wav(wav, "output.wav")
 """
 import os
 import yaml
 import torch
@@ -25,7 +26,7 @@ import torchaudio
 import librosa
 import numpy as np
 from pathlib import Path
-from typing import Optional, Union
 from nltk.tokenize import word_tokenize
@@ -291,7 +292,7 @@ class Chiluka:
         self,
         text: str,
         reference_audio: str,
-        language: str = "en",
         alpha: float = 0.3,
         beta: float = 0.7,
         diffusion_steps: int = 5,
@@ -304,7 +305,7 @@ class Chiluka:
         Args:
             text: Input text to synthesize
             reference_audio: Path to reference audio for style transfer
-            language: Language code for phonemization (e.g., 'en', 'te', 'hi')
             alpha: Style mixing coefficient for acoustic features (0-1)
             beta: Style mixing coefficient for prosodic features (0-1)
             diffusion_steps: Number of diffusion sampling steps
@@ -432,3 +433,129 @@ class Chiluka:
         stream.stop_stream()
         stream.close()
         p.terminate()

     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="path/to/reference.wav",
+        language="en-us"
     )
     # Save to file
     tts.save_wav(wav, "output.wav")
 """
+import io
 import os
 import yaml
 import torch
 import librosa
 import numpy as np
 from pathlib import Path
+from typing import Optional, Union, Generator
 from nltk.tokenize import word_tokenize
         self,
         text: str,
         reference_audio: str,
+        language: str = "en-us",
         alpha: float = 0.3,
         beta: float = 0.7,
         diffusion_steps: int = 5,
         Args:
             text: Input text to synthesize
             reference_audio: Path to reference audio for style transfer
+            language: espeak-ng language code (e.g., 'en-us', 'hi', 'te')
             alpha: Style mixing coefficient for acoustic features (0-1)
             beta: Style mixing coefficient for prosodic features (0-1)
             diffusion_steps: Number of diffusion sampling steps
         stream.stop_stream()
         stream.close()
         p.terminate()
+    def to_audio_bytes(
+        self,
+        wav: np.ndarray,
+        format: str = "wav",
+        sr: int = 24000,
+        bitrate: str = "128k",
+    ) -> bytes:
+        """
+        Convert waveform to audio bytes in the specified format.
+        Useful for sending audio over HTTP, WebSocket, or WebRTC without
+        writing to disk.
+        Args:
+            wav: Audio waveform as numpy array (from synthesize())
+            format: Output format - "wav", "mp3", "ogg", "flac", "pcm"
+            sr: Sample rate
+            bitrate: Bitrate for compressed formats (mp3, ogg)
+        Returns:
+            Audio data as bytes
+        Examples:
+            >>> wav = tts.synthesize("Hello!", "ref.wav", language="en-us")
+            >>> # WAV bytes
+            >>> wav_bytes = tts.to_audio_bytes(wav, format="wav")
+            >>> # MP3 bytes (requires pydub + ffmpeg)
+            >>> mp3_bytes = tts.to_audio_bytes(wav, format="mp3")
+            >>> # Raw PCM bytes (16-bit signed int, for WebRTC)
+            >>> pcm_bytes = tts.to_audio_bytes(wav, format="pcm")
+        """
+        wav_int16 = (wav * 32767).clip(-32768, 32767).astype(np.int16)
+        if format == "pcm":
+            return wav_int16.tobytes()
+        if format == "wav":
+            buf = io.BytesIO()
+            import scipy.io.wavfile as wavfile
+            wavfile.write(buf, sr, wav_int16)
+            return buf.getvalue()
+        # mp3, ogg, flac - use pydub
+        try:
+            from pydub import AudioSegment
+        except ImportError:
+            raise ImportError(
+                f"pydub is required for '{format}' format. "
+                "Install with: pip install pydub\n"
+                "Also requires ffmpeg: sudo apt-get install ffmpeg"
+            )
+        segment = AudioSegment(
+            data=wav_int16.tobytes(),
+            sample_width=2,
+            frame_rate=sr,
+            channels=1,
+        )
+        buf = io.BytesIO()
+        segment.export(buf, format=format, bitrate=bitrate)
+        return buf.getvalue()
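For comparison, the `pcm` and `wav` branches above can be approximated with the standard library alone. The sketch below is not the PR's implementation: it swaps `scipy.io.wavfile` for the stdlib `wave` module and takes a plain sequence of floats instead of a NumPy array. `floats_to_pcm16` and `floats_to_wav_bytes` are hypothetical names.

```python
import io
import struct
import wave
from typing import Sequence


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Scale floats in [-1, 1] to little-endian signed 16-bit PCM, clipping overflow."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def floats_to_wav_bytes(samples: Sequence[float], sr: int = 24000) -> bytes:
    """Encode mono 16-bit WAV into memory, with no disk write."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono, as in the pydub branch above
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wf.setframerate(sr)
        wf.writeframes(floats_to_pcm16(samples))
    return buf.getvalue()


wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.0])
```

The scaling mirrors the `(wav * 32767).clip(...)` line in the diff: values outside [-1, 1] are clipped rather than wrapped.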
+    def synthesize_stream(
+        self,
+        text: str,
+        reference_audio: str,
+        language: str = "en-us",
+        format: str = "pcm",
+        chunk_size: int = 4800,
+        sr: int = 24000,
+        bitrate: str = "128k",
+        **synth_kwargs,
+    ) -> Generator[bytes, None, None]:
+        """
+        Synthesize speech and yield audio chunks for streaming.
+        Generates the full audio then yields it in chunks suitable for
+        real-time streaming over WebRTC, WebSocket, or HTTP chunked transfer.
+        Args:
+            text: Input text to synthesize
+            reference_audio: Path to reference audio for style transfer
+            language: Language code (e.g., "en-us", "hi", "te")
+            format: Output format per chunk - "pcm", "wav", "mp3", "ogg"
+            chunk_size: Number of samples per chunk (default 4800 = 200ms at 24kHz)
+            sr: Sample rate
+            bitrate: Bitrate for compressed formats
+            **synth_kwargs: Additional args passed to synthesize()
+                (alpha, beta, diffusion_steps, embedding_scale)
+        Yields:
+            Audio data chunks as bytes
+        Examples:
+            >>> # Stream PCM chunks over WebSocket
+            >>> for chunk in tts.synthesize_stream("Hello!", "ref.wav"):
+            ...     websocket.send(chunk)
+            >>> # Stream MP3 chunks
+            >>> for chunk in tts.synthesize_stream("Hello!", "ref.wav", format="mp3"):
+            ...     response.write(chunk)
+        """
+        wav = self.synthesize(
+            text=text,
+            reference_audio=reference_audio,
+            language=language,
+            sr=sr,
+            **synth_kwargs,
+        )
+        wav_int16 = (wav * 32767).clip(-32768, 32767).astype(np.int16)
+        if format == "pcm":
+            for i in range(0, len(wav_int16), chunk_size):
+                yield wav_int16[i:i + chunk_size].tobytes()
+            return
+        # For all non-PCM formats (WAV included), encode the full audio, then slice the encoded bytes
+        audio_bytes = self.to_audio_bytes(wav, format=format, sr=sr, bitrate=bitrate)
+        byte_chunk_size = chunk_size * 4  # heuristic byte count per chunk; not aligned to chunk_size samples
+        for i in range(0, len(audio_bytes), byte_chunk_size):
+            yield audio_bytes[i:i + byte_chunk_size]
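Because the non-PCM branch encodes the whole utterance and then slices the resulting bytes, each yielded chunk is a fragment of one encoded file rather than a standalone audio file. A receiver therefore needs to concatenate the chunks in order before decoding. The sketch below illustrates that contract with a stand-in payload; `iter_byte_chunks` is a hypothetical helper that mirrors the slicing loop, not a function from this PR.

```python
from typing import Iterator, List


def iter_byte_chunks(data: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Yield consecutive slices of data, each at most chunk_bytes long."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(data), chunk_bytes):
        yield data[start:start + chunk_bytes]


encoded = bytes(range(256)) * 10  # stand-in for an encoded MP3/OGG/WAV payload (2560 bytes)

# Sender side: slice the payload. Receiver side: collect the chunks in arrival order.
received: List[bytes] = list(iter_byte_chunks(encoded, chunk_bytes=1000))
reassembled = b"".join(received)
```

Since the full audio is synthesized before the first chunk is yielded, chunking here shapes delivery over the transport rather than reducing time to first audio.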

examples/basic_synthesis.py CHANGED

@@ -20,7 +20,7 @@ def main():
     parser = argparse.ArgumentParser(description="Chiluka TTS Synthesis")
     parser.add_argument("--reference", "-r", required=True, help="Path to reference audio file")
     parser.add_argument("--text", "-t", default="Hello, this is Chiluka speaking!", help="Text to synthesize")
-    parser.add_argument("--language", "-l", default="en", help="Language code (en, te, hi, etc.)")
     parser.add_argument("--output", "-o", default="output.wav", help="Output WAV file path")
     parser.add_argument("--alpha", type=float, default=0.3, help="Acoustic style mixing (0-1)")
     parser.add_argument("--beta", type=float, default=0.7, help="Prosodic style mixing (0-1)")

     parser = argparse.ArgumentParser(description="Chiluka TTS Synthesis")
     parser.add_argument("--reference", "-r", required=True, help="Path to reference audio file")
     parser.add_argument("--text", "-t", default="Hello, this is Chiluka speaking!", help="Text to synthesize")
+    parser.add_argument("--language", "-l", default="en-us", help="Language code (en-us, te, hi, etc.)")
     parser.add_argument("--output", "-o", default="output.wav", help="Output WAV file path")
     parser.add_argument("--alpha", type=float, default=0.3, help="Acoustic style mixing (0-1)")
     parser.add_argument("--beta", type=float, default=0.7, help="Prosodic style mixing (0-1)")

examples/huggingface_example.py CHANGED

@@ -23,7 +23,7 @@ def main():
     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant to use (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
     parser.add_argument("--output", type=str, default="output_hf.wav", help="Output wav file path")
     parser.add_argument("--device", type=str, default=None, help="Device: cuda or cpu")
     args = parser.parse_args()
@@ -46,7 +46,7 @@ def main():
         if args.model == "telugu":
             args.language = "te"
         else:
-            args.language = "en"
     # Load model from HuggingFace Hub (auto-downloads on first use)
     print(f"Loading '{args.model}' model from HuggingFace Hub...")

     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant to use (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_hf.wav", help="Output wav file path")
     parser.add_argument("--device", type=str, default=None, help="Device: cuda or cpu")
     args = parser.parse_args()
         if args.model == "telugu":
             args.language = "te"
         else:
+            args.language = "en-us"
     # Load model from HuggingFace Hub (auto-downloads on first use)
     print(f"Loading '{args.model}' model from HuggingFace Hub...")

examples/pip_example.py CHANGED

@@ -22,7 +22,7 @@ def main():
     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
     parser.add_argument("--output", type=str, default="output_pip.wav", help="Output wav file path")
     args = parser.parse_args()
@@ -38,7 +38,7 @@ def main():
         args.text = texts[args.model]
     if args.language is None:
-        langs = {"hindi_english": "en", "telugu": "te"}
         args.language = langs[args.model]
     # List models

     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_pip.wav", help="Output wav file path")
     args = parser.parse_args()
         args.text = texts[args.model]
     if args.language is None:
+        langs = {"hindi_english": "en-us", "telugu": "te"}
         args.language = langs[args.model]
     # List models

examples/torchhub_example.py CHANGED

@@ -23,7 +23,7 @@ def main():
     parser.add_argument("--variant", type=str, default="default", choices=["default", "telugu", "hindi_english"],
                         help="Model variant (default, telugu, hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
     parser.add_argument("--output", type=str, default="output_torchhub.wav", help="Output wav file path")
     args = parser.parse_args()
@@ -38,7 +38,7 @@ def main():
         if args.variant == "telugu":
             args.language = "te"
         else:
-            args.language = "en"
     # Load via torch.hub
     # Available entry points:

     parser.add_argument("--variant", type=str, default="default", choices=["default", "telugu", "hindi_english"],
                         help="Model variant (default, telugu, hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_torchhub.wav", help="Output wav file path")
     args = parser.parse_args()
         if args.variant == "telugu":
             args.language = "te"
         else:
+            args.language = "en-us"
     # Load via torch.hub
     # Available entry points:

hubconf.py CHANGED

@@ -14,7 +14,7 @@ Usage:
     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="path/to/reference.wav",
-        language="en"
     )
 """
@@ -50,7 +50,7 @@ def chiluka(pretrained: bool = True, device: str = None, **kwargs):
     Example:
         >>> import torch
         >>> tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-        >>> wav = tts.synthesize("Hello!", "reference.wav", language="en")
     """
     from chiluka import Chiluka

     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="path/to/reference.wav",
+        language="en-us"
     )
 """
     Example:
         >>> import torch
         >>> tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
+        >>> wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
     """
     from chiluka import Chiluka