seemanthraju committed on
Commit
393129e
·
1 Parent(s): 7189a0b

Added streaming function

.gitignore CHANGED
@@ -80,3 +80,6 @@ test_outputs/
  # Large checkpoint files (hosted on Hugging Face: https://huggingface.co/Seemanth/chiluka)
  chiluka/checkpoints/epoch_2nd_00017.pth
  chiluka/checkpoints/epoch_2nd_00029.pth
+
+ # Deploy commands (local only)
+ DEPLOY.md
MODEL_CARD.md CHANGED
@@ -21,75 +21,49 @@ tags:
 
  # Chiluka TTS
 
- **Chiluka** (చిలుక - Telugu for "parrot") is a lightweight, self-contained Text-to-Speech inference package based on [StyleTTS2](https://github.com/yl4579/StyleTTS2).
-
- It supports **style transfer from reference audio** - give it a voice sample and it will speak in that style.
+ **Chiluka** (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on [StyleTTS2](https://github.com/yl4579/StyleTTS2) with style transfer from reference audio.
 
  ## Available Models
 
- | Model | Name | Languages | Speakers | Description |
- |-------|------|-----------|----------|-------------|
- | **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
- | **Telugu** | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
+ | Model | Name | Languages | Speakers |
+ |-------|------|-----------|----------|
+ | **Hindi-English** (default) | `hindi_english` | Hindi, English | 5 |
+ | **Telugu** | `telugu` | Telugu, English | 1 |
 
  ## Installation
 
- ```bash
- pip install chiluka
- ```
-
- Or from GitHub:
-
  ```bash
  pip install git+https://github.com/PurviewVoiceBot/chiluka.git
- ```
 
- **System dependency** (required for phonemization):
-
- ```bash
- # Ubuntu/Debian
- sudo apt-get install espeak-ng
-
- # macOS
- brew install espeak-ng
+ # Required system dependency
+ sudo apt-get install espeak-ng  # Ubuntu/Debian
  ```
 
- ## Quick Start
+ ## Usage
+
+ Model weights download automatically on first use.
 
  ```python
  from chiluka import Chiluka
 
- # Load model (weights download automatically on first use)
+ # Load Hindi-English model (default)
  tts = Chiluka.from_pretrained()
 
- # Synthesize speech
+ # Or Telugu model
+ # tts = Chiluka.from_pretrained(model="telugu")
+
  wav = tts.synthesize(
      text="Hello, this is Chiluka speaking!",
      reference_audio="path/to/reference.wav",
-     language="en"
+     language="en-us"
  )
-
- # Save output
  tts.save_wav(wav, "output.wav")
  ```
 
- ## Choose a Model
-
- ```python
- from chiluka import Chiluka
-
- # Hindi + English (default)
- tts = Chiluka.from_pretrained(model="hindi_english")
-
- # Telugu + English
- tts = Chiluka.from_pretrained(model="telugu")
- ```
-
- ## Hindi Example
+ ### Hindi
 
  ```python
  tts = Chiluka.from_pretrained()
-
  wav = tts.synthesize(
      text="नमस्ते, मैं चिलुका बोल रहा हूं",
      reference_audio="reference.wav",
@@ -98,11 +72,10 @@ wav = tts.synthesize(
  tts.save_wav(wav, "hindi_output.wav")
  ```
 
- ## Telugu Example
+ ### Telugu
 
  ```python
  tts = Chiluka.from_pretrained(model="telugu")
-
  wav = tts.synthesize(
      text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
      reference_audio="reference.wav",
@@ -111,44 +84,49 @@ wav = tts.synthesize(
  tts.save_wav(wav, "telugu_output.wav")
  ```
 
- ## PyTorch Hub
+ ## Streaming Audio
+
+ For WebRTC, WebSocket, or HTTP streaming:
 
  ```python
- import torch
+ wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
 
- # Hindi-English (default)
- tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
+ # Get audio as bytes (no disk write)
+ mp3_bytes = tts.to_audio_bytes(wav, format="mp3")  # requires pydub + ffmpeg
+ wav_bytes = tts.to_audio_bytes(wav, format="wav")
+ pcm_bytes = tts.to_audio_bytes(wav, format="pcm")  # raw 16-bit PCM
 
- # Telugu
- tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')
+ # Stream chunked audio
+ for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
+     websocket.send(chunk)  # PCM chunks by default
 
- wav = tts.synthesize("Hello!", "reference.wav", language="en")
+ # Stream as MP3 chunks
+ for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
+     response.write(chunk)
  ```
 
- ## Synthesis Parameters
+ ## Parameters
 
  | Parameter | Default | Description |
  |-----------|---------|-------------|
  | `text` | required | Input text to synthesize |
  | `reference_audio` | required | Path to reference audio for voice style |
- | `language` | `"en"` | Language code (`en`, `hi`, `te`, etc.) |
- | `alpha` | `0.3` | Acoustic style mixing (0 = reference voice, 1 = predicted) |
- | `beta` | `0.7` | Prosodic style mixing (0 = reference prosody, 1 = predicted) |
- | `diffusion_steps` | `5` | More steps = better quality, slower inference |
+ | `language` | `"en-us"` | espeak-ng language code (see below) |
+ | `alpha` | `0.3` | Acoustic style mixing (0 = reference, 1 = predicted) |
+ | `beta` | `0.7` | Prosodic style mixing (0 = reference, 1 = predicted) |
+ | `diffusion_steps` | `5` | More steps = better quality, slower |
  | `embedding_scale` | `1.0` | Classifier-free guidance strength |
 
- ## How It Works
+ ## Language Codes
 
- Chiluka uses a StyleTTS2-based pipeline:
+ | Language | Code | Available In |
+ |----------|------|-------------|
+ | English (US) | `en-us` | All models |
+ | English (UK) | `en-gb` | All models |
+ | Hindi | `hi` | `hindi_english` |
+ | Telugu | `te` | `telugu` |
 
- 1. **Text** is converted to phonemes using espeak-ng
- 2. **PL-BERT** encodes text into contextual embeddings
- 3. **Reference audio** is processed to extract a style vector
- 4. **Diffusion model** samples a style conditioned on text
- 5. **Prosody predictor** generates duration, pitch (F0), and energy
- 6. **HiFi-GAN decoder** synthesizes the final waveform at 24kHz
-
- ## Model Architecture
+ ## Architecture
 
  - **Text Encoder**: Token embedding + CNN + BiLSTM
  - **Style Encoder**: Conv2D + Residual blocks (style_dim=128)
@@ -157,42 +135,13 @@ Chiluka uses a StyleTTS2-based pipeline:
  - **Decoder**: HiFi-GAN vocoder (upsample rates: 10, 5, 3, 2)
  - **Pretrained sub-models**: PL-BERT (text), ASR (alignment), JDC (pitch)
 
- ## File Structure
-
- ```
- ├── configs/
- │   ├── config_ft.yml             # Telugu model config
- │   └── config_hindi_english.yml  # Hindi-English model config
- ├── checkpoints/
- │   ├── epoch_2nd_00017.pth       # Telugu checkpoint (~2GB)
- │   └── epoch_2nd_00029.pth       # Hindi-English checkpoint (~2GB)
- ├── pretrained/                   # Shared pretrained sub-models
- │   ├── ASR/                      # Text-to-mel alignment
- │   ├── JDC/                      # Pitch extraction (F0)
- │   └── PLBERT/                   # Text encoder
- ├── models/                       # Model architecture code
- │   ├── core.py
- │   ├── hifigan.py
- │   └── diffusion/
- ├── inference.py                  # Main API
- ├── hub.py                        # HuggingFace Hub utilities
- └── text_utils.py                 # Phoneme tokenization
- ```
-
  ## Requirements
 
  - Python >= 3.8
  - PyTorch >= 1.13.0
- - CUDA recommended (works on CPU too)
- - espeak-ng system package
-
- ## Limitations
-
- - Requires a reference audio file for style/voice transfer
- - Quality depends on the reference audio quality
- - Best results with 3-15 second reference clips
- - Hindi-English model trained on 5 speakers
- - Telugu model trained on 1 speaker
+ - CUDA recommended
+ - espeak-ng
+ - pydub + ffmpeg (only for MP3/OGG streaming)
 
  ## Citation
 
@@ -214,4 +163,4 @@ MIT License
  ## Links
 
  - **GitHub**: [PurviewVoiceBot/chiluka](https://github.com/PurviewVoiceBot/chiluka)
- - **PyPI**: [chiluka](https://pypi.org/project/chiluka/)
+ - **HuggingFace**: [Seemanth/chiluka](https://huggingface.co/Seemanth/chiluka)
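The streaming section added to the model card above yields raw 16-bit PCM in fixed-size chunks (4800 samples = 200 ms at 24 kHz). As a rough illustration of that framing, here is a stdlib-only sketch with illustrative names (`float_to_pcm16`, `chunk_pcm` are not part of the chiluka package):

```python
# Sketch of the float-to-PCM conversion and chunking behind the streaming API.
# A plain list of floats stands in for the numpy waveform synthesize() returns.
import struct

def float_to_pcm16(samples):
    """Scale float samples in [-1, 1] to 16-bit signed little-endian PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))

def chunk_pcm(pcm, chunk_size=4800):
    """Yield chunks of chunk_size samples (2 bytes per sample at 16-bit)."""
    step = chunk_size * 2  # bytes per chunk
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]

# 24000 samples = 1 second at 24 kHz -> five 200 ms chunks of 4800 samples each
pcm = float_to_pcm16([0.0] * 24000)
chunks = list(chunk_pcm(pcm))
print(len(chunks), len(chunks[0]))  # 5 9600
```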
README.md CHANGED
@@ -1,14 +1,6 @@
  # Chiluka
 
- **Chiluka** (చిలుక - Telugu for "parrot") is a self-contained TTS (Text-to-Speech) inference package based on StyleTTS2.
-
- ## Features
-
- - Simple, clean API for TTS synthesis
- - Style transfer from reference audio
- - Multi-language support via phonemizer
- - **Multiple models** - Hindi-English and Telugu
- - **Multiple ways to load** - HuggingFace Hub, PyTorch Hub, pip install
+ **Chiluka** (చిలుక - Telugu for "parrot") is a lightweight TTS (Text-to-Speech) inference package based on StyleTTS2 with style transfer from reference audio.
 
  ## Available Models
 
@@ -17,29 +9,15 @@
  | Hindi-English (default) | `hindi_english` | Hindi, English | 5 | Multi-speaker Hindi + English TTS |
  | Telugu | `telugu` | Telugu, English | 1 | Single-speaker Telugu + English TTS |
 
- ## Installation
-
- ### Option 1: pip install
+ Model weights are hosted on [HuggingFace](https://huggingface.co/Seemanth/chiluka) and downloaded automatically on first use.
 
- ```bash
- pip install chiluka
- ```
-
- ### Option 2: Install from GitHub
+ ## Installation
 
  ```bash
  pip install git+https://github.com/PurviewVoiceBot/chiluka.git
  ```
 
- ### Option 3: From Source
-
- ```bash
- git clone https://github.com/PurviewVoiceBot/chiluka.git
- cd chiluka
- pip install -e .
- ```
-
- ### System Dependency: espeak-ng (Required)
+ System dependency (required):
 
  ```bash
  # Ubuntu/Debian
@@ -51,10 +29,6 @@ brew install espeak-ng
 
  ## Quick Start
 
- ### HuggingFace Hub (Recommended)
-
- Model weights download automatically on first use. No cloning needed.
-
  ```python
  from chiluka import Chiluka
 
@@ -65,7 +39,7 @@ tts = Chiluka.from_pretrained()
  wav = tts.synthesize(
      text="Hello, this is Chiluka speaking!",
      reference_audio="path/to/reference.wav",
-     language="en"
+     language="en-us"
  )
 
  # Save to file
@@ -75,8 +49,6 @@ tts.save_wav(wav, "output.wav")
  ### Load a Specific Model
 
  ```python
- from chiluka import Chiluka
-
  # Hindi-English (default)
  tts = Chiluka.from_pretrained(model="hindi_english")
 
@@ -84,111 +56,92 @@ tts = Chiluka.from_pretrained(model="hindi_english")
  tts = Chiluka.from_pretrained(model="telugu")
  ```
 
- ### PyTorch Hub
-
- ```python
- import torch
-
- # Hindi-English (default)
- tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-
- # Telugu
- tts = torch.hub.load('Seemanth/chiluka', 'chiluka_telugu')
-
- # Synthesize
- wav = tts.synthesize(
-     text="Hello from PyTorch Hub!",
-     reference_audio="reference.wav",
-     language="en"
- )
- ```
-
- ### Local Weights (if you cloned with Git LFS)
-
- ```python
- from chiluka import Chiluka
-
- tts = Chiluka()  # uses bundled weights from cloned repo
- ```
-
  ## Examples
 
- ### Hindi Synthesis
+ ### Hindi
 
  ```python
- from chiluka import Chiluka
-
- tts = Chiluka.from_pretrained(model="hindi_english")
+ tts = Chiluka.from_pretrained()
 
  wav = tts.synthesize(
      text="नमस्ते, मैं चिलुका बोल रहा हूं",
-     reference_audio="hindi_reference.wav",
+     reference_audio="reference.wav",
      language="hi"
  )
  tts.save_wav(wav, "hindi_output.wav")
  ```
 
- ### English Synthesis
+ ### English
 
  ```python
  wav = tts.synthesize(
      text="Hello, I am Chiluka, a text to speech system.",
-     reference_audio="english_reference.wav",
-     language="en"
+     reference_audio="reference.wav",
+     language="en-us"
  )
  tts.save_wav(wav, "english_output.wav")
  ```
 
- ### Telugu Synthesis
+ ### Telugu
 
  ```python
- from chiluka import Chiluka
-
  tts = Chiluka.from_pretrained(model="telugu")
 
  wav = tts.synthesize(
      text="నమస్కారం, నేను చిలుక మాట్లాడుతున్నాను",
-     reference_audio="telugu_reference.wav",
+     reference_audio="reference.wav",
      language="te"
  )
  tts.save_wav(wav, "telugu_output.wav")
  ```
 
- ### List Available Models
+ ## Streaming Audio
+
+ For real-time applications (WebRTC, WebSocket, HTTP streaming), Chiluka can generate audio as bytes or chunked streams without writing to disk.
+
+ ### Get Audio Bytes
 
  ```python
- from chiluka import list_models
+ wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
+
+ # WAV bytes
+ wav_bytes = tts.to_audio_bytes(wav, format="wav")
+
+ # MP3 bytes (requires: pip install pydub, and ffmpeg installed)
+ mp3_bytes = tts.to_audio_bytes(wav, format="mp3")
+
+ # Raw PCM bytes (16-bit signed int, for WebRTC)
+ pcm_bytes = tts.to_audio_bytes(wav, format="pcm")
+
+ # OGG bytes
+ ogg_bytes = tts.to_audio_bytes(wav, format="ogg")
+ ```
+
+ ### Stream Audio Chunks
+
+ ```python
+ # Stream PCM chunks over WebSocket
+ for chunk in tts.synthesize_stream("Hello!", "reference.wav", language="en-us"):
+     websocket.send(chunk)
+
+ # Stream MP3 chunks for HTTP response
+ for chunk in tts.synthesize_stream("Hello!", "reference.wav", format="mp3"):
+     response.write(chunk)
 
- models = list_models()
- for name, info in models.items():
-     print(f"{name}: {info['description']} ({', '.join(info['languages'])})")
+ # Custom chunk size (default 4800 samples = 200ms at 24kHz)
+ for chunk in tts.synthesize_stream("Hello!", "reference.wav", chunk_size=2400):
+     process(chunk)
  ```
 
  ## API Reference
 
- ### Loading the Model
+ ### Chiluka.from_pretrained()
 
  ```python
- # Auto-download from HuggingFace (recommended)
- tts = Chiluka.from_pretrained()                       # Hindi-English (default)
- tts = Chiluka.from_pretrained(model="telugu")         # Telugu
- tts = Chiluka.from_pretrained(model="hindi_english")  # Hindi-English (explicit)
-
- # With options
  tts = Chiluka.from_pretrained(
-     model="hindi_english",       # Model variant
-     repo_id="Seemanth/chiluka",  # HuggingFace repo
-     device="cuda",               # or "cpu"
-     force_download=False,        # Re-download even if cached
-     token="hf_xxx"               # For private repos
- )
-
- # Local weights
- tts = Chiluka(
-     config_path="path/to/config.yml",
-     checkpoint_path="path/to/model.pth",
-     pretrained_dir="path/to/pretrained/",
-     device="cuda"
+     model="hindi_english",   # "hindi_english" or "telugu"
+     device="cuda",           # "cuda" or "cpu" (auto-detects if None)
+     force_download=False,    # Re-download even if cached
  )
  ```
 
@@ -198,7 +151,7 @@ tts = Chiluka(
  wav = tts.synthesize(
      text="Hello world",          # Text to synthesize
      reference_audio="ref.wav",   # Reference audio for style
-     language="en",               # Language code
+     language="en-us",            # Language code
      alpha=0.3,                   # Acoustic style mixing (0-1)
      beta=0.7,                    # Prosodic style mixing (0-1)
      diffusion_steps=5,           # Quality vs speed tradeoff
@@ -207,17 +160,37 @@ wav = tts.synthesize(
  )
  ```
 
- ### Other Methods
+ ### to_audio_bytes()
 
  ```python
- # Save audio to file
- tts.save_wav(wav, "output.wav", sr=24000)
+ audio_bytes = tts.to_audio_bytes(
+     wav,             # Numpy array from synthesize()
+     format="mp3",    # "wav", "mp3", "ogg", "flac", "pcm"
+     sr=24000,        # Sample rate
+     bitrate="128k"   # Bitrate for mp3/ogg
+ )
+ ```
 
- # Play audio (requires pyaudio)
- tts.play(wav, sr=24000)
+ ### synthesize_stream()
 
- # Get style embedding from audio
- style = tts.compute_style("reference.wav", sr=24000)
+ ```python
+ for chunk in tts.synthesize_stream(
+     text="Hello world",          # Text to synthesize
+     reference_audio="ref.wav",   # Reference audio for style
+     language="en-us",            # Language code
+     format="pcm",                # "pcm", "wav", "mp3", "ogg"
+     chunk_size=4800,             # Samples per chunk (200ms at 24kHz)
+     sr=24000,                    # Sample rate
+ ):
+     process(chunk)
+ ```
+
+ ### Other Methods
+
+ ```python
+ tts.save_wav(wav, "output.wav")             # Save to WAV file
+ tts.play(wav)                               # Play via speakers (requires pyaudio)
+ style = tts.compute_style("reference.wav")  # Get style embedding
  ```
 
  ## Synthesis Parameters
@@ -229,9 +202,9 @@ style = tts.compute_style("reference.wav", sr=24000)
  | `diffusion_steps` | 5 | Diffusion sampling steps (more = better quality, slower) |
  | `embedding_scale` | 1.0 | Classifier-free guidance scale |
 
- ## Supported Languages
+ ## Language Codes
 
- Uses [phonemizer](https://github.com/bootphon/phonemizer) with espeak-ng:
+ These are espeak-ng language codes passed to the `language` parameter:
 
  | Language | Code | Available In |
  |----------|------|-------------|
@@ -239,86 +212,14 @@ Uses [phonemizer](https://github.com/bootphon/phonemizer) with espeak-ng:
  | English (UK) | `en-gb` | All models |
  | Hindi | `hi` | `hindi_english` |
  | Telugu | `te` | `telugu` |
- | Tamil | `ta` | With fine-tuning |
- | Kannada | `kn` | With fine-tuning |
-
- ## Hub Utilities
-
- ```python
- from chiluka import list_models, clear_cache, push_to_hub, get_cache_dir
-
- # List available models
- list_models()
-
- # Clear cache
- clear_cache()                    # Clear all
- clear_cache("Seemanth/chiluka")  # Clear specific repo
-
- # Push your own model to HuggingFace
- push_to_hub(
-     local_dir="./my-model",
-     repo_id="myusername/my-chiluka-model",
-     token="hf_your_token"
- )
-
- # Check cache location
- print(get_cache_dir())           # ~/.cache/chiluka
- ```
-
- ## Environment Variables
-
- | Variable | Description |
- |----------|-------------|
- | `CHILUKA_CACHE` | Custom cache directory (default: `~/.cache/chiluka`) |
- | `HF_TOKEN` | HuggingFace API token for private repos |
 
  ## Requirements
 
  - Python >= 3.8
  - PyTorch >= 1.13.0
- - CUDA (recommended for faster inference)
+ - CUDA (recommended)
  - espeak-ng
-
- ## Package Structure
-
- ```
- chiluka/
- ├── chiluka/
- │   ├── __init__.py
- │   ├── inference.py                  # Main Chiluka API
- │   ├── hub.py                        # Hub download + model registry
- │   ├── text_utils.py
- │   ├── utils.py
- │   ├── configs/
- │   │   ├── config_ft.yml             # Telugu model config
- │   │   └── config_hindi_english.yml  # Hindi-English model config
- │   ├── checkpoints/
- │   │   ├── epoch_2nd_00017.pth       # Telugu checkpoint
- │   │   └── epoch_2nd_00029.pth       # Hindi-English checkpoint
- │   ├── pretrained/                   # Shared pretrained sub-models
- │   │   ├── ASR/
- │   │   ├── JDC/
- │   │   └── PLBERT/
- │   └── models/
- ├── hubconf.py                        # PyTorch Hub config
- ├── examples/
- │   ├── basic_synthesis.py
- │   ├── telugu_synthesis.py
- │   ├── huggingface_example.py
- │   ├── torchhub_example.py
- │   └── pip_example.py
- ├── setup.py
- └── README.md
- ```
-
- ## Training Your Own Model
-
- This package is for **inference only**. To train your own model, use the original [StyleTTS2](https://github.com/yl4579/StyleTTS2) repository.
-
- After training:
- 1. Copy your checkpoint and config to a directory
- 2. Push to HuggingFace Hub using `push_to_hub()`
- 3. Load with `Chiluka.from_pretrained("your-repo")`
+ - pydub + ffmpeg (only for MP3/OGG streaming)
 
  ## Credits
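The README's streaming examples above emit 200 ms PCM chunks (`chunk_size=4800` at 24 kHz), while real-time transports such as WebRTC typically expect smaller fixed frames (often 20 ms). A small re-chunking adapter one could put in front of `synthesize_stream` — a stdlib-only sketch, not part of the chiluka API:

```python
def rechunk(chunks, frame_bytes):
    """Re-slice an iterable of byte chunks into fixed-size frames."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while len(buf) >= frame_bytes:
            yield buf[:frame_bytes]
            buf = buf[frame_bytes:]
    if buf:
        yield buf  # trailing partial frame

# 20 ms at 24 kHz, 16-bit mono: 480 samples * 2 bytes = 960 bytes per frame.
# Two 200 ms PCM chunks (9600 bytes each) split cleanly into twenty 20 ms frames.
frames = list(rechunk([b"\x00" * 9600] * 2, 960))
print(len(frames))  # 20
```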
README_HF.md DELETED
@@ -1,92 +0,0 @@
- ---
- language:
- - en
- - te
- - hi
- license: mit
- library_name: chiluka
- tags:
- - text-to-speech
- - tts
- - styletts2
- - voice-cloning
- ---
-
- # Chiluka TTS
-
- Chiluka (చిలుక - Telugu for "parrot") is a lightweight Text-to-Speech model based on StyleTTS2.
-
- ## Installation
-
- ```bash
- pip install chiluka
- ```
-
- Or install from source:
-
- ```bash
- pip install git+https://github.com/Seemanth/chiluka.git
- ```
-
- ## Usage
-
- ### Quick Start (Auto-download)
-
- ```python
- from chiluka import Chiluka
-
- # Automatically downloads model weights
- tts = Chiluka.from_pretrained()
-
- # Generate speech
- wav = tts.synthesize(
-     text="Hello, world!",
-     reference_audio="path/to/reference.wav",
-     language="en"
- )
-
- # Save output
- tts.save_wav(wav, "output.wav")
- ```
-
- ### PyTorch Hub
-
- ```python
- import torch
-
- tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
- wav = tts.synthesize("Hello!", "reference.wav", language="en")
- ```
-
- ### HuggingFace Hub
-
- ```python
- from chiluka import Chiluka
-
- tts = Chiluka.from_pretrained("Seemanth/chiluka")
- ```
-
- ## Parameters
-
- - `text`: Input text to synthesize
- - `reference_audio`: Path to reference audio for style transfer
- - `language`: Language code ('en', 'te', 'hi', etc.)
- - `alpha`: Acoustic style mixing (0-1, default 0.3)
- - `beta`: Prosodic style mixing (0-1, default 0.7)
- - `diffusion_steps`: Quality vs speed tradeoff (default 5)
-
- ## Supported Languages
-
- Uses espeak-ng phonemizer. Common languages:
- - English: `en-us`, `en-gb`
- - Telugu: `te`
- - Hindi: `hi`
- - Tamil: `ta`
-
- ## License
-
- MIT License
-
- ## Citation
-
- Based on StyleTTS2 by Yinghao Aaron Li et al.
chiluka/__init__.py CHANGED
@@ -17,7 +17,7 @@ Usage:
      wav = tts.synthesize(
          text="Hello, world!",
          reference_audio="reference.wav",
-         language="en"
+         language="en-us"
      )
      tts.save_wav(wav, "output.wav")
  """
chiluka/hub.py CHANGED
@@ -318,7 +318,7 @@ tts = Chiluka.from_pretrained()
  wav = tts.synthesize(
      text="Hello, world!",
      reference_audio="reference.wav",
-     language="en"
+     language="en-us"
  )
  tts.save_wav(wav, "output.wav")
  ```
chiluka/inference.py CHANGED
@@ -11,13 +11,14 @@ Example usage:
11
  wav = tts.synthesize(
12
  text="Hello, world!",
13
  reference_audio="path/to/reference.wav",
14
- language="en"
15
  )
16
 
17
  # Save to file
18
  tts.save_wav(wav, "output.wav")
19
  """
20
 
 
21
  import os
22
  import yaml
23
  import torch
@@ -25,7 +26,7 @@ import torchaudio
25
  import librosa
26
  import numpy as np
27
  from pathlib import Path
28
- from typing import Optional, Union
29
 
30
  from nltk.tokenize import word_tokenize
31
 
@@ -291,7 +292,7 @@ class Chiluka:
291
  self,
292
  text: str,
293
  reference_audio: str,
294
- language: str = "en",
295
  alpha: float = 0.3,
296
  beta: float = 0.7,
297
  diffusion_steps: int = 5,
@@ -304,7 +305,7 @@ class Chiluka:
304
  Args:
305
  text: Input text to synthesize
306
  reference_audio: Path to reference audio for style transfer
307
- language: Language code for phonemization (e.g., 'en', 'te', 'hi')
308
  alpha: Style mixing coefficient for acoustic features (0-1)
309
  beta: Style mixing coefficient for prosodic features (0-1)
310
  diffusion_steps: Number of diffusion sampling steps
@@ -432,3 +433,129 @@ class Chiluka:
432
  stream.stop_stream()
433
  stream.close()
434
  p.terminate()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  wav = tts.synthesize(
12
  text="Hello, world!",
13
  reference_audio="path/to/reference.wav",
14
+ language="en-us"
15
  )
16
 
17
  # Save to file
18
  tts.save_wav(wav, "output.wav")
19
  """
20
 
21
+ import io
22
  import os
23
  import yaml
24
  import torch
 
26
  import librosa
27
  import numpy as np
28
  from pathlib import Path
29
+ from typing import Optional, Union, Generator
30
 
31
  from nltk.tokenize import word_tokenize
32
 
 
292
  self,
293
  text: str,
294
  reference_audio: str,
295
+ language: str = "en-us",
296
  alpha: float = 0.3,
297
  beta: float = 0.7,
298
  diffusion_steps: int = 5,
 
305
  Args:
306
  text: Input text to synthesize
307
  reference_audio: Path to reference audio for style transfer
308
+ language: espeak-ng language code (e.g., 'en-us', 'hi', 'te')
309
  alpha: Style mixing coefficient for acoustic features (0-1)
310
  beta: Style mixing coefficient for prosodic features (0-1)
311
  diffusion_steps: Number of diffusion sampling steps
 
         stream.stop_stream()
         stream.close()
         p.terminate()
+
+    def to_audio_bytes(
+        self,
+        wav: np.ndarray,
+        format: str = "wav",
+        sr: int = 24000,
+        bitrate: str = "128k",
+    ) -> bytes:
+        """
+        Convert waveform to audio bytes in the specified format.
+
+        Useful for sending audio over HTTP, WebSocket, or WebRTC without
+        writing to disk.
+
+        Args:
+            wav: Audio waveform as numpy array (from synthesize())
+            format: Output format - "wav", "mp3", "ogg", "flac", "pcm"
+            sr: Sample rate
+            bitrate: Bitrate for compressed formats (mp3, ogg)
+
+        Returns:
+            Audio data as bytes
+
+        Examples:
+            >>> wav = tts.synthesize("Hello!", "ref.wav", language="en-us")
+
+            >>> # WAV bytes
+            >>> wav_bytes = tts.to_audio_bytes(wav, format="wav")
+
+            >>> # MP3 bytes (requires pydub + ffmpeg)
+            >>> mp3_bytes = tts.to_audio_bytes(wav, format="mp3")
+
+            >>> # Raw PCM bytes (16-bit signed int, for WebRTC)
+            >>> pcm_bytes = tts.to_audio_bytes(wav, format="pcm")
+        """
+        wav_int16 = (wav * 32767).clip(-32768, 32767).astype(np.int16)
+
+        if format == "pcm":
+            return wav_int16.tobytes()
+
+        if format == "wav":
+            buf = io.BytesIO()
+            import scipy.io.wavfile as wavfile
+            wavfile.write(buf, sr, wav_int16)
+            return buf.getvalue()
+
+        # mp3, ogg, flac - use pydub
+        try:
+            from pydub import AudioSegment
+        except ImportError:
+            raise ImportError(
+                f"pydub is required for '{format}' format. "
+                "Install with: pip install pydub\n"
+                "Also requires ffmpeg: sudo apt-get install ffmpeg"
+            )
+
+        segment = AudioSegment(
+            data=wav_int16.tobytes(),
+            sample_width=2,
+            frame_rate=sr,
+            channels=1,
+        )
+        buf = io.BytesIO()
+        segment.export(buf, format=format, bitrate=bitrate)
+        return buf.getvalue()
+
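For the "wav" branch above, the int16 scaling plus WAV container step can be sketched with the standard library alone (this uses Python's `wave` module in place of `scipy.io.wavfile`; `float_to_wav_bytes` is a hypothetical helper for illustration, not part of the chiluka API):

```python
import io
import wave

import numpy as np


def float_to_wav_bytes(wav: np.ndarray, sr: int = 24000) -> bytes:
    """Convert a float waveform in [-1, 1] to 16-bit PCM WAV bytes in memory."""
    # Same scaling as to_audio_bytes: scale to int16 range, clip, cast
    wav_int16 = (wav * 32767).clip(-32768, 32767).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sr)
        wf.writeframes(wav_int16.tobytes())
    return buf.getvalue()


# A 0.1 s, 440 Hz sine wave at 24 kHz stands in for model output
t = np.linspace(0, 0.1, 2400, endpoint=False)
wav_bytes = float_to_wav_bytes(np.sin(2 * np.pi * 440 * t))
print(wav_bytes[:4])  # WAV files start with the RIFF magic: b'RIFF'
```

The resulting bytes can be returned directly from an HTTP handler with `Content-Type: audio/wav`, which is the use case `to_audio_bytes` targets.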
+    def synthesize_stream(
+        self,
+        text: str,
+        reference_audio: str,
+        language: str = "en-us",
+        format: str = "pcm",
+        chunk_size: int = 4800,
+        sr: int = 24000,
+        bitrate: str = "128k",
+        **synth_kwargs,
+    ) -> Generator[bytes, None, None]:
+        """
+        Synthesize speech and yield audio chunks for streaming.
+
+        Generates the full audio, then yields it in chunks suitable for
+        real-time streaming over WebRTC, WebSocket, or HTTP chunked transfer.
+
+        Args:
+            text: Input text to synthesize
+            reference_audio: Path to reference audio for style transfer
+            language: Language code (e.g., "en-us", "hi", "te")
+            format: Output format per chunk - "pcm", "wav", "mp3", "ogg"
+            chunk_size: Number of samples per chunk (default 4800 = 200ms at 24kHz)
+            sr: Sample rate
+            bitrate: Bitrate for compressed formats
+            **synth_kwargs: Additional args passed to synthesize()
+                (alpha, beta, diffusion_steps, embedding_scale)
+
+        Yields:
+            Audio data chunks as bytes
+
+        Examples:
+            >>> # Stream PCM chunks over WebSocket
+            >>> for chunk in tts.synthesize_stream("Hello!", "ref.wav"):
+            ...     websocket.send(chunk)
+
+            >>> # Stream MP3 chunks
+            >>> for chunk in tts.synthesize_stream("Hello!", "ref.wav", format="mp3"):
+            ...     response.write(chunk)
+        """
+        wav = self.synthesize(
+            text=text,
+            reference_audio=reference_audio,
+            language=language,
+            sr=sr,
+            **synth_kwargs,
+        )
+
+        wav_int16 = (wav * 32767).clip(-32768, 32767).astype(np.int16)
+
+        if format == "pcm":
+            for i in range(0, len(wav_int16), chunk_size):
+                yield wav_int16[i:i + chunk_size].tobytes()
+            return
+
+        # For compressed formats, encode the full audio then chunk the bytes
+        audio_bytes = self.to_audio_bytes(wav, format=format, sr=sr, bitrate=bitrate)
+        byte_chunk_size = chunk_size * 4  # approximate byte size per chunk
+        for i in range(0, len(audio_bytes), byte_chunk_size):
+            yield audio_bytes[i:i + byte_chunk_size]
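The "pcm" branch of `synthesize_stream` reduces to fixed-size slicing over the int16 buffer; the chunk arithmetic can be checked without loading a model (`fake_wav` below is a silent stand-in for `synthesize()` output):

```python
import numpy as np

SR = 24000         # output sample rate used throughout the package
CHUNK_SIZE = 4800  # samples per chunk: 4800 / 24000 s = 200 ms of audio

# Stand-in for synthesize() output: 0.6 s of silence, float in [-1, 1]
fake_wav = np.zeros(3 * CHUNK_SIZE, dtype=np.float32)

# Same conversion synthesize_stream applies before slicing
wav_int16 = (fake_wav * 32767).clip(-32768, 32767).astype(np.int16)

chunks = [
    wav_int16[i:i + CHUNK_SIZE].tobytes()
    for i in range(0, len(wav_int16), CHUNK_SIZE)
]

print(len(chunks))     # 3 chunks for 0.6 s of audio
print(len(chunks[0]))  # 9600 bytes = 4800 samples * 2 bytes each
print(b"".join(chunks) == wav_int16.tobytes())  # True: chunking is lossless
```

Note that when the waveform length is not a multiple of `chunk_size`, the final chunk is simply shorter; consumers reassembling PCM on the other side of a socket do not need padding.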
examples/basic_synthesis.py CHANGED
@@ -20,7 +20,7 @@ def main():
     parser = argparse.ArgumentParser(description="Chiluka TTS Synthesis")
     parser.add_argument("--reference", "-r", required=True, help="Path to reference audio file")
     parser.add_argument("--text", "-t", default="Hello, this is Chiluka speaking!", help="Text to synthesize")
-    parser.add_argument("--language", "-l", default="en", help="Language code (en, te, hi, etc.)")
+    parser.add_argument("--language", "-l", default="en-us", help="Language code (en-us, te, hi, etc.)")
     parser.add_argument("--output", "-o", default="output.wav", help="Output WAV file path")
     parser.add_argument("--alpha", type=float, default=0.3, help="Acoustic style mixing (0-1)")
     parser.add_argument("--beta", type=float, default=0.7, help="Prosodic style mixing (0-1)")
examples/huggingface_example.py CHANGED
@@ -23,7 +23,7 @@ def main():
     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant to use (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_hf.wav", help="Output wav file path")
     parser.add_argument("--device", type=str, default=None, help="Device: cuda or cpu")
     args = parser.parse_args()
@@ -46,7 +46,7 @@
     if args.model == "telugu":
         args.language = "te"
     else:
-        args.language = "en"
+        args.language = "en-us"
 
     # Load model from HuggingFace Hub (auto-downloads on first use)
     print(f"Loading '{args.model}' model from HuggingFace Hub...")
examples/pip_example.py CHANGED
@@ -22,7 +22,7 @@ def main():
     parser.add_argument("--model", type=str, default="hindi_english", choices=["hindi_english", "telugu"],
                         help="Model variant (default: hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_pip.wav", help="Output wav file path")
     args = parser.parse_args()
 
@@ -38,7 +38,7 @@
         args.text = texts[args.model]
 
     if args.language is None:
-        langs = {"hindi_english": "en", "telugu": "te"}
+        langs = {"hindi_english": "en-us", "telugu": "te"}
        args.language = langs[args.model]
 
     # List models
examples/torchhub_example.py CHANGED
@@ -23,7 +23,7 @@ def main():
     parser.add_argument("--variant", type=str, default="default", choices=["default", "telugu", "hindi_english"],
                         help="Model variant (default, telugu, hindi_english)")
     parser.add_argument("--text", type=str, default=None, help="Text to synthesize")
-    parser.add_argument("--language", type=str, default=None, help="Language code (en, hi, te)")
+    parser.add_argument("--language", type=str, default=None, help="Language code (en-us, hi, te)")
     parser.add_argument("--output", type=str, default="output_torchhub.wav", help="Output wav file path")
     args = parser.parse_args()
 
@@ -38,7 +38,7 @@
     if args.variant == "telugu":
         args.language = "te"
     else:
-        args.language = "en"
+        args.language = "en-us"
 
     # Load via torch.hub
     # Available entry points:
hubconf.py CHANGED
@@ -14,7 +14,7 @@ Usage:
     wav = tts.synthesize(
         text="Hello, world!",
         reference_audio="path/to/reference.wav",
-        language="en"
+        language="en-us"
     )
 """
 
@@ -50,7 +50,7 @@ def chiluka(pretrained: bool = True, device: str = None, **kwargs):
     Example:
         >>> import torch
        >>> tts = torch.hub.load('Seemanth/chiluka', 'chiluka')
-        >>> wav = tts.synthesize("Hello!", "reference.wav", language="en")
+        >>> wav = tts.synthesize("Hello!", "reference.wav", language="en-us")
    """
    from chiluka import Chiluka
 