---
license: mit
language:
- en
pipeline_tag: text-to-speech
tags:
- voice
- speech
- tts
- vits
- expressive-voice
- gradio
- neural-tts
datasets:
- Jinsaryko/Elise
---
<p align="center">
  <img src="logo.png" alt="Sonya TTS Logo" width="800"/>
</p>

<h1 align="center">✨ Sonya TTS</h1>
<h3 align="center">A Beautiful, Expressive Neural Voice Engine</h3>

<p align="center">
  <em>High-fidelity AI speech with emotion, rhythm, and audiobook-quality narration</em>
</p>

<p align="center">
  <a href="https://huggingface.co/PatnaikAshish/Sonya-TTS">
    <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow" alt="Hugging Face"/>
  </a>
  <a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
    <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
  </a>
  <img src="https://img.shields.io/badge/Language-English-blue" alt="Language"/>
  <img src="https://img.shields.io/badge/Architecture-VITS-green" alt="VITS"/>
  <img src="https://img.shields.io/badge/Python-3.10-brightgreen" alt="Python"/>
  
</p>

---

## 🎧 Listen to Sonya

Experience the expressive quality of Sonya TTS:

<div align="center">
  <video width="800" controls autoplay loop muted>
    <source src="https://huggingface.co/PatnaikAshish/Sonya-TTS/resolve/main/demo.mp4" type="video/mp4">
  </video>
</div>

*Extended narration showcasing rhythm control, natural pauses, and consistent tone across paragraphs. More examples are in the `examples/` folder.*

Try the demo on Hugging Face Spaces:
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
    <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>

---

## 🌸 About Sonya TTS

**Sonya TTS** is a lightweight, expressive **single-speaker English Text-to-Speech model** built on the **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)** architecture.

Trained for approximately **10,000 steps** on a publicly available **expressive voice dataset**, Sonya delivers:

- 🎭 **Natural emotion and intonation**: More human-like speech with genuine expressiveness
- 🎵 **Smooth rhythm and prosody**: Natural flow and timing in speech
- 📖 **Long-form narration**: Perfect for audiobook-style content with consistent quality
- ⚡ **Blazing-fast inference**: Optimized for both **GPU and CPU** deployment

Sonya is more than a model checkpoint: it ships as a complete TTS system with a web interface, command-line tools, and audiobook narration capabilities.

GitHub repository: https://github.com/Ashish-Patnaik/Sonya-TTS

---

## ✨ Key Features

### 🎭 Expressive Voice Quality
Unlike monotone TTS models, Sonya produces speech with natural emotion, dynamic intonation, and human-like expressiveness. Trained on an expressive dataset, it captures the nuances that make speech feel alive.

### ⚡ Lightning-Fast Inference
Highly optimized for real-world deployment:
- **GPU**: Extremely fast generation for real-time applications
- **CPU**: Efficient performance for edge devices and local deployments
- Low latency makes it suitable for interactive applications

### 📖 Audiobook Mode
Built for long-form content with:
- Intelligent sentence splitting and paragraph handling
- Natural pauses between sentences
- Consistent voice quality across extended text
- Stable rhythm and pacing throughout
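
A minimal sketch of the kind of splitting audiobook mode relies on, assuming paragraphs are separated by blank lines and sentences end in `.`, `!`, or `?`. The helper name `split_for_narration` and its regular expressions are illustrative assumptions, not the logic used in `audiobook.py`.

```python
import re

# Hypothetical helper, not the code in audiobook.py: it only illustrates
# paragraph and sentence splitting under simple punctuation rules.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_for_narration(text: str) -> list[list[str]]:
    """Return a list of paragraphs, each a list of sentence strings."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for paragraph in paragraphs:
        # Collapse internal line breaks so wrapped lines form one paragraph.
        flat = " ".join(paragraph.split())
        sentences = [s for s in _SENTENCE_END.split(flat) if s]
        if sentences:
            result.append(sentences)
    return result


sample = "Hello there. How are you?\n\nThis is a new\nparagraph!"
chunks = split_for_narration(sample)
print(chunks)
```

Keeping the paragraph boundary in the result lets a caller use a longer pause between paragraphs than between sentences. Abbreviations such as "Dr." would be split incorrectly by this simple rule.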

### 🎛️ Fine-Grained Voice Control
Customize speech output with intuitive parameters:
- **Emotion (Noise Scale)**: Control expressiveness and variation
- **Rhythm (Noise Width)**: Adjust timing and flow
- **Speed (Length Scale)**: Modify speaking rate

### 🌐 Open & Accessible
Model weights and configuration files are publicly hosted on Hugging Face:
- 📦 **SafeTensors** format for secure, fast loading
- 🔓 Available for research and experimentation
- 🚀 Easy integration with your projects

---

## ⚠️ Limitations & Transparency

Sonya TTS is a research project and **not a perfect commercial solution**:

- **Word skipping**: Occasionally skips or merges words in complex sentences
- **Pronunciation**: Some uncommon words may be mispronounced
- **Alignment artifacts**: Rare timing issues in very long passages
- **Single speaker**: Currently supports only one English voice
- **Language**: English only at this time

Despite these limitations, Sonya demonstrates strong practical usability and expressive quality.

---

## 🧠 Training Journey

This project was a deep dive into modern speech synthesis:

| Detail | Value |
|--------|-------|
| **Architecture** | VITS (Conditional VAE + GAN) |
| **Training Steps** | ~10,400 |
| **Dataset** | Public expressive speech corpus |
| **Language** | English |
| **Speaker** | Single female voice |
| **Training Focus** | Emotion, prosody, and long-form stability |

### What I Learned
Building Sonya taught me invaluable lessons about:
- Text-to-speech alignment mechanisms and attention
- Prosody control and emotional expressiveness
- Audio generation pipelines and vocoding
- Model optimization for inference speed
- Packaging and deployment of ML models
- Real-world challenges in speech synthesis

---

## 📦 Repository Structure

```
Sonya-TTS/
├── checkpoints/
│   ├── sonya-tts.safetensors    # Model weights (SafeTensors format)
│   └── config.json              # Model configuration
│
├── tts/                         # Core model architecture
│   ├── models.py
│   ├── commons.py
│   └── modules.py
│
├── text/                        # Text processing pipeline
│   ├── symbols.py
│   ├── cleaners.py
│   └── __init__.py
│
├── infer.py                     # CLI for short text synthesis
├── audiobook.py                 # Long-form narration script
├── webui.py                     # Gradio web interface
│
├── examples/
│   ├── short.wav                # Quick speech demo
│   └── long.wav                 # Audiobook demo
│
├── logo.png                     # Project logo
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```

---

## 🚀 Installation & Setup

### Prerequisites
- Python 3.10 or higher
- Conda (recommended) or virtualenv
- eSpeak-NG (for phonemization)

### Step 1: Create Environment

```bash
# Create a new conda environment
conda create -n sonya-tts python=3.10 -y

# Activate the environment
conda activate sonya-tts
```

### Step 2: Install eSpeak-NG

**🪟 Windows**
1. Download the installer from [eSpeak-NG Releases](https://github.com/espeak-ng/espeak-ng/releases)
2. Run the installer and follow the setup wizard
3. Add eSpeak to your system PATH if not done automatically

**🐧 Linux (Ubuntu/Debian)**
```bash
sudo apt update
sudo apt install espeak-ng
```

**🍎 macOS**
```bash
# Using Homebrew
brew install espeak-ng
```

### Step 3: Install Dependencies

```bash
# Install all required Python packages
pip install -r requirements.txt
```

### Step 4: Launch Sonya TTS

```bash
# Start the web interface
python webui.py
```

The terminal will display a local URL (typically `http://127.0.0.1:7860`). Open it in your browser to access the interface!

---

## 🎯 Usage Options

Sonya TTS provides three flexible ways to generate speech:

### 1️⃣ `infer.py`: Short Text Synthesis

Perfect for generating single audio files from short text:

```bash
python infer.py 
```

**Use Case**: Quick testing, automation scripts, batch processing

### 2️⃣ `audiobook.py`: Long-Form Narration

Designed for extended text with intelligent sentence splitting:

```bash
python audiobook.py 
```

**Features**:
- Automatic paragraph detection
- Natural pauses between sentences
- Consistent voice across long passages
- Perfect for audiobooks, articles, and documentation
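
The pauses between sentences can be pictured as short stretches of silence inserted between synthesized chunks. The sketch below uses only the standard library and stand-in audio. The sample rate, the 16-bit mono PCM format, the 0.3 s pause, and both function names are assumptions for illustration, not values taken from `audiobook.py` or `config.json`.

```python
import io
import wave

SAMPLE_RATE = 22050   # assumed; read the real value from config.json
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM, assumed)


def join_with_pauses(chunks: list[bytes], pause_seconds: float = 0.3) -> bytes:
    """Concatenate mono PCM chunks with silence between consecutive chunks."""
    silence = b"\x00" * (round(SAMPLE_RATE * pause_seconds) * SAMPLE_WIDTH)
    return silence.join(chunks)


def to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container using the standard-library wave module."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# Stand-in audio: two 0.5 s chunks of silence instead of real synthesis output.
half_second = b"\x00" * (SAMPLE_RATE // 2 * SAMPLE_WIDTH)
pcm = join_with_pauses([half_second, half_second], pause_seconds=0.3)
wav_bytes = to_wav_bytes(pcm)
duration = len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH)
print(f"{duration:.2f} s of audio, {len(wav_bytes)} bytes as WAV")
```

Synthesizing sentence by sentence and joining afterwards keeps each model call short, which is one plausible way to keep voice quality stable over long passages.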

### 3️⃣ `webui.py`: Interactive Web Interface

Beautiful Gradio-powered UI with real-time controls:

```bash
python webui.py
```

**Features**:
- Adjustable emotion, rhythm, and speed sliders
- Audiobook mode toggle
- Download generated audio
- No coding required!

---

## 🌐 Model Hosting

All model files are hosted on Hugging Face for easy access:

**🤗 Model Repository**: [PatnaikAshish/Sonya-TTS](https://huggingface.co/PatnaikAshish/Sonya-TTS)

**Files in `checkpoints/` directory**:
- `sonya-tts.safetensors`: Model weights (SafeTensors format)
- `config.json`: Model configuration and hyperparameters

The code **automatically downloads** these files on first run if they're not present locally. No manual setup needed!
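
A download-if-missing step like the one described above could be sketched as follows. This is not the project's actual loader: `ensure_checkpoints` and `_hub_download` are hypothetical names, and the sketch assumes the files sit under `checkpoints/` in the Hugging Face repository, as the listing above suggests. `hf_hub_download` is the real `huggingface_hub` function; it is imported lazily so the rest of the helper works without that package installed.

```python
from pathlib import Path
from typing import Callable

REPO_ID = "PatnaikAshish/Sonya-TTS"
FILES = ["checkpoints/sonya-tts.safetensors", "checkpoints/config.json"]


def _hub_download(repo_id: str, filename: str, local_dir: Path) -> str:
    # Imported lazily so the helper can be exercised without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


def ensure_checkpoints(
    root: Path,
    download: Callable[[str, str, Path], str] = _hub_download,
) -> list[Path]:
    """Download any missing checkpoint file and return the local paths."""
    paths = []
    for name in FILES:
        target = root / name
        if not target.exists():
            # Only files that are absent locally trigger a download.
            download(REPO_ID, name, root)
        paths.append(target)
    return paths
```

Calling `ensure_checkpoints(Path("."))` from the repository root would fetch the two files on the first run and do nothing on later runs.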

---

## 🎛️ Advanced Configuration

You can customize the voice output by adjusting these parameters:

| Parameter | Range | Effect |
|-----------|-------|--------|
| **noise_scale** | 0.1 - 1.0 | Controls emotion and expressiveness (higher = more variation) |
| **noise_scale_w** | 0.1 - 1.0 | Affects rhythm and timing (higher = more variation in durations and pauses) |
| **length_scale** | 0.5 - 2.0 | Controls speaking speed (lower = faster, higher = slower) |

Example in code:
```python
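# Collect the synthesis settings in one place; pass them on to the
# inference call in your own script.
synthesis_params = dict(
    text="Your text here",
    noise_scale=0.667,      # Moderate emotion
    noise_scale_w=0.8,      # Natural rhythm
    length_scale=1.0,       # Normal speed
)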
```
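
When these values come from user input, it can help to pull them back into the documented ranges before synthesis. The sketch below copies the ranges from the table above. The `VoiceSettings` class and its `clamped` method are illustrative assumptions and are not part of the Sonya TTS code.

```python
from dataclasses import dataclass

# Ranges copied from the parameter table; the class itself is illustrative.
RANGES = {
    "noise_scale": (0.1, 1.0),
    "noise_scale_w": (0.1, 1.0),
    "length_scale": (0.5, 2.0),
}


@dataclass(frozen=True)
class VoiceSettings:
    noise_scale: float = 0.667
    noise_scale_w: float = 0.8
    length_scale: float = 1.0

    def clamped(self) -> "VoiceSettings":
        """Return a copy with every value pulled into its documented range."""
        values = {}
        for name, (low, high) in RANGES.items():
            values[name] = min(max(getattr(self, name), low), high)
        return VoiceSettings(**values)


fast_and_flat = VoiceSettings(noise_scale=0.05, length_scale=0.4).clamped()
print(fast_and_flat)
```

Clamping silently is a design choice. Raising an error on out-of-range values would be the stricter alternative.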

---

## 💡 Use Cases

Sonya TTS is versatile and can be used for:

- 📚 **Audiobook Production**: Convert books and articles to speech
- 🎮 **Game Narration**: Dynamic voiceovers for indie games
- 📱 **Accessibility Tools**: Screen readers and assistive technology
- 🎓 **E-Learning**: Educational content narration
- 🤖 **Virtual Assistants**: Expressive voice for chatbots
- 📻 **Podcast Intros**: Quick voiceovers and announcements
- 🎬 **Prototyping**: Rapid audio mockups for videos

---

## 🔧 Technical Details

### VITS Architecture
Sonya uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which combines:
- **Conditional VAE** for probabilistic acoustic modeling
- **GAN-based training** for high-quality audio generation
- **Normalizing flows** for flexible distribution modeling
- **Stochastic duration prediction** for natural timing
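
To make the role of the sampling parameters concrete, the scalar sketch below mirrors how the reference VITS implementation (linked in this README) uses `noise_scale` and `length_scale` at inference time. There, `noise_scale_w` plays the same scaling role inside the stochastic duration predictor. Whether Sonya's `tts/models.py` matches the reference exactly is an assumption, and the function names here are hypothetical.

```python
import math
import random


def sample_latent(mean: float, log_std: float, noise_scale: float,
                  rng: random.Random) -> float:
    """Prior sample: mean + eps * exp(log_std) * noise_scale, eps ~ N(0, 1)."""
    return mean + rng.gauss(0.0, 1.0) * math.exp(log_std) * noise_scale


def frames_for_phoneme(log_duration: float, length_scale: float) -> int:
    """Predicted duration in frames: ceil(exp(log_duration) * length_scale)."""
    return math.ceil(math.exp(log_duration) * length_scale)


rng = random.Random(0)
# With noise_scale = 0 the random term vanishes and the prior mean is returned.
flat = sample_latent(mean=0.5, log_std=-1.0, noise_scale=0.0, rng=rng)
# A larger length_scale yields more frames per phoneme, so slower speech.
normal = frames_for_phoneme(1.0, length_scale=1.0)
slow = frames_for_phoneme(1.0, length_scale=2.0)
print(flat, normal, slow)
```

This is why a lower `noise_scale` sounds flatter and more deterministic, and why `length_scale` acts as a direct speed multiplier.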

### Performance Benchmarks
- **GPU (NVIDIA RTX 3090)**: ~0.1s for 10 seconds of audio
- **CPU (Intel i7-12700K)**: ~2s for 10 seconds of audio
- Speed relative to real time: roughly 5x (CPU) to 100x (GPU) on the hardware above
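
Working through the two timing figures above, speed relative to real time is the audio duration divided by the synthesis time. The function name `speedup` is only illustrative.

```python
def speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """How many times faster than real time the synthesis ran."""
    return audio_seconds / synthesis_seconds


# Figures quoted in the benchmarks above.
gpu = speedup(10.0, 0.1)
cpu = speedup(10.0, 2.0)
print(f"GPU: {gpu:.0f}x real time, CPU: {cpu:.0f}x real time")
```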

---

## 📜 License & Citation
This project is released under the MIT License. If you use Sonya TTS in your projects, please credit:

```bibtex
@software{sonya_tts_2026,
  author = {Ashish Patnaik},
  title = {Sonya TTS: An Expressive Neural Voice Engine},
  year = {2026},
  url = {https://huggingface.co/PatnaikAshish/Sonya-TTS}
}
```
Also see the original VITS repository:
```
https://github.com/jaywalnut310/vits
```

---

## 💜 Final Words

Sonya TTS represents countless hours of experimentation, training, debugging, and iteration. It's not perfect, but it's real, it's fast, and it's expressive.

This project taught me that building AI isn't just about achieving perfect metrics; it's about creating something useful, understanding the challenges deeply, and sharing knowledge with the community.

If Sonya helps you in any way, whether for a project, learning, or just exploration, I'd genuinely love to hear about it.

✨ **Thank you for listening to Sonya.**

---

## 👤 Author

**Ashish Patnaik**  
🤗 Hugging Face: [@PatnaikAshish](https://huggingface.co/PatnaikAshish)  
📧 Reach out for collaborations or questions!

---

## Acknowledgements
1. Training dataset: https://huggingface.co/datasets/Jinsaryko/Elise
2. VITS model: https://github.com/jaywalnut310/vits

## 🔗 Quick Links

- [🤗 Model on Hugging Face](https://huggingface.co/PatnaikAshish/Sonya-TTS)
- [📖 VITS Paper](https://arxiv.org/abs/2106.06103)
- [🎤 eSpeak-NG](https://github.com/espeak-ng/espeak-ng)

---

<p align="center">
  <sub>Made with 💜 by Ashish Patnaik</sub>
</p>