---
license: mit
language:
- en
pipeline_tag: text-to-speech
tags:
- voice
- speech
- tts
- vits
- expressive-voice
- gradio
- neural-tts
datasets:
- Jinsaryko/Elise
---
<p align="center">
<img src="logo.png" alt="Sonya TTS Logo" width="800"/>
</p>
<h1 align="center">✨ Sonya TTS</h1>
<h3 align="center">A Beautiful, Expressive Neural Voice Engine</h3>
<p align="center">
<em>High-fidelity AI speech with emotion, rhythm, and audiobook-quality narration</em>
</p>
<p align="center">
<a href="https://huggingface.co/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow" alt="Hugging Face"/>
</a>
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>
<img src="https://img.shields.io/badge/Language-English-blue" alt="Language"/>
<img src="https://img.shields.io/badge/Architecture-VITS-green" alt="VITS"/>
<img src="https://img.shields.io/badge/Python-3.10-brightgreen" alt="Python"/>
</p>
---
## 🎧 Listen to Sonya
Experience the expressive quality of Sonya TTS:
<div align="center">
<video width="800" controls autoplay loop muted>
<source src="https://huggingface.co/PatnaikAshish/Sonya-TTS/resolve/main/demo.mp4" type="video/mp4">
</video>
</div>
*Extended narration showcasing rhythm control, natural pauses, and consistent tone across paragraphs. More examples are in the `examples/` folder.*
Try the demo on Hugging Face Spaces:
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>
---
## 🌸 About Sonya TTS
**Sonya TTS** is a lightweight, expressive **single-speaker English Text-to-Speech model** built on the **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)** architecture.
Trained for approximately **10,000 steps** on the publicly available expressive voice dataset [Jinsaryko/Elise](https://huggingface.co/datasets/Jinsaryko/Elise), Sonya delivers:
- 🎭 **Natural emotion and intonation**: More human-like speech with genuine expressiveness
- 🎵 **Smooth rhythm and prosody**: Natural flow and timing in speech
- 📖 **Long-form narration**: Well suited to audiobook-style content with consistent quality
- ⚡ **Fast inference**: Optimized for both **GPU and CPU** deployment
This isn't just a model: it's a complete TTS system with a web interface, command-line tools, and audiobook narration capabilities.
GitHub repository: https://github.com/Ashish-Patnaik/Sonya-TTS
---
## ✨ Key Features
### 🎭 Expressive Voice Quality
Unlike monotone TTS models, Sonya produces speech with natural emotion, dynamic intonation, and human-like expressiveness. Trained on an expressive dataset, it captures the nuances that make speech feel alive.
### ⚡ Lightning-Fast Inference
Highly optimized for real-world deployment:
- **GPU**: Extremely fast generation for real-time applications
- **CPU**: Efficient performance for edge devices and local deployments
- Low latency makes it suitable for interactive applications
### 📚 Audiobook Mode
Built for long-form content with:
- Intelligent sentence splitting and paragraph handling
- Natural pauses between sentences
- Consistent voice quality across extended text
- Stable rhythm and pacing throughout
### 🎛️ Fine-Grained Voice Control
Customize speech output with intuitive parameters:
- **Emotion (Noise Scale)**: Control expressiveness and variation
- **Rhythm (Noise Width)**: Adjust timing and flow
- **Speed (Length Scale)**: Modify speaking rate
### Open & Accessible
Model weights and configuration files are publicly hosted on Hugging Face:
- 📦 **SafeTensors** format for secure, fast loading
- Available for research and experimentation
- Easy integration with your projects
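A SafeTensors file starts with an 8-byte little-endian header length followed by a JSON header that describes each tensor; the rest is raw tensor bytes, so loading never executes pickled code. The sketch below reads only that header with the standard library. The function names are hypothetical, and the path in the usage note is the one listed in this README.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to metadata
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(path):
    """Print name, dtype, and shape for every tensor in the file."""
    header = read_safetensors_header(path)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    for name, info in header.items():
        print(name, info["dtype"], info["shape"])
```

Calling `summarize("checkpoints/sonya-tts.safetensors")` lists the stored tensors without loading any weights into memory.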
---
## ⚠️ Limitations & Transparency
Sonya TTS is a research project and **not a perfect commercial solution**:
- **Word skipping**: Occasionally skips or merges words in complex sentences
- **Pronunciation**: Some uncommon words may be mispronounced
- **Alignment artifacts**: Rare timing issues in very long passages
- **Single speaker**: Currently supports only one English voice
- **Language**: English only at this time
Despite these limitations, Sonya demonstrates strong practical usability and expressive quality.
---
## 🧠 Training Journey
This project was a deep dive into modern speech synthesis:
| Detail | Value |
|--------|-------|
| **Architecture** | VITS (Conditional VAE + GAN) |
| **Training Steps** | ~10,400 |
| **Dataset** | Jinsaryko/Elise (public expressive speech corpus) |
| **Language** | English |
| **Speaker** | Single female voice |
| **Training Focus** | Emotion, prosody, and long-form stability |
### What I Learned
Building Sonya taught me invaluable lessons about:
- Text-to-speech alignment mechanisms and attention
- Prosody control and emotional expressiveness
- Audio generation pipelines and vocoding
- Model optimization for inference speed
- Packaging and deployment of ML models
- Real-world challenges in speech synthesis
---
## 📦 Repository Structure
```
Sonya-TTS/
├── checkpoints/
│   ├── sonya-tts.safetensors    # Model weights (SafeTensors format)
│   └── config.json              # Model configuration
│
├── tts/                         # Core model architecture
│   ├── models.py
│   ├── commons.py
│   └── modules.py
│
├── text/                        # Text processing pipeline
│   ├── symbols.py
│   ├── cleaners.py
│   └── __init__.py
│
├── infer.py                     # CLI for short text synthesis
├── audiobook.py                 # Long-form narration script
├── webui.py                     # Gradio web interface
│
├── examples/
│   ├── short.wav                # Quick speech demo
│   └── long.wav                 # Audiobook demo
│
├── logo.png                     # Project logo
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
---
## 🚀 Installation & Setup
### Prerequisites
- Python 3.10 or higher
- Conda (recommended) or virtualenv
- eSpeak-NG (for phonemization)
### Step 1: Create Environment
```bash
# Create a new conda environment
conda create -n sonya-tts python=3.10 -y
# Activate the environment
conda activate sonya-tts
```
### Step 2: Install eSpeak-NG
**🪟 Windows**
1. Download the installer from [eSpeak-NG Releases](https://github.com/espeak-ng/espeak-ng/releases)
2. Run the installer and follow the setup wizard
3. Add eSpeak to your system PATH if not done automatically
**🐧 Linux (Ubuntu/Debian)**
```bash
sudo apt update
sudo apt install espeak-ng
```
**🍎 macOS**
```bash
# Using Homebrew
brew install espeak-ng
```
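After installing, it can help to confirm that the `espeak-ng` executable is actually visible on your PATH before moving on. The snippet below is a minimal check using only the standard library; the function names are hypothetical and it is not part of the repository.

```python
import shutil
import subprocess


def find_espeak():
    """Return the full path of the espeak-ng executable, or None if it is not on PATH."""
    return shutil.which("espeak-ng")


def espeak_version(executable):
    """Return the version banner printed by `espeak-ng --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


if __name__ == "__main__":
    exe = find_espeak()
    if exe is None:
        print("espeak-ng not found on PATH; revisit Step 2.")
    else:
        print(f"Found {exe}: {espeak_version(exe)}")
```

If the check fails on Windows, the usual cause is that the install directory was not added to PATH.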
### Step 3: Install Dependencies
```bash
# Install all required Python packages
pip install -r requirements.txt
```
### Step 4: Launch Sonya TTS
```bash
# Start the web interface
python webui.py
```
The terminal will display a local URL (typically `http://127.0.0.1:7860`). Open it in your browser to access the interface!
---
## 🎯 Usage Options
Sonya TTS provides three flexible ways to generate speech:
### 1️⃣ `infer.py`: Short Text Synthesis
Perfect for generating single audio files from short text:
```bash
python infer.py
```
**Use Case**: Quick testing, automation scripts, batch processing
### 2️⃣ `audiobook.py`: Long-Form Narration
Designed for extended text with intelligent sentence splitting:
```bash
python audiobook.py
```
**Features**:
- Automatic paragraph detection
- Natural pauses between sentences
- Consistent voice across long passages
- Perfect for audiobooks, articles, and documentation
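To illustrate the idea of sentence splitting with pauses, here is a minimal sketch, not the actual code in `audiobook.py`. It splits text into paragraphs on blank lines, then into sentences on end punctuation, and attaches a pause length to each chunk. The function name and both pause values are assumptions made for the example.

```python
import re

SENTENCE_PAUSE = 0.3   # seconds of silence after a sentence (assumed value)
PARAGRAPH_PAUSE = 0.8  # seconds of silence after a paragraph (assumed value)


def split_for_narration(text):
    """Return a list of (sentence, pause_after_seconds) pairs."""
    chunks = []
    # Blank lines separate paragraphs
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for paragraph in paragraphs:
        # Split after ., ! or ? when followed by whitespace
        sentences = [
            s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()
        ]
        for i, sentence in enumerate(sentences):
            is_last = i == len(sentences) - 1
            chunks.append((sentence, PARAGRAPH_PAUSE if is_last else SENTENCE_PAUSE))
    return chunks
```

Each sentence would then be synthesized on its own and the audio joined with silence of the given length. A simple regex like this also splits after abbreviations such as "Dr.", so a real splitter needs extra handling for those cases.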
### 3️⃣ `webui.py`: Interactive Web Interface
Beautiful Gradio-powered UI with real-time controls:
```bash
python webui.py
```
**Features**:
- Adjustable emotion, rhythm, and speed sliders
- Audiobook mode toggle
- Download generated audio
- No coding required!
---
## Model Hosting
All model files are hosted on Hugging Face for easy access:
**🤗 Model Repository**: [PatnaikAshish/Sonya-TTS](https://huggingface.co/PatnaikAshish/Sonya-TTS)
**Files in `checkpoints/` directory**:
- `sonya-tts.safetensors`: Model weights (SafeTensors format)
- `config.json`: Model configuration and hyperparameters
The code **automatically downloads** these files on first run if they're not present locally. No manual setup needed!
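If you prefer to fetch the files yourself, the sketch below shows one way to do it with `hf_hub_download` from the `huggingface_hub` package. It is an illustration, not the repository's own download logic: the helper names are hypothetical, and the repo-relative file paths are assumed from the `checkpoints/` listing above.

```python
from pathlib import Path

REPO_ID = "PatnaikAshish/Sonya-TTS"
# Repo-relative paths, assumed from the checkpoints/ listing in this README
FILES = ("checkpoints/sonya-tts.safetensors", "checkpoints/config.json")


def missing_files(root="."):
    """Return the checkpoint files that are not yet present under root."""
    root = Path(root)
    return [name for name in FILES if not (root / name).is_file()]


def ensure_checkpoints(root="."):
    """Download any missing checkpoint files, then return their local paths."""
    missing = missing_files(root)
    if missing:
        # Imported lazily so the check above works without the package installed
        from huggingface_hub import hf_hub_download

        for name in missing:
            # local_dir keeps the repo-relative layout under root
            hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=str(root))
    return [Path(root) / name for name in FILES]
```

Running `ensure_checkpoints()` from the project root only touches the network when a file is missing.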
---
## 🎛️ Advanced Configuration
You can customize the voice output by adjusting these parameters:
| Parameter | Range | Effect |
|-----------|-------|--------|
| **noise_scale** | 0.1 - 1.0 | Controls emotion and expressiveness (higher = more variation) |
| **noise_scale_w** | 0.1 - 1.0 | Affects rhythm and timing (higher = more variation in phoneme durations) |
| **length_scale** | 0.5 - 2.0 | Controls speaking speed (lower = faster, higher = slower) |
Example in code:
```python
# Collect the synthesis settings as keyword arguments
settings = dict(
    text="Your text here",
    noise_scale=0.667,    # Moderate emotion
    noise_scale_w=0.8,    # Natural rhythm
    length_scale=1.0,     # Normal speed
)
```
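When these values come from user input, it is useful to keep them inside the ranges in the table above. The following is a small sketch of that idea; `VoiceSettings` and `RANGES` are hypothetical names, not part of the Sonya TTS code, and the defaults are the example values shown above.

```python
from dataclasses import dataclass

# Allowed ranges, copied from the parameter table above
RANGES = {
    "noise_scale": (0.1, 1.0),
    "noise_scale_w": (0.1, 1.0),
    "length_scale": (0.5, 2.0),
}


@dataclass
class VoiceSettings:
    noise_scale: float = 0.667
    noise_scale_w: float = 0.8
    length_scale: float = 1.0

    def clamped(self):
        """Return a copy with every value forced into its documented range."""
        values = {}
        for name, (low, high) in RANGES.items():
            values[name] = min(max(getattr(self, name), low), high)
        return VoiceSettings(**values)
```

For example, `VoiceSettings(noise_scale=1.5, length_scale=0.2).clamped()` yields a `noise_scale` of 1.0 and a `length_scale` of 0.5, leaving `noise_scale_w` at its default.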
---
## 💡 Use Cases
Sonya TTS is versatile and can be used for:
- 📚 **Audiobook Production**: Convert books and articles to speech
- 🎮 **Game Narration**: Dynamic voiceovers for indie games
- 📱 **Accessibility Tools**: Screen readers and assistive technology
- 🎓 **E-Learning**: Educational content narration
- 🤖 **Virtual Assistants**: Expressive voice for chatbots
- 📻 **Podcast Intros**: Quick voiceovers and announcements
- 🎬 **Prototyping**: Rapid audio mockups for videos
---
## 🔧 Technical Details
### VITS Architecture
Sonya uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which combines:
- **Conditional VAE** for probabilistic acoustic modeling
- **GAN-based training** for high-quality audio generation
- **Normalizing flows** for flexible distribution modeling
- **Stochastic duration prediction** for natural timing
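To connect these components to the voice controls, the sketch below mirrors in plain Python how the reference VITS inference code (jaywalnut310/vits) applies two of them: `length_scale` multiplies the predicted phoneme durations, and `noise_scale` scales the random noise added when sampling the latent from the prior. The third, `noise_scale_w`, plays the same role for the noise fed to the stochastic duration predictor. This is an illustration under that assumption; Sonya's own `tts/models.py` is not shown here and may differ, and the function names are hypothetical.

```python
import math
import random


def frames_per_phoneme(log_durations, length_scale=1.0):
    """Turn predicted log-durations into whole frame counts per phoneme."""
    # A larger length_scale stretches every phoneme, so speech gets slower
    return [math.ceil(math.exp(logw) * length_scale) for logw in log_durations]


def sample_latent(mean, log_std, noise_scale=0.667, rng=random):
    """Draw z = mean + exp(log_std) * eps * noise_scale with eps ~ N(0, 1)."""
    # noise_scale = 0 returns the mean; larger values add more variation
    return [
        m + math.exp(s) * rng.gauss(0.0, 1.0) * noise_scale
        for m, s in zip(mean, log_std)
    ]
```

Setting `noise_scale` to 0 makes the output deterministic, which is why lower values sound flatter and higher values more varied.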
### Performance Benchmarks
- **GPU (NVIDIA RTX 3090)**: ~0.1s for 10 seconds of audio
- **CPU (Intel i7-12700K)**: ~2s for 10 seconds of audio
- Speed relative to real time: roughly 5x (CPU) to 100x (GPU), depending on hardware
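The relationship between these timings and real-time speed is simple arithmetic, shown here using the two measurements listed above. "Speedup" is audio duration divided by compute time; the conventional real-time factor (RTF) is the inverse, so lower is faster. The function names are hypothetical.

```python
def speedup(audio_seconds, compute_seconds):
    """How many times faster than real time the synthesis runs."""
    return audio_seconds / compute_seconds


def real_time_factor(audio_seconds, compute_seconds):
    """Compute time needed per second of audio (lower is faster)."""
    return compute_seconds / audio_seconds


# Timings taken from the benchmark list above: 10 seconds of audio each
for label, compute in [("GPU (RTX 3090)", 0.1), ("CPU (i7-12700K)", 2.0)]:
    print(
        f"{label}: {speedup(10.0, compute):.0f}x real time, "
        f"RTF {real_time_factor(10.0, compute):.2f}"
    )
```

With the figures above this works out to about 100x on the GPU and 5x on the CPU.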
---
## 📜 License & Citation
The project is released under the MIT License. If you use Sonya TTS in your projects, please credit:
```bibtex
@software{sonya_tts_2026,
author = {Ashish Patnaik},
title = {Sonya TTS: An Expressive Neural Voice Engine},
year = {2026},
url = {https://huggingface.co/PatnaikAshish/Sonya-TTS}
}
```
Also see the original VITS repository:
```
https://github.com/jaywalnut310/vits
```
---
## Final Words
Sonya TTS represents countless hours of experimentation, training, debugging, and iteration. It's not perfect, but it's real, it's fast, and it's expressive.
This project taught me that building AI isn't just about achieving perfect metrics; it's about creating something useful, understanding the challenges deeply, and sharing knowledge with the community.
If Sonya helps you in any way, whether for a project, learning, or just exploration, I'd genuinely love to hear about it.
✨ **Thank you for listening to Sonya.**
---
## 👤 Author
**Ashish Patnaik**
🤗 Hugging Face: [@PatnaikAshish](https://huggingface.co/PatnaikAshish)
📧 Reach out for collaborations or questions!
---
## Acknowledgements
1. Dataset used for training: https://huggingface.co/datasets/Jinsaryko/Elise
2. VITS model: https://github.com/jaywalnut310/vits
## 🔗 Quick Links
- [🤗 Model on Hugging Face](https://huggingface.co/PatnaikAshish/Sonya-TTS)
- [📄 VITS Paper](https://arxiv.org/abs/2106.06103)
- [🎤 eSpeak-NG](https://github.com/espeak-ng/espeak-ng)
---
<p align="center">
<sub>Made with 💖 by Ashish Patnaik</sub>
</p>