---
license: mit
language:
- en
pipeline_tag: text-to-speech
tags:
- voice
- speech
- tts
- vits
- expressive-voice
- gradio
- neural-tts
datasets:
- Jinsaryko/Elise
---
<p align="center">
<img src="logo.png" alt="Sonya TTS Logo" width="800"/>
</p>
<h1 align="center">โœจ Sonya TTS</h1>
<h3 align="center">A Beautiful, Expressive Neural Voice Engine</h3>
<p align="center">
<em>High-fidelity AI speech with emotion, rhythm, and audiobook-quality narration</em>
</p>
<p align="center">
<a href="https://huggingface.co/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/๐Ÿค—%20Hugging%20Face-Model-yellow" alt="Hugging Face"/>
</a>
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/๐Ÿค—%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>
<img src="https://img.shields.io/badge/Language-English-blue" alt="Language"/>
<img src="https://img.shields.io/badge/Architecture-VITS-green" alt="VITS"/>
<img src="https://img.shields.io/badge/Python-3.10-brightgreen" alt="Python"/>
</p>
---
## ๐ŸŽง Listen to Sonya
Experience the expressive quality of Sonya TTS:
<div align="center">
<video width="800" controls autoplay loop muted>
<source src="https://huggingface.co/PatnaikAshish/Sonya-TTS/resolve/main/demo.mp4" type="video/mp4">
</video>
</div>
*Extended narration showcasing rhythm control, natural pauses, and consistent tone across paragraphs. More examples are available in the `examples/` folder.*
Try the live demo on Hugging Face Spaces:
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/๐Ÿค—%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>
---
## ๐ŸŒธ About Sonya TTS
**Sonya TTS** is a lightweight, expressive **single-speaker English Text-to-Speech model** built on the **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)** architecture.
Trained for approximately **10,400 steps** on a publicly available **expressive voice dataset**, Sonya delivers:
- ๐ŸŽญ **Natural emotion and intonation** โ€” More human-like speech with genuine expressiveness
- ๐ŸŽต **Smooth rhythm and prosody** โ€” Natural flow and timing in speech
- ๐Ÿ“– **Long-form narration** โ€” Perfect for audiobook-style content with consistent quality
- โšก **Blazing-fast inference** โ€” Optimized for both **GPU and CPU** deployment
This isn't just a modelโ€”it's a complete, production-ready TTS system with a web interface, command-line tools, and audiobook narration capabilities.
GitHub repository: [Ashish-Patnaik/Sonya-TTS](https://github.com/Ashish-Patnaik/Sonya-TTS)
---
## โœจ Key Features
### ๐ŸŽญ Expressive Voice Quality
Unlike monotone TTS models, Sonya produces speech with natural emotion, dynamic intonation, and human-like expressiveness. Trained on an expressive dataset, it captures the nuances that make speech feel alive.
### โšก Lightning-Fast Inference
Highly optimized for real-world deployment:
- **GPU**: Extremely fast generation for real-time applications
- **CPU**: Efficient performance for edge devices and local deployments
- Low latency makes it suitable for interactive applications
### ๐Ÿ“– Audiobook Mode
Built for long-form content with:
- Intelligent sentence splitting and paragraph handling
- Natural pauses between sentences
- Consistent voice quality across extended text
- Stable rhythm and pacing throughout
### ๐ŸŽ›๏ธ Fine-Grained Voice Control
Customize speech output with intuitive parameters:
- **Emotion (Noise Scale)** โ€” Control expressiveness and variation
- **Rhythm (Noise Width)** โ€” Adjust timing and flow
- **Speed (Length Scale)** โ€” Modify speaking rate
### ๐ŸŒ Open & Accessible
Model weights and configuration files are publicly hosted on Hugging Face:
- ๐Ÿ“ฆ **SafeTensors** format for secure, fast loading
- ๐Ÿ”“ Available for research and experimentation
- ๐Ÿš€ Easy integration with your projects
---
## โš ๏ธ Limitations & Transparency
Sonya TTS is a research project and **not a perfect commercial solution**:
- **Word skipping**: Occasionally skips or merges words in complex sentences
- **Pronunciation**: Some uncommon words may be mispronounced
- **Alignment artifacts**: Rare timing issues in very long passages
- **Single speaker**: Currently supports only one English voice
- **Language**: English only at this time
Despite these limitations, Sonya demonstrates strong practical usability and expressive quality.
---
## ๐Ÿง  Training Journey
This project was a deep dive into modern speech synthesis:
| Detail | Value |
|--------|-------|
| **Architecture** | VITS (Conditional VAE + GAN) |
| **Training Steps** | ~10,400 |
| **Dataset** | Public expressive speech corpus |
| **Language** | English |
| **Speaker** | Single female voice |
| **Training Focus** | Emotion, prosody, and long-form stability |
### What I Learned
Building Sonya taught me invaluable lessons about:
- Text-to-speech alignment mechanisms and attention
- Prosody control and emotional expressiveness
- Audio generation pipelines and vocoding
- Model optimization for inference speed
- Packaging and deployment of ML models
- Real-world challenges in speech synthesis
---
## ๐Ÿ“ฆ Repository Structure
```
Sonya-TTS/
├── checkpoints/
│   ├── sonya-tts.safetensors   # Model weights (SafeTensors format)
│   └── config.json             # Model configuration
│
├── tts/                        # Core model architecture
│   ├── models.py
│   ├── commons.py
│   └── modules.py
│
├── text/                       # Text processing pipeline
│   ├── symbols.py
│   ├── cleaners.py
│   └── __init__.py
│
├── infer.py                    # CLI for short text synthesis
├── audiobook.py                # Long-form narration script
├── webui.py                    # Gradio web interface
│
├── examples/
│   ├── short.wav               # Quick speech demo
│   └── long.wav                # Audiobook demo
│
├── logo.png                    # Project logo
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
---
## ๐Ÿš€ Installation & Setup
### Prerequisites
- Python 3.10 or higher
- Conda (recommended) or virtualenv
- eSpeak-NG (for phonemization)
### Step 1: Create Environment
```bash
# Create a new conda environment
conda create -n sonya-tts python=3.10 -y
# Activate the environment
conda activate sonya-tts
```
### Step 2: Install eSpeak-NG
**๐ŸชŸ Windows**
1. Download the installer from [eSpeak-NG Releases](https://github.com/espeak-ng/espeak-ng/releases)
2. Run the installer and follow the setup wizard
3. Add eSpeak to your system PATH if not done automatically
**๐Ÿง Linux (Ubuntu/Debian)**
```bash
sudo apt update
sudo apt install espeak-ng
```
**๐ŸŽ macOS**
```bash
# Using Homebrew
brew install espeak-ng
```
### Step 3: Install Dependencies
```bash
# Install all required Python packages
pip install -r requirements.txt
```
### Step 4: Launch Sonya TTS
```bash
# Start the web interface
python webui.py
```
The terminal will display a local URL (typically `http://127.0.0.1:7860`). Open it in your browser to access the interface!
---
## ๐ŸŽฏ Usage Options
Sonya TTS provides three flexible ways to generate speech:
### 1️⃣ `infer.py` — Quick Synthesis
Perfect for generating single audio files from short text:
```bash
python infer.py
```
**Use Case**: Quick testing, automation scripts, batch processing
### 2๏ธโƒฃ `audiobook.py` โ€” Long-Form Narration
Designed for extended text with intelligent sentence splitting:
```bash
python audiobook.py
```
**Features**:
- Automatic paragraph detection
- Natural pauses between sentences
- Consistent voice across long passages
- Perfect for audiobooks, articles, and documentation
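The splitting-and-stitching step above can be sketched in plain Python. This is a minimal illustration, not the shipped script: the sample rate and pause length are assumptions (check `config.json` for the model's actual output rate).

```python
import re

SAMPLE_RATE = 22050   # assumed output rate; the real value is in config.json
PAUSE_SEC = 0.3       # illustrative inter-sentence pause


def split_sentences(text: str) -> list[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stitch(chunks: list[list[float]], pause_sec: float = PAUSE_SEC) -> list[float]:
    """Concatenate per-sentence audio, inserting silence between sentences."""
    silence = [0.0] * int(pause_sec * SAMPLE_RATE)
    out: list[float] = []
    for i, chunk in enumerate(chunks):
        if i:
            out.extend(silence)
        out.extend(chunk)
    return out
```

Each sentence would be synthesized independently and the resulting waveforms passed to `stitch`, which is what keeps pacing stable over long passages.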
### 3๏ธโƒฃ `webui.py` โ€” Interactive Web Interface
Beautiful Gradio-powered UI with real-time controls:
```bash
python webui.py
```
**Features**:
- Adjustable emotion, rhythm, and speed sliders
- Audiobook mode toggle
- Download generated audio
- No coding required!
---
## ๐ŸŒ Model Hosting
All model files are hosted on Hugging Face for easy access:
**๐Ÿค— Model Repository**: [PatnaikAshish/Sonya-TTS](https://huggingface.co/PatnaikAshish/Sonya-TTS)
**Files in `checkpoints/` directory**:
- `sonya-tts.safetensors` โ€” Model weights (SafeTensors format)
- `config.json` โ€” Model configuration and hyperparameters
The code **automatically downloads** these files on first run if they're not present locally. No manual setup needed!
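A minimal sketch of that first-run download using the `huggingface_hub` client. The file paths assume the `checkpoints/` layout listed above, and the import is kept inside the function so the file list is usable without the dependency installed:

```python
CHECKPOINT_FILES = [
    "checkpoints/sonya-tts.safetensors",  # model weights
    "checkpoints/config.json",            # configuration and hyperparameters
]


def download_checkpoints(repo_id: str = "PatnaikAshish/Sonya-TTS") -> list[str]:
    """Fetch model files from the Hub, returning local cached paths."""
    from huggingface_hub import hf_hub_download  # lazy import
    return [hf_hub_download(repo_id=repo_id, filename=f) for f in CHECKPOINT_FILES]
```

`hf_hub_download` caches files locally, so repeated runs skip the network entirely.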
---
## ๐ŸŽ›๏ธ Advanced Configuration
You can customize the voice output by adjusting these parameters:
| Parameter | Range | Effect |
|-----------|-------|--------|
| **noise_scale** | 0.1 - 1.0 | Controls emotion and expressiveness (higher = more variation) |
| **noise_scale_w** | 0.1 - 1.0 | Affects rhythm and timing (higher = more natural pauses) |
| **length_scale** | 0.5 - 2.0 | Controls speaking speed (lower = faster, higher = slower) |
Example in code (the `synthesize` call is illustrative; use whichever entry point you run):
```python
audio = synthesize(
    text="Your text here",
    noise_scale=0.667,   # moderate emotion
    noise_scale_w=0.8,   # natural rhythm
    length_scale=1.0,    # normal speed
)
```
---
## ๐Ÿ’ก Use Cases
Sonya TTS is versatile and can be used for:
- ๐Ÿ“š **Audiobook Production** โ€” Convert books and articles to speech
- ๐ŸŽฎ **Game Narration** โ€” Dynamic voiceovers for indie games
- ๐Ÿ“ฑ **Accessibility Tools** โ€” Screen readers and assistive technology
- ๐ŸŽ“ **E-Learning** โ€” Educational content narration
- ๐Ÿค– **Virtual Assistants** โ€” Expressive voice for chatbots
- ๐Ÿ“ป **Podcast Intros** โ€” Quick voiceovers and announcements
- ๐ŸŽฌ **Prototyping** โ€” Rapid audio mockups for videos
---
## ๐Ÿ”ง Technical Details
### VITS Architecture
Sonya uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which combines:
- **Conditional VAE** for probabilistic acoustic modeling
- **GAN-based training** for high-quality audio generation
- **Normalizing flows** for flexible distribution modeling
- **Stochastic duration prediction** for natural timing
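The CVAE part of this objective is the conditional evidence lower bound from the VITS paper, with text condition $c$, latent $z$, and waveform $x$:

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
- D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p_\theta(z \mid c)\bigr)
```

The reconstruction term is trained adversarially with the GAN losses, while the normalizing flow makes the prior $p_\theta(z \mid c)$ expressive enough to match the posterior.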
### Performance Benchmarks
- **GPU (NVIDIA RTX 3090)**: ~0.1s for 10 seconds of audio
- **CPU (Intel i7-12700K)**: ~2s for 10 seconds of audio
- Real-time factor: 10x-100x depending on hardware
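The real-time factor quoted here is the ratio of audio duration to generation time. As a quick sanity check, using the GPU figure above:

```python
def real_time_factor(audio_sec: float, wall_sec: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_sec <= 0:
        raise ValueError("wall_sec must be positive")
    return audio_sec / wall_sec


# 10 s of audio generated in ~0.1 s of wall time is roughly 100x real time
rtf = real_time_factor(10.0, 0.1)
```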
---
## ๐Ÿ“œ License & Citation
This project is released under the MIT License. If you use Sonya TTS in your projects, please credit:
```bibtex
@software{sonya_tts_2026,
  author = {Ashish Patnaik},
  title  = {Sonya TTS: An Expressive Neural Voice Engine},
  year   = {2026},
  url    = {https://huggingface.co/PatnaikAshish/Sonya-TTS}
}
```
Also see the original VITS repository: [jaywalnut310/vits](https://github.com/jaywalnut310/vits)
---
## ๐Ÿ’œ Final Words
Sonya TTS represents countless hours of experimentation, training, debugging, and iteration. It's not perfectโ€”but it's real, it's fast, and it's expressive.
This project taught me that building AI isn't just about achieving perfect metrics; it's about creating something useful, understanding the challenges deeply, and sharing knowledge with the community.
If Sonya helps you in any wayโ€”whether for a project, learning, or just explorationโ€”I'd genuinely love to hear about it.
โœจ **Thank you for listening to Sonya.**
---
## ๐Ÿ‘ค Author
**Ashish Patnaik**
๐Ÿค— Hugging Face: [@PatnaikAshish](https://huggingface.co/PatnaikAshish)
๐Ÿ“ง Reach out for collaborations or questions!
---
## Acknowledgements
1. Training dataset: [Jinsaryko/Elise](https://huggingface.co/datasets/Jinsaryko/Elise)
2. VITS model: [jaywalnut310/vits](https://github.com/jaywalnut310/vits)
## ๐Ÿ”— Quick Links
- [๐Ÿค— Model on Hugging Face](https://huggingface.co/PatnaikAshish/Sonya-TTS)
- [๐Ÿ“– VITS Paper](https://arxiv.org/abs/2106.06103)
- [๐ŸŽค eSpeak-NG](https://github.com/espeak-ng/espeak-ng)
---
<p align="center">
<sub>Made with ๐Ÿ’œ by Ashish Patnaik</sub>
</p>