|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-to-speech |
|
|
tags: |
|
|
- voice |
|
|
- speech |
|
|
- tts |
|
|
- vits |
|
|
- expressive-voice |
|
|
- gradio |
|
|
- neural-tts |
|
|
datasets: |
|
|
- Jinsaryko/Elise |
|
|
--- |
|
|
<p align="center"> |
|
|
<img src="logo.png" alt="Sonya TTS Logo" width="800"/> |
|
|
</p> |
|
|
|
|
|
<h1 align="center">✨ Sonya TTS</h1>
|
|
<h3 align="center">A Beautiful, Expressive Neural Voice Engine</h3> |
|
|
|
|
|
<p align="center"> |
|
|
<em>High-fidelity AI speech with emotion, rhythm, and audiobook-quality narration</em> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://huggingface.co/PatnaikAshish/Sonya-TTS"> |
|
|
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow" alt="Hugging Face"/>
|
|
</a> |
|
|
<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS"> |
|
|
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
|
|
</a> |
|
|
<img src="https://img.shields.io/badge/Language-English-blue" alt="Language"/> |
|
|
<img src="https://img.shields.io/badge/Architecture-VITS-green" alt="VITS"/> |
|
|
<img src="https://img.shields.io/badge/Python-3.10-brightgreen" alt="Python"/> |
|
|
|
|
|
</p> |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎧 Listen to Sonya
|
|
|
|
|
Experience the expressive quality of Sonya TTS: |
|
|
|
|
|
<div align="center"> |
|
|
<video width="800" controls autoplay loop muted> |
|
|
<source src="https://huggingface.co/PatnaikAshish/Sonya-TTS/resolve/main/demo.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
|
|
|
*Extended narration showcasing rhythm control, natural pauses, and consistent tone across paragraphs. More examples are available in the `examples/` folder.*
|
|
|
|
|
Try the live demo on Hugging Face Spaces:

<a href="https://huggingface.co/spaces/PatnaikAshish/Sonya-TTS">
<img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Demo-yellow" alt="Hugging Face Demo"/>
</a>
|
|
|
|
|
--- |
|
|
|
|
|
## 🌸 About Sonya TTS
|
|
|
|
|
**Sonya TTS** is a lightweight, expressive **single-speaker English Text-to-Speech model** built on the **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)** architecture. |
|
|
|
|
|
Trained for approximately **10,400 steps** on a publicly available **expressive voice dataset**, Sonya delivers:
|
|
|
|
|
- 🎭 **Natural emotion and intonation** – More human-like speech with genuine expressiveness
- 🎵 **Smooth rhythm and prosody** – Natural flow and timing in speech
- 📖 **Long-form narration** – Perfect for audiobook-style content with consistent quality
- ⚡ **Blazing-fast inference** – Optimized for both **GPU and CPU** deployment
|
|
|
|
|
This isn't just a model; it's a complete, production-ready TTS system with a web interface, command-line tools, and audiobook narration capabilities.
|
|
|
|
|
**GitHub Repository**: https://github.com/Ashish-Patnaik/Sonya-TTS
|
|
|
|
|
--- |
|
|
|
|
|
## ✨ Key Features
|
|
|
|
|
### 🎭 Expressive Voice Quality
|
|
Unlike monotone TTS models, Sonya produces speech with natural emotion, dynamic intonation, and human-like expressiveness. Trained on an expressive dataset, it captures the nuances that make speech feel alive. |
|
|
|
|
|
### ⚡ Lightning-Fast Inference
|
|
Highly optimized for real-world deployment: |
|
|
- **GPU**: Extremely fast generation for real-time applications |
|
|
- **CPU**: Efficient performance for edge devices and local deployments |
|
|
- Low latency makes it suitable for interactive applications |
|
|
|
|
|
### 📖 Audiobook Mode
|
|
Built for long-form content with: |
|
|
- Intelligent sentence splitting and paragraph handling |
|
|
- Natural pauses between sentences |
|
|
- Consistent voice quality across extended text |
|
|
- Stable rhythm and pacing throughout |
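The split-and-stitch idea behind this mode can be sketched in a few lines. Note that `split_sentences` and `stitch_with_pauses` are illustrative helpers, not the actual functions inside `audiobook.py`:

```python
import re
import numpy as np

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on ., !, or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def stitch_with_pauses(chunks: list[np.ndarray], sr: int = 22050,
                       pause_s: float = 0.35) -> np.ndarray:
    # Insert a short silence between per-sentence audio chunks so the
    # narration breathes naturally instead of running sentences together
    silence = np.zeros(int(sr * pause_s), dtype=np.float32)
    pieces: list[np.ndarray] = []
    for i, chunk in enumerate(chunks):
        pieces.append(chunk)
        if i < len(chunks) - 1:
            pieces.append(silence)
    return np.concatenate(pieces)
```

Synthesizing each sentence separately and stitching the waveforms is also what keeps alignment stable over very long passages.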
|
|
|
|
|
### 🎛️ Fine-Grained Voice Control
|
|
Customize speech output with intuitive parameters: |
|
|
- **Emotion (`noise_scale`)** – Control expressiveness and variation
- **Rhythm (`noise_scale_w`)** – Adjust timing and flow
- **Speed (`length_scale`)** – Modify speaking rate
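A small guard can keep these controls inside sensible bounds. The ranges follow the Advanced Configuration table further down; the helper itself is a hypothetical convenience, not part of the shipped scripts:

```python
def clamped(value: float, lo: float, hi: float) -> float:
    # Clamp a value into the inclusive range [lo, hi]
    return max(lo, min(hi, value))

def sanitize_controls(noise_scale: float = 0.667,
                      noise_scale_w: float = 0.8,
                      length_scale: float = 1.0) -> dict:
    # Documented ranges: noise scales 0.1-1.0, length scale 0.5-2.0
    return {
        "noise_scale": clamped(noise_scale, 0.1, 1.0),
        "noise_scale_w": clamped(noise_scale_w, 0.1, 1.0),
        "length_scale": clamped(length_scale, 0.5, 2.0),
    }
```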
|
|
|
|
|
### 🌍 Open & Accessible
|
|
Model weights and configuration files are publicly hosted on Hugging Face: |
|
|
- 📦 **SafeTensors** format for secure, fast loading
- 🔓 Available for research and experimentation
- 🔗 Easy integration with your projects
|
|
|
|
|
--- |
|
|
|
|
|
## ⚠️ Limitations & Transparency
|
|
|
|
|
Sonya TTS is a research project and **not a perfect commercial solution**: |
|
|
|
|
|
- **Word skipping**: Occasionally skips or merges words in complex sentences |
|
|
- **Pronunciation**: Some uncommon words may be mispronounced |
|
|
- **Alignment artifacts**: Rare timing issues in very long passages |
|
|
- **Single speaker**: Currently supports only one English voice |
|
|
- **Language**: English only at this time |
|
|
|
|
|
Despite these limitations, Sonya demonstrates strong practical usability and expressive quality. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Training Journey
|
|
|
|
|
This project was a deep dive into modern speech synthesis: |
|
|
|
|
|
| Detail | Value | |
|
|
|--------|-------| |
|
|
| **Architecture** | VITS (Conditional VAE + GAN) | |
|
|
| **Training Steps** | ~10,400 | |
|
|
| **Dataset** | Public expressive speech corpus | |
|
|
| **Language** | English | |
|
|
| **Speaker** | Single female voice | |
|
|
| **Training Focus** | Emotion, prosody, and long-form stability | |
|
|
|
|
|
### What I Learned |
|
|
Building Sonya taught me invaluable lessons about: |
|
|
- Text-to-speech alignment mechanisms and attention |
|
|
- Prosody control and emotional expressiveness |
|
|
- Audio generation pipelines and vocoding |
|
|
- Model optimization for inference speed |
|
|
- Packaging and deployment of ML models |
|
|
- Real-world challenges in speech synthesis |
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Repository Structure
|
|
|
|
|
```
Sonya-TTS/
├── checkpoints/
│   ├── sonya-tts.safetensors   # Model weights (SafeTensors format)
│   └── config.json             # Model configuration
│
├── tts/                        # Core model architecture
│   ├── models.py
│   ├── commons.py
│   └── modules.py
│
├── text/                       # Text processing pipeline
│   ├── symbols.py
│   ├── cleaners.py
│   └── __init__.py
│
├── infer.py                    # CLI for short text synthesis
├── audiobook.py                # Long-form narration script
├── webui.py                    # Gradio web interface
│
├── examples/
│   ├── short.wav               # Quick speech demo
│   └── long.wav                # Audiobook demo
│
├── logo.png                    # Project logo
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Installation & Setup
|
|
|
|
|
### Prerequisites |
|
|
- Python 3.10 or higher |
|
|
- Conda (recommended) or virtualenv |
|
|
- eSpeak-NG (for phonemization) |
|
|
|
|
|
### Step 1: Create Environment |
|
|
|
|
|
```bash |
|
|
# Create a new conda environment |
|
|
conda create -n sonya-tts python=3.10 -y |
|
|
|
|
|
# Activate the environment |
|
|
conda activate sonya-tts |
|
|
``` |
|
|
|
|
|
### Step 2: Install eSpeak-NG |
|
|
|
|
|
**🪟 Windows**
|
|
1. Download the installer from [eSpeak-NG Releases](https://github.com/espeak-ng/espeak-ng/releases) |
|
|
2. Run the installer and follow the setup wizard |
|
|
3. Add eSpeak to your system PATH if not done automatically |
|
|
|
|
|
**🐧 Linux (Ubuntu/Debian)**
|
|
```bash |
|
|
sudo apt update |
|
|
sudo apt install espeak-ng |
|
|
``` |
|
|
|
|
|
**🍎 macOS**
|
|
```bash |
|
|
# Using Homebrew |
|
|
brew install espeak-ng |
|
|
``` |
|
|
|
|
|
### Step 3: Install Dependencies |
|
|
|
|
|
```bash |
|
|
# Install all required Python packages |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Step 4: Launch Sonya TTS |
|
|
|
|
|
```bash |
|
|
# Start the web interface |
|
|
python webui.py |
|
|
``` |
|
|
|
|
|
The terminal will display a local URL (typically `http://127.0.0.1:7860`). Open it in your browser to access the interface! |
|
|
|
|
|
--- |
|
|
|
|
|
## 🎯 Usage Options
|
|
|
|
|
Sonya TTS provides three flexible ways to generate speech: |
|
|
|
|
|
### 1️⃣ `infer.py` – Single-File Synthesis
|
|
|
|
|
Perfect for generating single audio files from short text: |
|
|
|
|
|
```bash |
|
|
python infer.py |
|
|
``` |
|
|
|
|
|
**Use Case**: Quick testing, automation scripts, batch processing |
|
|
|
|
|
### 2️⃣ `audiobook.py` – Long-Form Narration
|
|
|
|
|
Designed for extended text with intelligent sentence splitting: |
|
|
|
|
|
```bash |
|
|
python audiobook.py |
|
|
``` |
|
|
|
|
|
**Features**: |
|
|
- Automatic paragraph detection |
|
|
- Natural pauses between sentences |
|
|
- Consistent voice across long passages |
|
|
- Perfect for audiobooks, articles, and documentation |
|
|
|
|
|
### 3️⃣ `webui.py` – Interactive Web Interface
|
|
|
|
|
Beautiful Gradio-powered UI with real-time controls: |
|
|
|
|
|
```bash |
|
|
python webui.py |
|
|
``` |
|
|
|
|
|
**Features**: |
|
|
- Adjustable emotion, rhythm, and speed sliders |
|
|
- Audiobook mode toggle |
|
|
- Download generated audio |
|
|
- No coding required! |
|
|
|
|
|
--- |
|
|
|
|
|
## 🌐 Model Hosting
|
|
|
|
|
All model files are hosted on Hugging Face for easy access: |
|
|
|
|
|
**🤗 Model Repository**: [PatnaikAshish/Sonya-TTS](https://huggingface.co/PatnaikAshish/Sonya-TTS)
|
|
|
|
|
**Files in the `checkpoints/` directory**:
- `sonya-tts.safetensors` – Model weights (SafeTensors format)
- `config.json` – Model configuration and hyperparameters
|
|
|
|
|
The code **automatically downloads** these files on first run if they're not present locally. No manual setup needed! |
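The download-if-missing pattern looks roughly like this. The function below is a hedged sketch using `huggingface_hub`, not the exact code in the repository:

```python
from pathlib import Path

def ensure_checkpoint(filename: str,
                      local_dir: str = "checkpoints",
                      repo_id: str = "PatnaikAshish/Sonya-TTS") -> str:
    """Return a local path to the file, fetching it from the Hub if absent."""
    local_path = Path(local_dir) / Path(filename).name
    if local_path.exists():
        return str(local_path)
    # Lazy import so the local-cache check works even without huggingface_hub installed
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename)
```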
|
|
|
|
|
--- |
|
|
|
|
|
## 🎛️ Advanced Configuration
|
|
|
|
|
You can customize the voice output by adjusting these parameters: |
|
|
|
|
|
| Parameter | Range | Effect | |
|
|
|-----------|-------|--------| |
|
|
| **noise_scale** | 0.1 - 1.0 | Controls emotion and expressiveness (higher = more variation) | |
|
|
| **noise_scale_w** | 0.1 - 1.0 | Affects rhythm and timing (higher = more natural pauses) | |
|
|
| **length_scale** | 0.5 - 2.0 | Controls speaking speed (lower = faster, higher = slower) | |
|
|
|
|
|
Example in code (illustrative – the exact function name depends on which script you call):

```python
audio = synthesize(        # hypothetical wrapper around the model's inference call
    text="Your text here",
    noise_scale=0.667,     # Moderate emotion
    noise_scale_w=0.8,     # Natural rhythm
    length_scale=1.0,      # Normal speed
)
```
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 Use Cases
|
|
|
|
|
Sonya TTS is versatile and can be used for: |
|
|
|
|
|
- 📚 **Audiobook Production** – Convert books and articles to speech
- 🎮 **Game Narration** – Dynamic voiceovers for indie games
- 📱 **Accessibility Tools** – Screen readers and assistive technology
- 🎓 **E-Learning** – Educational content narration
- 🤖 **Virtual Assistants** – Expressive voice for chatbots
- 📻 **Podcast Intros** – Quick voiceovers and announcements
- 🎬 **Prototyping** – Rapid audio mockups for videos
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Technical Details
|
|
|
|
|
### VITS Architecture |
|
|
Sonya uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which combines: |
|
|
- **Conditional VAE** for probabilistic acoustic modeling |
|
|
- **GAN-based training** for high-quality audio generation |
|
|
- **Normalizing flows** for flexible distribution modeling |
|
|
- **Stochastic duration prediction** for natural timing |
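At its core, the conditional VAE is trained to maximize the standard evidence lower bound on the likelihood of waveform $x$ given text condition $c$, with latent $z$, approximate posterior $q_\phi$, and prior $p_\theta$:

```latex
\log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid c)\right)
```

The GAN discriminator and the stochastic duration predictor contribute additional loss terms on top of this objective, as described in the VITS paper.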
|
|
|
|
|
### Performance Benchmarks |
|
|
- **GPU (NVIDIA RTX 3090)**: ~0.1s for 10 seconds of audio |
|
|
- **CPU (Intel i7-12700K)**: ~2s for 10 seconds of audio |
|
|
- Real-time factor: roughly 5x–100x depending on hardware
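The real-time factor (RTF) is simply audio duration divided by synthesis time, so higher means faster:

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    # RTF > 1 means the model generates audio faster than it plays back
    return audio_seconds / synthesis_seconds

# With the rough benchmark numbers above:
gpu_rtf = real_time_factor(10.0, 0.1)   # about 100x on the RTX 3090
cpu_rtf = real_time_factor(10.0, 2.0)   # about 5x on the i7-12700K
```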
|
|
|
|
|
--- |
|
|
|
|
|
## 📄 License & Citation
|
|
This project is released under the MIT License. If you use Sonya TTS in your projects, please credit:
|
|
|
|
|
```bibtex |
|
|
@software{sonya_tts_2026, |
|
|
author = {Ashish Patnaik}, |
|
|
title = {Sonya TTS: An Expressive Neural Voice Engine}, |
|
|
year = {2026}, |
|
|
url = {https://huggingface.co/PatnaikAshish/Sonya-TTS} |
|
|
} |
|
|
``` |
|
|
Also see the original VITS repository: https://github.com/jaywalnut310/vits
|
|
|
|
|
--- |
|
|
|
|
|
## 🌟 Final Words
|
|
|
|
|
Sonya TTS represents countless hours of experimentation, training, debugging, and iteration. It's not perfect, but it's real, it's fast, and it's expressive.
|
|
|
|
|
This project taught me that building AI isn't just about achieving perfect metrics; it's about creating something useful, understanding the challenges deeply, and sharing knowledge with the community. |
|
|
|
|
|
If Sonya helps you in any way, whether for a project, learning, or just exploration, I'd genuinely love to hear about it.
|
|
|
|
|
✨ **Thank you for listening to Sonya.**
|
|
|
|
|
--- |
|
|
|
|
|
## 👤 Author
|
|
|
|
|
**Ashish Patnaik** |
|
|
🤗 Hugging Face: [@PatnaikAshish](https://huggingface.co/PatnaikAshish)
|
|
📧 Reach out for collaborations or questions!
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements

1. Dataset used for training: https://huggingface.co/datasets/Jinsaryko/Elise
2. VITS model: https://github.com/jaywalnut310/vits
|
|
|
|
|
|
|
|
## 🔗 Quick Links
|
|
|
|
|
- [🤗 Model on Hugging Face](https://huggingface.co/PatnaikAshish/Sonya-TTS)
|
|
- [📄 VITS Paper](https://arxiv.org/abs/2106.06103)
|
|
- [🎤 eSpeak-NG](https://github.com/espeak-ng/espeak-ng)
|
|
|
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<sub>Made with 💖 by Ashish Patnaik</sub>
|
|
</p> |