---
license: mit
language:
- en
pipeline_tag: text-to-speech
tags:
- voice
- speech
- tts
- vits
- expressive-voice
- gradio
- neural-tts
datasets:
- Jinsaryko/Elise
---

Sonya TTS Logo

# ✨ Sonya TTS

A Beautiful, Expressive Neural Voice Engine

High-fidelity AI speech with emotion, rhythm, and audiobook-quality narration


---

## 🎧 Listen to Sonya

Experience the expressive quality of Sonya TTS:
*Extended narration showcasing rhythm control, natural pauses, and consistent tone across paragraphs. More examples are available in the `examples/` folder.*

Try the live demo on Hugging Face Spaces.

---

## 🌸 About Sonya TTS

**Sonya TTS** is a lightweight, expressive **single-speaker English text-to-speech model** built on the **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)** architecture.

Trained for approximately **10,000 steps** on a publicly available **expressive voice dataset**, Sonya delivers:

- 🎭 **Natural emotion and intonation** – more human-like speech with genuine expressiveness
- 🎵 **Smooth rhythm and prosody** – natural flow and timing in speech
- 📖 **Long-form narration** – audiobook-style content with consistent quality
- ⚡ **Blazing-fast inference** – optimized for both **GPU and CPU** deployment

This isn't just a model; it's a complete, production-ready TTS system with a web interface, command-line tools, and audiobook narration capabilities.

GitHub repository: https://github.com/Ashish-Patnaik/Sonya-TTS

---

## ✨ Key Features

### 🎭 Expressive Voice Quality

Unlike monotone TTS models, Sonya produces speech with natural emotion, dynamic intonation, and human-like expressiveness. Trained on an expressive dataset, it captures the nuances that make speech feel alive.
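Sonya's expressiveness comes from the stochastic latent that VITS samples at inference time. As a rough illustration (plain NumPy, not the project's actual code), the "emotion" control scales the noise added around the model's mean prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma, noise_scale=0.667):
    """Sample a VITS-style prior latent: z = mu + sigma * eps * noise_scale.

    A larger noise_scale admits more variation around the mean prediction,
    which is heard as a more expressive, less monotone delivery.
    """
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.asarray(sigma) * eps * noise_scale

# With noise_scale=0.0 the sampling is deterministic (z == mu);
# values around 0.667 are a common expressive default.
```

Turning the noise scale down collapses the sample onto the mean, which is why low values sound flatter and more repeatable.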
### ⚡ Lightning-Fast Inference

Highly optimized for real-world deployment:

- **GPU**: extremely fast generation for real-time applications
- **CPU**: efficient performance for edge devices and local deployments
- Low latency makes it suitable for interactive applications

### 📖 Audiobook Mode

Built for long-form content with:

- Intelligent sentence splitting and paragraph handling
- Natural pauses between sentences
- Consistent voice quality across extended text
- Stable rhythm and pacing throughout

### 🎛️ Fine-Grained Voice Control

Customize speech output with intuitive parameters:

- **Emotion (noise scale)** – control expressiveness and variation
- **Rhythm (noise width)** – adjust timing and flow
- **Speed (length scale)** – modify the speaking rate

### 🌐 Open & Accessible

Model weights and configuration files are publicly hosted on Hugging Face:

- 📦 **SafeTensors** format for secure, fast loading
- 🔓 Available for research and experimentation
- 🚀 Easy integration with your projects

---

## ⚠️ Limitations & Transparency

Sonya TTS is a research project and **not a finished commercial solution**:

- **Word skipping**: occasionally skips or merges words in complex sentences
- **Pronunciation**: some uncommon words may be mispronounced
- **Alignment artifacts**: rare timing issues in very long passages
- **Single speaker**: currently supports only one English voice
- **Language**: English only at this time

Despite these limitations, Sonya demonstrates strong practical usability and expressive quality.
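The sentence splitting and pause insertion behind the audiobook mode described above can be sketched like this (a simplified illustration, not the project's actual `audiobook.py` logic; the sample rate is an assumption, check `config.json` for the real value):

```python
import re
import numpy as np

SAMPLE_RATE = 22050  # assumed output rate

def split_sentences(text):
    """Split text on sentence-ending punctuation, keeping the punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def join_with_pauses(chunks, pause_sec=0.4):
    """Concatenate per-sentence waveforms with short silences between them."""
    silence = np.zeros(int(SAMPLE_RATE * pause_sec), dtype=np.float32)
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.append(silence)
        out.append(chunk)
    return np.concatenate(out) if out else np.zeros(0, dtype=np.float32)
```

Each sentence is synthesized independently and then stitched back together with fixed-length silences, which is what keeps the pacing stable across very long passages.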
---

## 🧠 Training Journey

This project was a deep dive into modern speech synthesis:

| Detail | Value |
|--------|-------|
| **Architecture** | VITS (conditional VAE + GAN) |
| **Training Steps** | ~10,400 |
| **Dataset** | Public expressive speech corpus |
| **Language** | English |
| **Speaker** | Single female voice |
| **Training Focus** | Emotion, prosody, and long-form stability |

### What I Learned

Building Sonya taught me invaluable lessons about:

- Text-to-speech alignment mechanisms and attention
- Prosody control and emotional expressiveness
- Audio generation pipelines and vocoding
- Model optimization for inference speed
- Packaging and deployment of ML models
- Real-world challenges in speech synthesis

---

## 📦 Repository Structure

```
Sonya-TTS/
├── checkpoints/
│   ├── sonya-tts.safetensors   # Model weights (SafeTensors format)
│   └── config.json             # Model configuration
│
├── tts/                        # Core model architecture
│   ├── models.py
│   ├── commons.py
│   └── modules.py
│
├── text/                       # Text processing pipeline
│   ├── symbols.py
│   ├── cleaners.py
│   └── __init__.py
│
├── infer.py                    # CLI for short text synthesis
├── audiobook.py                # Long-form narration script
├── webui.py                    # Gradio web interface
│
├── examples/
│   ├── short.wav               # Quick speech demo
│   └── long.wav                # Audiobook demo
│
├── logo.png                    # Project logo
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```

---

## 🚀 Installation & Setup

### Prerequisites

- Python 3.10 or higher
- Conda (recommended) or virtualenv
- eSpeak-NG (for phonemization)

### Step 1: Create Environment

```bash
# Create a new conda environment
conda create -n sonya-tts python=3.10 -y

# Activate the environment
conda activate sonya-tts
```

### Step 2: Install eSpeak-NG

**🪟 Windows**

1. Download the installer from [eSpeak-NG Releases](https://github.com/espeak-ng/espeak-ng/releases)
2. Run the installer and follow the setup wizard
3. Add eSpeak to your system PATH if this is not done automatically

**🐧 Linux (Ubuntu/Debian)**

```bash
sudo apt update
sudo apt install espeak-ng
```

**🍎 macOS**

```bash
# Using Homebrew
brew install espeak-ng
```

### Step 3: Install Dependencies

```bash
# Install all required Python packages
pip install -r requirements.txt
```

### Step 4: Launch Sonya TTS

```bash
# Start the web interface
python webui.py
```

The terminal will display a local URL (typically `http://127.0.0.1:7860`). Open it in your browser to access the interface!

---

## 🎯 Usage Options

Sonya TTS provides three flexible ways to generate speech:

### 1️⃣ `infer.py` – Short Text Synthesis

Perfect for generating single audio files from short text:

```bash
python infer.py
```

**Use case**: quick testing, automation scripts, batch processing

### 2️⃣ `audiobook.py` – Long-Form Narration

Designed for extended text with intelligent sentence splitting:

```bash
python audiobook.py
```

**Features**:

- Automatic paragraph detection
- Natural pauses between sentences
- Consistent voice across long passages
- Perfect for audiobooks, articles, and documentation

### 3️⃣ `webui.py` – Interactive Web Interface

Beautiful Gradio-powered UI with real-time controls:

```bash
python webui.py
```

**Features**:

- Adjustable emotion, rhythm, and speed sliders
- Audiobook mode toggle
- Download generated audio
- No coding required!

---

## 🌐 Model Hosting

All model files are hosted on Hugging Face for easy access:

**🤗 Model repository**: [PatnaikAshish/Sonya-TTS](https://huggingface.co/PatnaikAshish/Sonya-TTS)

**Files in the `checkpoints/` directory**:

- `sonya-tts.safetensors` – model weights (SafeTensors format)
- `config.json` – model configuration and hyperparameters

The code **automatically downloads** these files on first run if they're not present locally.
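That first-run check amounts to something like the sketch below (a hypothetical helper, not the project's actual code; the real fetch would use `huggingface_hub.hf_hub_download` against the repository above):

```python
from pathlib import Path

# Files expected in the checkpoints/ directory (from the model repository)
CHECKPOINT_FILES = ["sonya-tts.safetensors", "config.json"]

def missing_checkpoints(checkpoint_dir="checkpoints"):
    """Return the checkpoint files that are not yet present locally."""
    ckpt = Path(checkpoint_dir)
    return [name for name in CHECKPOINT_FILES if not (ckpt / name).exists()]

# Anything returned here would then be fetched from the Hub, e.g. with
# hf_hub_download(repo_id="PatnaikAshish/Sonya-TTS", filename=name)
```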
No manual setup is needed!

---

## 🎛️ Advanced Configuration

You can customize the voice output by adjusting these parameters:

| Parameter | Range | Effect |
|-----------|-------|--------|
| **noise_scale** | 0.1 – 1.0 | Controls emotion and expressiveness (higher = more variation) |
| **noise_scale_w** | 0.1 – 1.0 | Affects rhythm and timing (higher = more natural pauses) |
| **length_scale** | 0.5 – 2.0 | Controls speaking speed (lower = faster, higher = slower) |

Example in code (keyword arguments passed to the synthesis call; see `infer.py` for the exact entry point):

```python
text="Your text here",
noise_scale=0.667,   # Moderate emotion
noise_scale_w=0.8,   # Natural rhythm
length_scale=1.0     # Normal speed
```

---

## 💡 Use Cases

Sonya TTS is versatile and can be used for:

- 📚 **Audiobook production** – convert books and articles to speech
- 🎮 **Game narration** – dynamic voiceovers for indie games
- 📱 **Accessibility tools** – screen readers and assistive technology
- 🎓 **E-learning** – educational content narration
- 🤖 **Virtual assistants** – expressive voices for chatbots
- 📻 **Podcast intros** – quick voiceovers and announcements
- 🎬 **Prototyping** – rapid audio mockups for videos

---

## 🔧 Technical Details

### VITS Architecture

Sonya uses VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), which combines:

- **Conditional VAE** for probabilistic acoustic modeling
- **GAN-based training** for high-quality audio generation
- **Normalizing flows** for flexible distribution modeling
- **Stochastic duration prediction** for natural timing

### Performance Benchmarks

- **GPU (NVIDIA RTX 3090)**: ~0.1 s for 10 seconds of audio
- **CPU (Intel i7-12700K)**: ~2 s for 10 seconds of audio
- Real-time factor: roughly 5x–100x, depending on hardware

---

## 📜 License & Citation

This project is released under the MIT License. If you use Sonya TTS in your projects, please credit:

```bibtex
@software{sonya_tts_2026,
  author = {Ashish Patnaik},
  title  = {Sonya TTS: An Expressive Neural Voice Engine},
  year   = {2026},
  url    = {https://huggingface.co/PatnaikAshish/Sonya-TTS}
}
```

Also see the original VITS repository: https://github.com/jaywalnut310/vits

---

## 💜 Final Words

Sonya TTS represents countless hours of experimentation, training, debugging, and iteration. It's not perfect, but it's real, it's fast, and it's expressive.

This project taught me that building AI isn't just about achieving perfect metrics; it's about creating something useful, understanding the challenges deeply, and sharing knowledge with the community.

If Sonya helps you in any way, whether for a project, learning, or just exploration, I'd genuinely love to hear about it.

✨ **Thank you for listening to Sonya.**

---

## 👤 Author

**Ashish Patnaik**

🤗 Hugging Face: [@PatnaikAshish](https://huggingface.co/PatnaikAshish)

📧 Reach out for collaborations or questions!

---

## Acknowledgements

1. Dataset used for training: https://huggingface.co/datasets/Jinsaryko/Elise
2. VITS model: https://github.com/jaywalnut310/vits

## 🔗 Quick Links

- [🤗 Model on Hugging Face](https://huggingface.co/PatnaikAshish/Sonya-TTS)
- [📖 VITS Paper](https://arxiv.org/abs/2106.06103)
- [🎤 eSpeak-NG](https://github.com/espeak-ng/espeak-ng)

---

Made with 💜 by Ashish Patnaik