title: Text-to-Speech
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: '3.10'
suggested_hardware: cpu-basic
short_description: High-quality TTS with 28 voices and style controls
tags:
- text-to-speech
- tts
- kokoro
- audio
Kokoro TTS - Academic Text-to-Speech Application
Created by Yash Chowdhary
A comprehensive, open-source Text-to-Speech application built for academic learning and demonstration. It is powered by the Kokoro-82M model, a lightweight yet high-quality TTS system with 82 million parameters.
Features
28 Built-in Voices
- 20 American English voices (11 female, 9 male)
- 8 British English voices (4 female, 4 male)
- Quality grades from A (premium) to F+ (basic), as listed in the Voice Reference tables
- Each voice has unique characteristics and recommended use cases
7 Style Presets
| Style | Description | Best For |
|---|---|---|
| Neutral Narrator | Clear, balanced narration | General content, documentation |
| Dramatic / Horror | Slower, deeper, suspenseful | Horror stories, dramatic readings |
| Excited / Surprised | Faster, higher energy | Announcements, exciting content |
| Calm / Meditative | Slow, soothing | Meditation guides, ASMR |
| Storyteller | Engaging narrative pace | Audiobooks, bedtime stories |
| Professional | Clear, authoritative | Business, corporate content |
| Cheerful / Friendly | Warm, upbeat | Tutorials, friendly explanations |
Full Audio Control
- Speed: 0.5x (slow) to 2.0x (fast)
- Pitch: -5 to +5 semitones
- Pauses: 0 to 1000 ms between sentences
- Real-time audio preview with download capability
Quick Start
Option 1: Hugging Face Spaces (Recommended)
- Go to Hugging Face
- Create a new Space with Gradio SDK
- Upload all files from this repository
- The Space will automatically install dependencies and launch
Option 2: Local Installation
# Clone or download this repository
git clone <your-repo-url>
cd kokoro-tts-app
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install system dependencies (Linux/Ubuntu)
sudo apt-get update
sudo apt-get install -y espeak-ng ffmpeg libsndfile1
# Install Python dependencies
pip install -r requirements.txt
# Run the application
python app.py
The app will be available at http://localhost:7860
Option 3: Docker
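FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    espeak-ng \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY app.py .
# Gradio binds to 127.0.0.1 by default; listen on all interfaces so the
# published port is reachable from outside the container
ENV GRADIO_SERVER_NAME=0.0.0.0
EXPOSE 7860
# Run
CMD ["python", "app.py"]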
Usage Guide
Basic Usage
- Enter Text: Type or paste your text (up to 5000 characters)
- Select Voice: Choose from 28 available voices
- Pick a Style: Select a style preset that matches your content
- Generate: Click the "Generate Speech" button
- Download: Use the download button on the audio player
Using Style Presets
Style presets automatically configure speed, pitch, and pause settings:
- "Use Style Preset Defaults" checked → style settings applied
- "Use Style Preset Defaults" unchecked → manual controls active
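The checkbox behaviour can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the `VoiceSettings` class, the `resolve_settings` helper, and every preset value in `EXAMPLE_PRESETS` are hypothetical. Only the style names and the slider ranges come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    speed: float   # playback rate multiplier, 0.5 to 2.0
    pitch: float   # semitones, -5 to +5
    pause_ms: int  # silence between sentences, 0 to 1000


# Hypothetical preset values, chosen only to illustrate the idea.
EXAMPLE_PRESETS = {
    "Neutral Narrator": VoiceSettings(1.0, 0.0, 300),
    "Dramatic / Horror": VoiceSettings(0.85, -2.0, 600),
    "Excited / Surprised": VoiceSettings(1.2, 1.5, 150),
}


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def resolve_settings(style, use_preset_defaults, speed, pitch, pause_ms):
    """Return preset settings when the box is checked, else the manual values."""
    if use_preset_defaults:
        return EXAMPLE_PRESETS[style]
    # Manual mode: keep each control inside its documented range.
    return VoiceSettings(
        speed=clamp(speed, 0.5, 2.0),
        pitch=clamp(pitch, -5.0, 5.0),
        pause_ms=int(clamp(pause_ms, 0, 1000)),
    )
```

With the box checked, the manual slider values are ignored entirely. With it unchecked, out-of-range values are clamped rather than rejected, which is one possible design choice among several.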
Advanced Customization
Uncheck "Use Style Preset Defaults" to manually control:
- Speed: Lower values = slower, more deliberate speech
- Pitch: Negative = deeper voice, Positive = higher voice
- Pause: Higher values = longer pauses between sentences
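To give the manual controls some intuition, the sketch below converts a semitone offset into a frequency ratio (each semitone multiplies frequency by the twelfth root of two) and a pause length into a sample count at the 24 kHz output rate. The function names are hypothetical and the app may compute these values differently.

```python
SAMPLE_RATE = 24_000  # Hz, the output rate listed under Technical Details


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def pause_to_samples(pause_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of silent samples needed for a pause of pause_ms milliseconds."""
    return round(sample_rate * pause_ms / 1000)


if __name__ == "__main__":
    for st in (-5, 0, 5):
        print(f"{st:+d} semitones -> frequency x{semitones_to_ratio(st):.3f}")
    print("500 ms pause ->", pause_to_samples(500), "samples")
```

At the extremes of the slider, +5 semitones raises frequencies by a factor of roughly 1.335 and -5 semitones lowers them to roughly 0.749 of the original.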
Sample Texts
The app includes sample texts for different scenarios:
- Welcome: General introduction text
- Horror: Spooky story excerpt (pair with Dramatic style)
- News: News broadcast style
- Story: Fairy tale opening (pair with Storyteller style)
- Technical: Technical documentation
Voice Reference
American English - Female (11 voices)
| Voice ID | Name | Grade | Description |
|---|---|---|---|
| af_heart | Heart ❤️ | A | Premium quality, warm and natural |
| af_bella | Bella 🔥 | A- | Clear and expressive |
| af_nicole | Nicole 🎧 | B- | Professional narrator style |
| af_aoede | Aoede | C+ | Melodic and pleasant |
| af_kore | Kore | C+ | Youthful and energetic |
| af_sarah | Sarah | C+ | Friendly and approachable |
| af_nova | Nova | C | Modern and crisp |
| af_sky | Sky | C- | Light and airy |
| af_alloy | Alloy | C | Balanced and versatile |
| af_jessica | Jessica | D | Casual conversational |
| af_river | River | D | Gentle and flowing |
American English - Male (9 voices)
| Voice ID | Name | Grade | Description |
|---|---|---|---|
| am_michael | Michael | C+ | Authoritative and clear |
| am_fenrir | Fenrir | C+ | Deep and resonant |
| am_puck | Puck | C+ | Playful and dynamic |
| am_echo | Echo | D | Warm and reflective |
| am_eric | Eric | D | Professional and steady |
| am_liam | Liam | D | Young and natural |
| am_onyx | Onyx | D | Rich and smooth |
| am_santa | Santa | D- | Jolly and festive |
| am_adam | Adam | F+ | Basic male voice |
British English - Female (4 voices)
| Voice ID | Name | Grade | Description |
|---|---|---|---|
| bf_emma | Emma | B- | Elegant British accent |
| bf_isabella | Isabella | C | Sophisticated and refined |
| bf_alice | Alice | D | Classic British tone |
| bf_lily | Lily | D | Soft and gentle |
British English - Male (4 voices)
| Voice ID | Name | Grade | Description |
|---|---|---|---|
| bm_george | George | C | Distinguished gentleman |
| bm_fable | Fable | C | Storyteller quality |
| bm_lewis | Lewis | D+ | Conversational British |
| bm_daniel | Daniel | D | Standard British male |
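The voice IDs in the tables above follow a visible pattern: the first letter marks the accent (a for American, b for British), the second the gender (f or m), and the part after the underscore is the voice name. A small helper, hypothetical and not taken from app.py, can decode an ID and recover the language code that the engine passes to KPipeline:

```python
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def parse_voice_id(voice_id: str) -> dict:
    """Split an ID such as 'af_heart' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang, gender = prefix  # two one-character fields
    if lang not in LANGUAGES or gender not in GENDERS:
        raise ValueError(f"unknown language or gender prefix: {prefix!r}")
    return {
        "lang_code": lang,
        "language": LANGUAGES[lang],
        "gender": GENDERS[gender],
        "name": name,
    }


if __name__ == "__main__":
    print(parse_voice_id("bm_george"))
```

Only the two English accents listed in this README are covered; IDs for other languages would raise a `ValueError` in this sketch.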
Architecture
┌────────────────────────────────────────────────────────────┐
│                    Gradio Web Interface                    │
│ ┌──────────────┬───────────────┬─────────────────────────┐ │
│ │ Text Input   │ Voice/Style   │ Advanced Controls       │ │
│ │              │ Selection     │ (Speed/Pitch/Pause)     │ │
│ └──────────────┴───────────────┴─────────────────────────┘ │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                     Text Preprocessing                     │
│  • Normalize abbreviations (Dr. → Doctor)                  │
│  • Clean whitespace                                        │
│  • Character limit enforcement                             │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                      KokoroTTSEngine                       │
│  • KPipeline (American 'a' / British 'b')                  │
│  • Voice pack loading                                      │
│  • Phoneme conversion via Misaki G2P                       │
│  • Neural audio synthesis                                  │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                   Audio Post-Processing                    │
│  • Pitch shifting (semitone adjustment)                    │
│  • Pause insertion between segments                        │
│  • Audio normalization (-3dB peak)                         │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                  Audio Output (24kHz WAV)                  │
│  • Playback in browser                                     │
│  • Download capability                                     │
└────────────────────────────────────────────────────────────┘
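As an illustration of the Text Preprocessing stage in the diagram, here is a minimal sketch of the three listed steps. It is written under assumptions: only the "Dr." expansion and the 5000-character limit come from this README, the other abbreviations are illustrative, and the real `preprocess_text()` in app.py may behave differently.

```python
import re

MAX_CHARS = 5000  # character limit stated in the Usage Guide

# Only "Dr." appears in the README; the other entries are illustrative.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "Prof.": "Professor"}


def preprocess_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Expand abbreviations, collapse whitespace, and enforce the length limit."""
    for abbr, full in ABBREVIATIONS.items():
        # Match the abbreviation as a whole token followed by whitespace.
        text = re.sub(rf"\b{re.escape(abbr)}(?=\s)", full, text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Hard cut at the limit; a real implementation might cut at a sentence end.
    return text[:max_chars]


if __name__ == "__main__":
    print(preprocess_text("Dr.  Smith \n arrived."))
```

Expanding abbreviations before synthesis keeps the trailing period from being treated as a sentence boundary, which matters because pauses are inserted between sentences.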
Project Structure
kokoro-tts-app/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies (for HF Spaces)
└── README.md           # This documentation
Code Organization (app.py)
# Section 1: Configuration & Constants
VOICE_CATALOG = {...}           # Voice definitions
STYLE_PRESETS = {...}           # Style preset configurations
# Section 2: Audio Processing Utilities
pitch_shift_audio()             # Pitch manipulation
insert_pauses()                 # Silence injection
normalize_audio()               # Volume normalization
preprocess_text()               # Text cleaning
# Section 3: TTS Engine
class KokoroTTSEngine:          # Main TTS wrapper
    generate()                  # Basic generation
    generate_with_style()       # Style-based generation
# Section 4: Gradio Interface
create_voice_choices()          # UI helper functions
generate_speech()               # Main generation callback
demo = gr.Blocks(...)           # Interface definition
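Two of the utilities named above, `insert_pauses()` and `normalize_audio()`, can be sketched as follows. The bodies are assumptions written for illustration, using plain Python lists of float samples to stay dependency-free; the versions in app.py most likely operate on NumPy arrays and may have different signatures. The -3 dB peak target and the 24 kHz rate come from this README.

```python
SAMPLE_RATE = 24_000  # Hz


def insert_pauses(segments, pause_ms, sample_rate=SAMPLE_RATE):
    """Join audio segments with pause_ms of silence between neighbours."""
    silence = [0.0] * round(sample_rate * pause_ms / 1000)
    joined = []
    for index, segment in enumerate(segments):
        if index > 0:
            joined.extend(silence)  # no silence before the first segment
        joined.extend(segment)
    return joined


def normalize_audio(samples, target_db=-3.0):
    """Scale samples so the loudest one sits at target_db relative to full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    target_amplitude = 10.0 ** (target_db / 20.0)  # -3 dB is about 0.708
    gain = target_amplitude / peak
    return [s * gain for s in samples]


if __name__ == "__main__":
    audio = insert_pauses([[0.5, -0.25], [0.1]], pause_ms=1, sample_rate=1000)
    print(audio)
    print(max(abs(s) for s in normalize_audio(audio)))
```

Peak normalization leaves headroom below full scale, so later processing or format conversion is less likely to clip.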
Technical Details
Model Specifications
| Attribute | Value |
|---|---|
| Model | Kokoro-82M |
| Parameters | 82 million |
| Model Size | ~330 MB |
| Sample Rate | 24,000 Hz |
| Audio Format | 32-bit float WAV |
| Languages | English (US & UK), Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese |
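The figures in this table translate directly into duration and size estimates, and the same sketch shows how the model is typically driven through the `kokoro` package. Assumptions to note: the helper names are hypothetical, the output is taken to be mono, and `synthesize()` needs `pip install kokoro` plus a model download on first use, so it is imported lazily and not exercised here.

```python
SAMPLE_RATE = 24_000   # Hz, from the table above
BYTES_PER_SAMPLE = 4   # 32-bit float


def audio_duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of a clip in seconds given its sample count."""
    return num_samples / sample_rate


def audio_data_bytes(duration_s: float, channels: int = 1) -> int:
    """Approximate size of the raw sample data (WAV header excluded); mono assumed."""
    return round(duration_s * SAMPLE_RATE * BYTES_PER_SAMPLE * channels)


def synthesize(text: str, voice: str = "af_heart", speed: float = 1.0):
    """Yield audio chunks from Kokoro for the given text and voice."""
    from kokoro import KPipeline  # lazy import: optional heavy dependency

    # The first letter of the voice ID doubles as the language code ('a' or 'b').
    pipeline = KPipeline(lang_code=voice[0])
    for _graphemes, _phonemes, audio in pipeline(text, voice=voice, speed=speed):
        yield audio


if __name__ == "__main__":
    print(audio_duration_seconds(48_000), "seconds")
    print(audio_data_bytes(10), "bytes for 10 seconds")
```

At these settings, one second of audio occupies about 96 kB, so a minute-long clip is under 6 MB.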
Resource Requirements
| Environment | CPU | RAM | Notes |
|---|---|---|---|
| HF Spaces (Free) | 2 vCPU | 16 GB | Recommended |
| Local (Minimum) | 2 cores | 4 GB | Functional |
| Local (Recommended) | 4 cores | 8 GB | Faster inference |
Performance Benchmarks (CPU)
| Text Length | Approx. Generation Time |
|---|---|
| 100 chars | 2-4 seconds |
| 500 chars | 8-15 seconds |
| 1000 chars | 15-30 seconds |
| 5000 chars | 60-120 seconds |
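These timings depend heavily on hardware, so it can be useful to measure them locally. The helper below is a hypothetical sketch: it times any callable that takes text and reports characters processed per second. The `fake_generate` function stands in for the app's real generation call, which is not shown in this README.

```python
import time


def time_generation(generate, text):
    """Run generate(text) and return (result, elapsed_seconds, chars_per_second)."""
    start = time.perf_counter()
    result = generate(text)
    elapsed = time.perf_counter() - start
    chars_per_second = len(text) / elapsed if elapsed > 0 else float("inf")
    return result, elapsed, chars_per_second


if __name__ == "__main__":
    def fake_generate(text):
        # Placeholder workload; replace with the real synthesis call.
        time.sleep(0.05)
        return text.upper()

    _, elapsed, rate = time_generation(fake_generate, "hello world " * 10)
    print(f"{elapsed:.3f} s, {rate:.0f} chars/s")
```

By this measure, the table above corresponds to roughly 25 to 85 characters per second on CPU.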
Academic Use Cases
This project is designed for learning and demonstration:
- Understanding TTS Pipelines: Explore how text is converted to phonemes and then to audio
- Audio Signal Processing: Learn about pitch shifting, normalization, and pause insertion
- ML Model Deployment: Practice deploying models on Hugging Face Spaces
- UI/UX Design: Build intuitive interfaces with Gradio
- Code Organization: Study modular, well-documented Python code
Learning Resources
- Kokoro Model Card - Official model documentation
- Misaki G2P - Grapheme-to-phoneme library
- Gradio Documentation - UI framework
- Hugging Face Spaces - Deployment platform
Contributing
This is an academic project, but contributions are welcome:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
License
This project is licensed under the Apache License 2.0, the same license as the Kokoro-82M model.
Copyright 2024 Academic Project
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Acknowledgments
- hexgrad - Creator of Kokoro-82M
- Hugging Face - Model hosting and Spaces platform
- Gradio - Web interface framework
Support
For questions or issues:
- Check the Kokoro Discussions
- Review the Gradio Docs
- Open an issue in this repository
Built with ❤️ for academic learning and open-source AI