title: Text-to-Speech
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: '3.10'
suggested_hardware: cpu-basic
short_description: High-quality TTS with 28 voices and style controls
tags:
  - text-to-speech
  - tts
  - kokoro
  - audio

🎙️ Kokoro TTS - Academic Text-to-Speech Application

Created by Yash Chowdhary

Hugging Face Spaces | License: Apache 2.0 | Python 3.10

A comprehensive, open-source Text-to-Speech application built for academic learning and demonstration. Powered by the Kokoro-82M model, a lightweight yet high-quality TTS system with 82 million parameters.


✨ Features

🎭 28 Built-in Voices

  • 20 American English voices (11 female, 9 male)
  • 8 British English voices (4 female, 4 male)
  • Quality grades from A (premium) to F+ (basic), as listed in the Voice Reference
  • Each voice has unique characteristics and recommended use cases

🎨 7 Style Presets

| Style | Description | Best For |
|---|---|---|
| Neutral Narrator | Clear, balanced narration | General content, documentation |
| Dramatic / Horror | Slower, deeper, suspenseful | Horror stories, dramatic readings |
| Excited / Surprised | Faster, higher energy | Announcements, exciting content |
| Calm / Meditative | Slow, soothing | Meditation guides, ASMR |
| Storyteller | Engaging narrative pace | Audiobooks, bedtime stories |
| Professional | Clear, authoritative | Business, corporate content |
| Cheerful / Friendly | Warm, upbeat | Tutorials, friendly explanations |

⚙️ Full Audio Control

  • Speed: 0.5x (slow) to 2.0x (fast)
  • Pitch: -5 to +5 semitones
  • Pauses: 0-1000 ms between sentences
  • Real-time audio preview with download capability

🚀 Quick Start

Option 1: Hugging Face Spaces (Recommended)

  1. Go to Hugging Face
  2. Create a new Space with Gradio SDK
  3. Upload all files from this repository
  4. The Space will automatically install dependencies and launch

Option 2: Local Installation

# Clone or download this repository
git clone <your-repo-url>
cd kokoro-tts-app

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install system dependencies (Linux/Ubuntu)
sudo apt-get update
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Install Python dependencies
pip install -r requirements.txt

# Run the application
python app.py

The app will be available at http://localhost:7860

Option 3: Docker

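FROM python:3.10-slim

# Work in a dedicated directory instead of the image root
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    espeak-ng \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app.py .

# Listen on all interfaces so the app is reachable from outside the container.
# Gradio reads this variable only if app.py does not pass server_name itself.
ENV GRADIO_SERVER_NAME=0.0.0.0
EXPOSE 7860

# Run
# Build: docker build -t kokoro-tts .   (the image name is only an example)
# Start: docker run -p 7860:7860 kokoro-tts
CMD ["python", "app.py"]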

📖 Usage Guide

Basic Usage

  1. Enter Text: Type or paste your text (up to 5000 characters)
  2. Select Voice: Choose from 28 available voices
  3. Pick a Style: Select a style preset that matches your content
  4. Generate: Click the "Generate Speech" button
  5. Download: Use the download button on the audio player

Using Style Presets

Style presets automatically configure speed, pitch, and pause settings:

✅ "Use Style Preset Defaults" checked → Style settings applied
❌ "Use Style Preset Defaults" unchecked → Manual controls active
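
The toggle can be read as a simple precedence rule: preset values win when the box is checked, manual values otherwise. The following is a minimal sketch of that rule, not the code in `app.py`. The `Settings` class, `resolve_settings`, and the preset numbers are all hypothetical and chosen only for illustration; the slider ranges come from the Full Audio Control section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    speed: float   # speed multiplier, 0.5 to 2.0
    pitch: float   # semitones, -5 to +5
    pause_ms: int  # pause between sentences, 0 to 1000


# Placeholder values for illustration only; the real presets live in app.py.
EXAMPLE_PRESETS = {
    "Neutral Narrator": Settings(speed=1.0, pitch=0.0, pause_ms=300),
    "Calm / Meditative": Settings(speed=0.8, pitch=-1.0, pause_ms=700),
}


def clamp(value, low, high):
    """Keep a value inside the range the UI sliders allow."""
    return max(low, min(high, value))


def resolve_settings(style, use_preset, manual):
    """Return preset values when the box is checked, else clamped manual values."""
    if use_preset:
        return EXAMPLE_PRESETS[style]
    return Settings(
        speed=clamp(manual.speed, 0.5, 2.0),
        pitch=clamp(manual.pitch, -5.0, 5.0),
        pause_ms=int(clamp(manual.pause_ms, 0, 1000)),
    )
```

Clamping the manual values is an extra safeguard assumed here; the README does not say whether the app validates slider input.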

Advanced Customization

Uncheck "Use Style Preset Defaults" to manually control:

  • Speed: Lower values = slower, more deliberate speech
  • Pitch: Negative = deeper voice, Positive = higher voice
  • Pause: Higher values = longer pauses between sentences
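
For orientation, a shift of n semitones multiplies frequency by 2^(n/12) in twelve-tone equal temperament. The sketch below shows only that relationship. How `app.py` performs the shift is not described in this README, and `semitones_to_ratio` is a hypothetical helper name.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal temperament: 2 ** (n / 12)."""
    return 2.0 ** (semitones / 12.0)


# The slider range of -5 to +5 semitones spans roughly 0.75x to 1.33x in frequency.
for n in (-5, 0, 5, 12):
    print(f"{n:+d} semitones -> x{semitones_to_ratio(n):.3f}")
```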

Sample Texts

The app includes sample texts for different scenarios:

  • Welcome: General introduction text
  • Horror: Spooky story excerpt (pair with Dramatic style)
  • News: News broadcast style
  • Story: Fairy tale opening (pair with Storyteller style)
  • Technical: Technical documentation

🎭 Voice Reference

American English - Female (11 voices)

| Voice ID | Name | Grade | Description |
|---|---|---|---|
| af_heart | Heart ❤️ | A | Premium quality, warm and natural |
| af_bella | Bella 🔥 | A- | Clear and expressive |
| af_nicole | Nicole 🎧 | B- | Professional narrator style |
| af_aoede | Aoede | C+ | Melodic and pleasant |
| af_kore | Kore | C+ | Youthful and energetic |
| af_sarah | Sarah | C+ | Friendly and approachable |
| af_nova | Nova | C | Modern and crisp |
| af_sky | Sky | C- | Light and airy |
| af_alloy | Alloy | C | Balanced and versatile |
| af_jessica | Jessica | D | Casual conversational |
| af_river | River | D | Gentle and flowing |

American English - Male (9 voices)

| Voice ID | Name | Grade | Description |
|---|---|---|---|
| am_michael | Michael | C+ | Authoritative and clear |
| am_fenrir | Fenrir | C+ | Deep and resonant |
| am_puck | Puck | C+ | Playful and dynamic |
| am_echo | Echo | D | Warm and reflective |
| am_eric | Eric | D | Professional and steady |
| am_liam | Liam | D | Young and natural |
| am_onyx | Onyx | D | Rich and smooth |
| am_santa | Santa 🎅 | D- | Jolly and festive |
| am_adam | Adam | F+ | Basic male voice |

British English - Female (4 voices)

| Voice ID | Name | Grade | Description |
|---|---|---|---|
| bf_emma | Emma | B- | Elegant British accent |
| bf_isabella | Isabella | C | Sophisticated and refined |
| bf_alice | Alice | D | Classic British tone |
| bf_lily | Lily | D | Soft and gentle |

British English - Male (4 voices)

| Voice ID | Name | Grade | Description |
|---|---|---|---|
| bm_george | George | C | Distinguished gentleman |
| bm_fable | Fable | C | Storyteller quality |
| bm_lewis | Lewis | D+ | Conversational British |
| bm_daniel | Daniel | D | Standard British male |
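
The voice IDs in these tables follow a visible pattern: the first letter marks the accent (`a` American, `b` British), the second marks the gender (`f` or `m`), and the name follows the underscore. A small sketch that decodes an ID under that assumption is shown below. `parse_voice_id` is a hypothetical helper and is not taken from `app.py`.

```python
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "Female", "m": "Male"}


def parse_voice_id(voice_id: str) -> dict:
    """Split an ID such as 'bf_emma' into language code, gender, and name."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"Unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"Unknown prefix in voice ID: {voice_id!r}")
    return {
        "lang_code": lang_code,  # 'a' or 'b', the pipeline codes named under Architecture
        "language": LANGUAGES[lang_code],
        "gender": GENDERS[gender_code],
        "name": name.capitalize(),
    }


print(parse_voice_id("bf_emma"))
```

Deriving the language code from the ID is one way to pick the matching pipeline for a voice. Whether the app does this or stores the code in its voice catalog is not stated here.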

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Gradio Web Interface                     │
│  ┌──────────────┬───────────────┬─────────────────────────┐ │
│  │  Text Input  │ Voice/Style   │   Advanced Controls     │ │
│  │              │   Selection   │  (Speed/Pitch/Pause)    │ │
│  └──────────────┴───────────────┴─────────────────────────┘ │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Text Preprocessing                      │
│  • Normalize abbreviations (Dr. → Doctor)                   │
│  • Clean whitespace                                         │
│  • Character limit enforcement                              │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       KokoroTTSEngine                       │
│  • KPipeline (American 'a' / British 'b')                   │
│  • Voice pack loading                                       │
│  • Phoneme conversion via Misaki G2P                        │
│  • Neural audio synthesis                                   │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Audio Post-Processing                    │
│  • Pitch shifting (semitone adjustment)                     │
│  • Pause insertion between segments                         │
│  • Audio normalization (-3dB peak)                          │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Audio Output (24kHz WAV)                   │
│  • Playback in browser                                      │
│  • Download capability                                      │
└─────────────────────────────────────────────────────────────┘
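
Two of the post-processing steps in the diagram, pause insertion and peak normalization, can be sketched in a few lines. The version below uses plain Python lists to stay dependency-free, so it is an illustration of the arithmetic rather than the app's implementation, which presumably works on arrays. `join_with_pauses` and `normalize_peak` are hypothetical names.

```python
SAMPLE_RATE = 24_000  # Hz, the output rate stated in this README


def join_with_pauses(segments, pause_ms, sample_rate=SAMPLE_RATE):
    """Concatenate audio segments with pause_ms of silence between them."""
    silence = [0.0] * (sample_rate * pause_ms // 1000)
    out = []
    for i, segment in enumerate(segments):
        if i > 0:
            out.extend(silence)  # silence goes between segments, not before the first
        out.extend(segment)
    return out


def normalize_peak(samples, target_db=-3.0):
    """Scale samples so the largest absolute value sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target = 10.0 ** (target_db / 20.0)  # -3 dB is about 0.708 of full scale
    return [s * target / peak for s in samples]
```

At 24 kHz, the maximum pause of 1000 ms corresponds to 24,000 samples of silence per sentence boundary.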

📁 Project Structure

kokoro-tts-app/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies (for HF Spaces)
└── README.md           # This documentation

Code Organization (app.py)

# Section 1: Configuration & Constants
VOICE_CATALOG = {...}       # Voice definitions
STYLE_PRESETS = {...}       # Style preset configurations

# Section 2: Audio Processing Utilities
pitch_shift_audio()         # Pitch manipulation
insert_pauses()             # Silence injection
normalize_audio()           # Volume normalization
preprocess_text()           # Text cleaning

# Section 3: TTS Engine
class KokoroTTSEngine:      # Main TTS wrapper
    generate()              # Basic generation
    generate_with_style()   # Style-based generation

# Section 4: Gradio Interface
create_voice_choices()      # UI helper functions
generate_speech()           # Main generation callback
demo = gr.Blocks(...)       # Interface definition
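
To make the outline more concrete, here is a rough sketch of what a text-cleaning step like `preprocess_text()` might look like, based on the three preprocessing tasks named in the Architecture section. It is not the function from `app.py`. The abbreviation table is an illustrative subset (only "Dr." is mentioned in this README), and truncating at the 5000-character limit is an assumption, since the app may reject long input instead.

```python
import re

MAX_CHARS = 5000  # limit stated in the Usage Guide

# Illustrative subset; the abbreviations handled by app.py are not listed here.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "Prof.": "Professor",
}


def preprocess_text_sketch(text: str, max_chars: int = MAX_CHARS) -> str:
    """Expand abbreviations, collapse whitespace, and enforce the length limit."""
    for abbr, full in ABBREVIATIONS.items():
        # \b stops 'Dr.' from matching when it follows another word character
        text = re.sub(rf"\b{re.escape(abbr)}", full, text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]  # assumption: over-long input is truncated


print(preprocess_text_sketch("  Dr.   Smith \n arrived. "))
```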

🔧 Technical Details

Model Specifications

| Attribute | Value |
|---|---|
| Model | Kokoro-82M |
| Parameters | 82 million |
| Model Size | ~330 MB |
| Sample Rate | 24,000 Hz |
| Audio Format | 32-bit float WAV |
| Languages | English (US & UK), Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese |
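
The parameter count and the model size are consistent with each other if the weights are stored as 32-bit floats. That storage format is an assumption made here, not something this README states. A quick check of the arithmetic:

```python
PARAMETERS = 82_000_000  # from the specifications above
BYTES_PER_PARAM = 4      # assumption: float32 weights

size_mb = PARAMETERS * BYTES_PER_PARAM / 1_000_000
print(f"Approximate weight size: {size_mb:.0f} MB")  # prints 328 MB
```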

Resource Requirements

| Environment | CPU | RAM | Notes |
|---|---|---|---|
| HF Spaces (Free) | 2 vCPU | 16 GB | Recommended |
| Local (Minimum) | 2 cores | 4 GB | Functional |
| Local (Recommended) | 4 cores | 8 GB | Faster inference |

Performance Benchmarks (CPU)

| Text Length | Approx. Generation Time |
|---|---|
| 100 chars | 2-4 seconds |
| 500 chars | 8-15 seconds |
| 1000 chars | 15-30 seconds |
| 5000 chars | 60-120 seconds |
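
These figures can be restated as throughput, which makes the rows easier to compare. The snippet below only rearranges the numbers from the benchmarks above; `throughput_range` is a hypothetical helper, and actual speed will depend on hardware.

```python
# (characters, fastest seconds, slowest seconds) copied from the benchmarks above
BENCHMARKS = [
    (100, 2, 4),
    (500, 8, 15),
    (1000, 15, 30),
    (5000, 60, 120),
]


def throughput_range(chars, fastest_s, slowest_s):
    """Return (low, high) characters per second for one benchmark row."""
    return chars / slowest_s, chars / fastest_s


for chars, fastest_s, slowest_s in BENCHMARKS:
    low, high = throughput_range(chars, fastest_s, slowest_s)
    print(f"{chars:>5} chars: {low:.0f}-{high:.0f} chars/s")
```

By these numbers, short inputs come out slower per character than long ones, which would be consistent with some fixed per-request overhead, though the README does not break the timings down.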

🎓 Academic Use Cases

This project is designed for learning and demonstration:

  1. Understanding TTS Pipelines: Explore how text is converted to phonemes and then to audio
  2. Audio Signal Processing: Learn about pitch shifting, normalization, and pause insertion
  3. ML Model Deployment: Practice deploying models on Hugging Face Spaces
  4. UI/UX Design: Build intuitive interfaces with Gradio
  5. Code Organization: Study modular, well-documented Python code

📚 Learning Resources


🀝 Contributing

This is an academic project, but contributions are welcome:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is licensed under the Apache License 2.0, the same license as the Kokoro-82M model.

Copyright 2024 Academic Project

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

🙏 Acknowledgments


📞 Support

For questions or issues:

  1. Check the Kokoro Discussions
  2. Review the Gradio Docs
  3. Open an issue in this repository

Built with ❤️ for academic learning and open-source AI