title: Text-to-Speech
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: '3.10'
suggested_hardware: cpu-basic
short_description: High-quality TTS with 28 voices and style controls
tags:
  - text-to-speech
  - tts
  - kokoro
  - audio

🎙️ Kokoro TTS - Academic Text-to-Speech Application

Created by Yash Chowdhary

A comprehensive, open-source Text-to-Speech application built for academic learning and demonstration. It is powered by Kokoro-82M, a lightweight yet high-quality TTS model with 82 million parameters.

✨ Features

🎭 28 Built-in Voices

20 American English voices (11 female, 9 male)
8 British English voices (4 female, 4 male)
Quality grades from A (premium) to F+ (basic), as listed in the Voice Reference tables
Each voice has unique characteristics and recommended use cases

🎨 7 Style Presets

Style	Description	Best For
Neutral Narrator	Clear, balanced narration	General content, documentation
Dramatic / Horror	Slower, deeper, suspenseful	Horror stories, dramatic readings
Excited / Surprised	Faster, higher energy	Announcements, exciting content
Calm / Meditative	Slow, soothing	Meditation guides, ASMR
Storyteller	Engaging narrative pace	Audiobooks, bedtime stories
Professional	Clear, authoritative	Business, corporate content
Cheerful / Friendly	Warm, upbeat	Tutorials, friendly explanations

⚙️ Full Audio Control

Speed: 0.5x (slow) to 2.0x (fast)
Pitch: -5 to +5 semitones
Pauses: 0-1000 ms of silence between sentences
Real-time audio preview with download capability
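
The documented ranges could be enforced with a small clamping helper before synthesis. This is a minimal sketch, not the code in app.py: `AudioSettings` and `clamp_settings` are hypothetical names, and only the numeric ranges come from the list above.

```python
from dataclasses import dataclass

# Ranges taken from the control list above; all names here are hypothetical.
SPEED_RANGE = (0.5, 2.0)      # playback speed multiplier
PITCH_RANGE = (-5.0, 5.0)     # semitones
PAUSE_RANGE_MS = (0, 1000)    # silence between sentences, in milliseconds


@dataclass(frozen=True)
class AudioSettings:
    speed: float = 1.0
    pitch: float = 0.0
    pause_ms: int = 0


def _clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def clamp_settings(speed: float, pitch: float, pause_ms: int) -> AudioSettings:
    """Return settings forced into the documented ranges."""
    return AudioSettings(
        speed=_clamp(float(speed), *SPEED_RANGE),
        pitch=_clamp(float(pitch), *PITCH_RANGE),
        pause_ms=int(_clamp(pause_ms, *PAUSE_RANGE_MS)),
    )
```

Clamping silently corrects out-of-range values; rejecting them with an error would be an equally reasonable design.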

🚀 Quick Start

Option 1: Hugging Face Spaces (Recommended)

Go to Hugging Face
Create a new Space with Gradio SDK
Upload all files from this repository
The Space will automatically install dependencies and launch

Option 2: Local Installation

# Clone or download this repository
git clone <your-repo-url>
cd kokoro-tts-app

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install system dependencies (Linux/Ubuntu)
sudo apt-get update
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Install Python dependencies
pip install -r requirements.txt

# Run the application
python app.py

The app will be available at http://localhost:7860

Option 3: Docker

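FROM python:3.10-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    espeak-ng \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

# Work in a dedicated directory instead of the image root
WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY app.py .

# Listen on all interfaces so the app is reachable from outside the container
# (assumes app.py does not set server_name itself in launch())
ENV GRADIO_SERVER_NAME=0.0.0.0
EXPOSE 7860

# Run
CMD ["python", "app.py"]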

📖 Usage Guide

Basic Usage

Enter Text: Type or paste your text (up to 5000 characters)
Select Voice: Choose from 28 available voices
Pick a Style: Select a style preset that matches your content
Generate: Click the "Generate Speech" button
Download: Use the download button on the audio player

Using Style Presets

Style presets automatically configure speed, pitch, and pause settings:

✅ "Use Style Preset Defaults" checked → Style settings applied
❌ "Use Style Preset Defaults" unchecked → Manual controls active
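
The checkbox behaviour described above can be sketched as a single lookup. The preset values below are hypothetical placeholders chosen only to match the table's descriptions (for example "slower, deeper"); the real `STYLE_PRESETS` in app.py may use different keys and numbers, and `resolve_settings` is an assumed helper name.

```python
# Hypothetical preset values; the real STYLE_PRESETS in app.py may differ.
STYLE_PRESETS = {
    "Neutral Narrator": {"speed": 1.0, "pitch": 0, "pause_ms": 300},
    "Dramatic / Horror": {"speed": 0.8, "pitch": -2, "pause_ms": 700},
    "Excited / Surprised": {"speed": 1.2, "pitch": 2, "pause_ms": 200},
}


def resolve_settings(style, use_preset_defaults, speed, pitch, pause_ms):
    """Pick preset values when the checkbox is on, else the manual sliders."""
    if use_preset_defaults:
        # Unknown style names fall back to the neutral preset.
        preset = STYLE_PRESETS.get(style, STYLE_PRESETS["Neutral Narrator"])
        return dict(preset)  # copy so callers cannot mutate the preset table
    return {"speed": speed, "pitch": pitch, "pause_ms": pause_ms}
```

With the box checked, the slider values are ignored entirely; with it unchecked, the style name has no effect on these three settings.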

Advanced Customization

Uncheck "Use Style Preset Defaults" to manually control:

Speed: Lower values = slower, more deliberate speech
Pitch: Negative = deeper voice, Positive = higher voice
Pause: Higher values = longer pauses between sentences
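
For intuition about the pitch control: in equal temperament, shifting by n semitones multiplies frequency by 2^(n/12). The helper below is a hypothetical illustration of that relationship only; this README does not show how app.py performs the actual shift.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


# Print the ratio at both ends of the slider and at its neutral position.
for n in (-5, 0, 5):
    print(f"{n:+d} semitones -> x{semitones_to_ratio(n):.3f}")
```

At the extremes of the slider, -5 semitones scales frequency by about 0.749 and +5 semitones by about 1.335.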

Sample Texts

The app includes sample texts for different scenarios:

Welcome: General introduction text
Horror: Spooky story excerpt (pair with Dramatic style)
News: News broadcast style
Story: Fairy tale opening (pair with Storyteller style)
Technical: Technical documentation

🎭 Voice Reference

American English - Female (11 voices)

Voice ID	Name	Grade	Description
`af_heart`	Heart ❤️	A	Premium quality, warm and natural
`af_bella`	Bella 🔥	A-	Clear and expressive
`af_nicole`	Nicole 🎧	B-	Professional narrator style
`af_aoede`	Aoede	C+	Melodic and pleasant
`af_kore`	Kore	C+	Youthful and energetic
`af_sarah`	Sarah	C+	Friendly and approachable
`af_nova`	Nova	C	Modern and crisp
`af_alloy`	Alloy	C	Balanced and versatile
`af_sky`	Sky	C-	Light and airy
`af_jessica`	Jessica	D	Casual conversational
`af_river`	River	D	Gentle and flowing

American English - Male (9 voices)

Voice ID	Name	Grade	Description
`am_michael`	Michael	C+	Authoritative and clear
`am_fenrir`	Fenrir	C+	Deep and resonant
`am_puck`	Puck	C+	Playful and dynamic
`am_echo`	Echo	D	Warm and reflective
`am_eric`	Eric	D	Professional and steady
`am_liam`	Liam	D	Young and natural
`am_onyx`	Onyx	D	Rich and smooth
`am_santa`	Santa 🎅	D-	Jolly and festive
`am_adam`	Adam	F+	Basic male voice

British English - Female (4 voices)

Voice ID	Name	Grade	Description
`bf_emma`	Emma	B-	Elegant British accent
`bf_isabella`	Isabella	C	Sophisticated and refined
`bf_alice`	Alice	D	Classic British tone
`bf_lily`	Lily	D	Soft and gentle

British English - Male (4 voices)

Voice ID	Name	Grade	Description
`bm_george`	George	C	Distinguished gentleman
`bm_fable`	Fable	C	Storyteller quality
`bm_lewis`	Lewis	D+	Conversational British
`bm_daniel`	Daniel	D	Standard British male
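
The voice IDs in these tables follow a visible pattern: the first letter marks the accent (`a` American, `b` British), the second the gender (`f` or `m`), and the part after the underscore is the voice name. A minimal sketch of decoding that pattern follows; `parse_voice_id` and `VoiceInfo` are hypothetical names, not functions from app.py.

```python
from typing import NamedTuple

LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


class VoiceInfo(NamedTuple):
    lang_code: str   # 'a' or 'b'
    language: str
    gender: str
    name: str


def parse_voice_id(voice_id: str) -> VoiceInfo:
    """Split an ID such as 'af_heart' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown prefix in voice ID: {voice_id!r}")
    return VoiceInfo(lang_code, LANGUAGES[lang_code], GENDERS[gender_code], name)
```

For example, `parse_voice_id("bm_george")` yields the language code `b`, which could then be used to choose between an American and a British pipeline.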

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Gradio Web Interface                      │
│  ┌──────────────┬───────────────┬─────────────────────────┐ │
│  │  Text Input  │ Voice/Style   │   Advanced Controls     │ │
│  │              │   Selection   │  (Speed/Pitch/Pause)    │ │
│  └──────────────┴───────────────┴─────────────────────────┘ │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Text Preprocessing                          │
│  • Normalize abbreviations (Dr. → Doctor)                   │
│  • Clean whitespace                                          │
│  • Character limit enforcement                               │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 KokoroTTSEngine                              │
│  • KPipeline (American 'a' / British 'b')                   │
│  • Voice pack loading                                        │
│  • Phoneme conversion via Misaki G2P                        │
│  • Neural audio synthesis                                    │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                Audio Post-Processing                         │
│  • Pitch shifting (semitone adjustment)                     │
│  • Pause insertion between segments                         │
│  • Audio normalization (-3dB peak)                          │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Audio Output (24kHz WAV)                        │
│  • Playback in browser                                       │
│  • Download capability                                       │
└─────────────────────────────────────────────────────────────┘
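
The text preprocessing stage in the diagram could look roughly like the sketch below. The function name `preprocess_text` matches the outline under Code Organization, but the body is an assumption: only the "Dr." to "Doctor" expansion and the 5000-character limit come from this README, the other abbreviations are illustrative, and truncation is just one possible way to enforce the limit.

```python
import re

MAX_CHARS = 5000  # limit stated in the Usage Guide

# Only "Dr." -> "Doctor" appears in the diagram; the other entries are
# illustrative assumptions.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "Prof.": "Professor",
}

# Match a listed abbreviation only when it does not follow a word character.
_ABBREV_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)


def preprocess_text(text: str, max_chars: int = MAX_CHARS) -> str:
    """Expand abbreviations, collapse whitespace, and enforce the length limit."""
    text = _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_chars]
```

Expanding abbreviations before phoneme conversion keeps the trailing period of "Dr." from being read as a sentence boundary.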

📁 Project Structure

kokoro-tts-app/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies (for HF Spaces)
└── README.md           # This documentation

Code Organization (app.py)

# Section 1: Configuration & Constants
VOICE_CATALOG = {...}       # Voice definitions
STYLE_PRESETS = {...}       # Style preset configurations

# Section 2: Audio Processing Utilities
pitch_shift_audio()         # Pitch manipulation
insert_pauses()             # Silence injection
normalize_audio()           # Volume normalization
preprocess_text()           # Text cleaning

# Section 3: TTS Engine
class KokoroTTSEngine:      # Main TTS wrapper
    generate()              # Basic generation
    generate_with_style()   # Style-based generation

# Section 4: Gradio Interface
create_voice_choices()      # UI helper functions
generate_speech()           # Main generation callback
demo = gr.Blocks(...)       # Interface definition
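
To make the audio utilities in Section 2 more concrete, here is a dependency-free sketch of pause insertion and peak normalization. The function names come from the outline above, but the signatures and bodies are assumptions; plain Python lists are used only to keep the example self-contained, and a real implementation would more likely operate on NumPy arrays.

```python
SAMPLE_RATE = 24_000     # Hz, per the Model Specifications table
TARGET_PEAK_DB = -3.0    # peak level named in the architecture diagram


def insert_pauses(segments, pause_ms, sample_rate=SAMPLE_RATE):
    """Join audio segments with pause_ms of silence between them."""
    silence = [0.0] * int(sample_rate * pause_ms / 1000)
    out = []
    for i, segment in enumerate(segments):
        if i > 0:
            out.extend(silence)  # silence goes between segments only
        out.extend(segment)
    return out


def normalize_audio(samples, target_db=TARGET_PEAK_DB):
    """Scale samples so the largest absolute value sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = 10 ** (target_db / 20) / peak
    return [s * gain for s in samples]
```

A -3 dB peak corresponds to a linear amplitude of about 0.708, which leaves some headroom below full scale.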

🔧 Technical Details

Model Specifications

Attribute	Value
Model	Kokoro-82M
Parameters	82 million
Model Size	~330 MB
Sample Rate	24,000 Hz
Audio Format	32-bit float WAV
Languages	English (US & UK), Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese
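
The sample rate and sample format above determine how much data the app produces per second of speech. The sketch below works through that arithmetic; it assumes mono output, which this README does not state, and `audio_size_bytes` is a hypothetical helper.

```python
SAMPLE_RATE = 24_000      # Hz, from the table above
BYTES_PER_SAMPLE = 4      # 32-bit float
CHANNELS = 1              # assumption: mono output


def audio_size_bytes(duration_s: float) -> int:
    """Approximate size of the raw sample data, excluding the WAV header."""
    return int(duration_s * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)


# Size in megabytes (10^6 bytes) for one minute of audio.
print(audio_size_bytes(60) / 1_000_000)
```

Under these assumptions, one minute of speech is about 5.76 MB of sample data.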

Resource Requirements

Environment	CPU	RAM	Notes
HF Spaces (Free)	2 vCPU	16 GB	Recommended
Local (Minimum)	2 cores	4 GB	Functional
Local (Recommended)	4 cores	8 GB	Faster inference

Performance Benchmarks (CPU)

Text Length	Approx. Generation Time
100 chars	2-4 seconds
500 chars	8-15 seconds
1000 chars	15-30 seconds
5000 chars	60-120 seconds

🎓 Academic Use Cases

This project is designed for learning and demonstration:

Understanding TTS Pipelines: Explore how text is converted to phonemes and then to audio
Audio Signal Processing: Learn about pitch shifting, normalization, and pause insertion
ML Model Deployment: Practice deploying models on Hugging Face Spaces
UI/UX Design: Build intuitive interfaces with Gradio
Code Organization: Study modular, well-documented Python code

📚 Learning Resources

Kokoro Model Card - Official model documentation
Misaki G2P - Grapheme-to-phoneme library
Gradio Documentation - UI framework
Hugging Face Spaces - Deployment platform

🤝 Contributing

This is an academic project, but contributions are welcome:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

📄 License

This project is licensed under the Apache License 2.0, the same license as the Kokoro-82M model.

Copyright 2024 Academic Project

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

🙏 Acknowledgments

hexgrad - Creator of Kokoro-82M
Hugging Face - Model hosting and Spaces platform
Gradio - Web interface framework

📞 Support

For questions or issues:

Check the Kokoro Discussions
Review the Gradio Docs
Open an issue in this repository

Built with ❤️ for academic learning and open-source AI