
📚 Kokoro TTS: Complete Technical Guide & Learning Documentation

Created by: Yash Chowdhary
Document Version: 1.0
Last Updated: February 2026


Table of Contents

  1. Introduction
  2. Project Architecture Overview
  3. Understanding Text-to-Speech (TTS)
  4. The Kokoro-82M Model Deep Dive
  5. File-by-File Breakdown
  6. Dependencies & Libraries Explained
  7. The TTS Pipeline: Step-by-Step
  8. Code Walkthrough
  9. Audio Processing Concepts
  10. Gradio Interface Explained
  11. Deployment on Hugging Face Spaces
  12. Troubleshooting & Common Issues
  13. Further Learning Resources
  14. Glossary of Terms

1. Introduction

What is This Project?

This is an academic Text-to-Speech (TTS) application that converts written text into natural-sounding human speech. It's built using:

  • Kokoro-82M: A state-of-the-art, lightweight TTS model
  • Gradio: A Python library for building web interfaces
  • Hugging Face Spaces: Free cloud hosting for ML applications

Why Kokoro?

| Feature | Kokoro-82M | Traditional Large Models |
|---|---|---|
| Parameters | 82 million | 1-3 billion |
| Model Size | ~330 MB | 5-15 GB |
| Quality | Near state-of-the-art | State-of-the-art |
| Speed (CPU) | 3-11× real-time | 0.1-0.5× real-time |
| License | Apache 2.0 (Free) | Often proprietary |

Kokoro proves that smaller models can achieve remarkable quality when properly designed.

Project Goals

  1. Learn how modern TTS systems work
  2. Understand the complete pipeline from text to audio
  3. Build a functional, deployable application
  4. Demonstrate practical ML engineering skills

2. Project Architecture Overview

High-Level System Diagram

┌────────────────────────────────────────────────────────────┐
│                  USER INTERFACE (Gradio)                   │
│                                                            │
│   Text Input │ Voice Select │ Style Preset │ Advanced      │
│              │  (28 voices) │  (7 styles)  │ Speed/Pitch/  │
│              │              │              │ Pause         │
└──────────────────────────────┬─────────────────────────────┘
                               │ User clicks "Generate"
                               ▼
┌────────────────────────────────────────────────────────────┐
│                     TEXT PREPROCESSING                     │
│   1. Clean whitespace and normalize                        │
│   2. Expand abbreviations (Dr. → Doctor, etc.)             │
│   3. Enforce character limits (max 5000 chars)             │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                KOKORO TTS ENGINE (KPipeline)               │
│   STAGE 1: Grapheme-to-Phoneme (G2P) via Misaki            │
│     "Hello world" → "həlˈO wˈɜɹld"                         │
│   STAGE 2: Voice Pack Loading                              │
│     Load speaker embedding (e.g., af_heart.pt → 523 KB)    │
│   STAGE 3: Neural Audio Synthesis                          │
│     StyleTTS2 decoder + ISTFTNet vocoder → waveform        │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                   AUDIO POST-PROCESSING                    │
│   1. Combine audio segments                                │
│   2. Insert pauses between sentences                       │
│   3. Apply pitch shift (if requested)                      │
│   4. Normalize volume to -3 dB peak                        │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                        AUDIO OUTPUT                        │
│   Format: 32-bit float WAV @ 24,000 Hz sample rate         │
│   Playback in browser + download capability                │
└────────────────────────────────────────────────────────────┘

Data Flow Summary

Text (string)
    ↓
Phonemes (IPA symbols)
    ↓
Token IDs (integers)
    ↓
Neural Network Processing
    ↓
Audio Waveform (numpy array)
    ↓
Post-processed Audio (normalized, with pauses)
    ↓
Playable Audio File

3. Understanding Text-to-Speech (TTS)

What is TTS?

Text-to-Speech is the technology that converts written text into spoken audio. Modern TTS systems use deep learning to produce remarkably natural-sounding speech.

The Evolution of TTS

| Generation | Era | Technology | Example |
|---|---|---|---|
| 1st | 1960s-1980s | Rule-based synthesis | DECtalk |
| 2nd | 1990s-2000s | Concatenative (splice recordings) | AT&T Natural Voices |
| 3rd | 2010s | Statistical parametric (HMM) | Festival |
| 4th | 2016+ | Neural networks (deep learning) | Tacotron, WaveNet |
| 5th | 2023+ | Transformer-based | Kokoro, XTTS, Bark |

Key Concepts in Modern TTS

3.1 Graphemes vs Phonemes

Graphemes are the written letters/characters:

"Hello" = H + e + l + l + o (5 graphemes)

Phonemes are the sound units:

"Hello" = /h/ + /Ι™/ + /l/ + /oʊ/ (4 phonemes)

Why phonemes matter: English spelling is inconsistent!

  • "though", "through", "thought", "tough" β€” all different sounds for "ough"
  • The model needs consistent sound representations, not arbitrary spellings
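
You can see this directly with Misaki's G2P (a quick sketch assuming misaki[en] is installed, as in this project; the exact phoneme strings may vary by version):

from misaki import en

# Same G2P configuration used later in Section 6.2
g2p = en.G2P(trf=False, british=False, fallback=None)
for word in ["though", "through", "thought", "tough"]:
    phonemes, _ = g2p(word)
    print(f"{word!r} -> {phonemes}")  # four different vowel/consonant endings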

3.2 The TTS Pipeline (Traditional)

┌──────────┐    ┌─────────────┐    ┌───────────┐    ┌─────────┐
│   Text   │───▶│ Text        │───▶│ Acoustic  │───▶│ Vocoder │───▶ Audio
│          │    │ Analysis    │    │ Model     │    │         │
└──────────┘    └─────────────┘    └───────────┘    └─────────┘
                       │                 │               │
                       ▼                 ▼               ▼
                 - G2P conversion  - Mel spectrograms  - Waveform:
                 - Tokenization    - Duration            from spectrogram
                 - Normalization   - Pitch/prosody       to audio

3.3 Kokoro's Innovation: Decoder-Only Architecture

Traditional TTS uses a two-stage approach:

  1. Encoder: Text β†’ Hidden representation
  2. Decoder: Hidden representation β†’ Audio

Kokoro simplifies this:

  1. Decoder Only: Phonemes β†’ Audio (directly!)

This eliminates computational overhead and reduces model size.


4. The Kokoro-82M Model Deep Dive

4.1 Model Specifications

| Attribute | Value |
|---|---|
| Full Name | Kokoro-82M v1.0 |
| Parameters | 82 million |
| Architecture | StyleTTS2 + ISTFTNet (decoder-only) |
| Input | Phoneme tokens (up to 510 tokens) |
| Output | 24 kHz audio waveform |
| Voice Packs | 54 voices across 8 languages |
| Training Data | <100 hours (v0.19), a few hundred hours (v1.0) |
| Training Cost | ~$1000 total (1000 A100 GPU hours) |
| License | Apache 2.0 |

4.2 Architecture Components

StyleTTS2 (The Brain)

StyleTTS2 is the foundation architecture, published in the paper:

"StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"
(Li et al., 2023) - arXiv:2306.07691

Key innovations:

  • Style as latent variables: Speech style (emotion, prosody) is modeled as random variables
  • Adversarial training: Uses discriminators trained on real speech to improve naturalness
  • No reference audio needed: Can generate appropriate styles from text alone

ISTFTNet (The Voice)

ISTFTNet is the vocoder component, from the paper:

"iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform"
(Kaneko et al., 2022) - arXiv:2203.02395

Key innovations:

  • Direct waveform generation: Uses inverse Short-Time Fourier Transform
  • Lightweight: Much smaller than GAN-based vocoders like HiFi-GAN
  • Fast inference: Optimized for real-time synthesis

How They Work Together

┌────────────────────────────────────────────────────────────────┐
│                    KOKORO-82M ARCHITECTURE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   Phoneme Tokens ──▶ ┌───────────────────────────────────┐     │
│   (from Misaki)      │      StyleTTS2 Transformer        │     │
│                      │      ─────────────────────        │     │
│                      │  • Self-attention layers          │     │
│   Voice Embedding ──▶│  • Style conditioning             │     │
│   (speaker identity) │  • Duration prediction            │     │
│                      │  • Prosody modeling               │     │
│                      └─────────────────┬─────────────────┘     │
│                                        │                       │
│                                        ▼                       │
│                      ┌───────────────────────────────────┐     │
│                      │        ISTFTNet Vocoder           │     │
│                      │        ────────────────           │     │
│                      │  • Mel-spectrogram generation     │     │
│                      │  • Inverse STFT                   │     │
│                      │  • Waveform synthesis             │     │
│                      └─────────────────┬─────────────────┘     │
│                                        │                       │
│                                        ▼                       │
│                         Audio Waveform                         │
│                     (24 kHz, 32-bit float)                     │
│                                                                │
└────────────────────────────────────────────────────────────────┘

4.3 Voice Packs Explained

Each voice is stored as a voice embedding (also called "speaker embedding"):

  • File format: .pt (PyTorch tensor)
  • Size: ~523KB per voice
  • Content: a bank of 256-dimensional style vectors (indexed by input length) that captures speaker identity

# How voice packs work internally (illustrative)
import torch
voice_embedding = torch.load("af_heart.pt")  # shape: (512, 1, 256)
# This embedding tells the model HOW to speak, not WHAT to speak

The naming convention:

a  f  _heart
│  │  └──── voice name
│  └─────── gender (f = female, m = male)
└────────── accent (a = American, b = British)
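
As a small illustration, a helper like the following (hypothetical; not part of app.py) could decode a voice ID into its parts:

def parse_voice_id(voice_id: str) -> dict:
    """Decode a Kokoro voice ID such as 'af_heart' into its parts."""
    ACCENTS = {"a": "American", "b": "British"}
    GENDERS = {"f": "Female", "m": "Male"}
    prefix, name = voice_id.split("_", 1)
    return {
        "accent": ACCENTS.get(prefix[0], "Unknown"),
        "gender": GENDERS.get(prefix[1], "Unknown"),
        "name": name,
    }

print(parse_voice_id("af_heart"))
# {'accent': 'American', 'gender': 'Female', 'name': 'heart'}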

4.4 Why 82M Parameters is Enough

Traditional wisdom: "bigger models = better quality"

Kokoro challenges this by:

  1. Efficient architecture: Decoder-only removes encoder overhead
  2. Phoneme input: G2P preprocessing reduces model's job (doesn't need to learn spelling)
  3. Quality training data: Small but high-quality dataset beats large noisy datasets
  4. Focused scope: Optimized for TTS only, not multi-task

Comparison:

| Model | Parameters | Quality Ranking |
|---|---|---|
| Kokoro-82M | 82M | #1 (TTS Arena) |
| XTTS | 467M | #2-3 |
| MetaVoice | 1.2B | #3-4 |
| Bark | 1B+ | #4-5 |

5. File-by-File Breakdown

Project Structure

kokoro-tts-app/
├── app.py              # Main application (754 lines)
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies
├── README.md           # Documentation + HF Space config
├── examples.py         # Standalone usage examples
└── .gitignore          # Git ignore rules

5.1 app.py – The Main Application

This is the heart of the project. Let's break it down by sections:

Section 1: Imports and Configuration (Lines 1-170)

"""
Kokoro TTS - Academic Text-to-Speech Application
================================================
Created by: Yash Chowdhary
"""

import gradio as gr      # Web interface
import numpy as np       # Numerical operations
import soundfile as sf   # Audio file I/O
import re                # Regular expressions
from typing import Optional, Tuple  # Type hints
from dataclasses import dataclass   # Data structures
from kokoro import KPipeline        # The TTS engine

What each import does:

| Import | Purpose |
|---|---|
| gradio | Creates the web UI (buttons, sliders, audio player) |
| numpy | Handles audio as numerical arrays |
| soundfile | Reads/writes audio files |
| re | Pattern matching for text preprocessing |
| typing | Adds type hints for better code documentation |
| dataclasses | Creates clean data structures (like StylePreset) |
| kokoro.KPipeline | The actual TTS engine |

Section 2: Voice Catalog (Lines 38-74)

VOICE_CATALOG = {
    # voice_id -> (display_name, gender, accent, quality_grade, description)
    "af_heart": ("Heart ❀️", "Female", "American", "A", "Premium quality, warm and natural"),
    "af_bella": ("Bella πŸ”₯", "Female", "American", "A-", "Clear and expressive"),
    # ... 26 more voices
}

This dictionary maps voice IDs to their metadata. The quality grades (A, B, C, D) are from the official Kokoro documentation and reflect training data quality.
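
The dropdown choices shown in the UI can be derived from this catalog. The sketch below (a plausible helper, not necessarily app.py's exact code) builds the (label, value) pairs a Gradio dropdown expects:

def voice_choices(catalog: dict) -> list:
    """Build (display label, voice_id) pairs for a gr.Dropdown."""
    choices = []
    for voice_id, (display, gender, accent, grade, _desc) in catalog.items():
        label = f"{display} ({accent} {gender}, grade {grade})"
        choices.append((label, voice_id))
    return choices

voice_dropdown_choices = voice_choices(VOICE_CATALOG)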

Section 3: Style Presets (Lines 77-145)

@dataclass
class StylePreset:
    """Defines a style preset with associated audio parameters."""
    name: str
    description: str
    speed: float           # 0.5 to 2.0
    pitch_shift: float     # semitones (-5 to +5)
    pause_multiplier: float
    recommended_voices: list

STYLE_PRESETS = {
    "dramatic": StylePreset(
        name="Dramatic / Horror",
        description="Slower, deeper voice for suspenseful content",
        speed=0.85,
        pitch_shift=-2,       # Lower pitch = deeper voice
        pause_multiplier=1.5, # Longer pauses = more tension
        recommended_voices=["am_fenrir", "am_onyx", "bm_george"]
    ),
    # ... more presets
}

Why use @dataclass?

  • Automatically generates __init__, __repr__, and other methods
  • Cleaner than regular classes for data containers
  • Type hints document each field (note: they are not enforced at runtime)
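
Looking up and using a preset is then a one-liner (values taken from the "dramatic" preset defined above):

preset = STYLE_PRESETS["dramatic"]
print(preset.name, preset.speed, preset.pitch_shift, preset.pause_multiplier)
# Dramatic / Horror 0.85 -2 1.5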

Section 4: Audio Processing Functions (Lines 150-275)

These are utility functions for manipulating audio:

pitch_shift_audio() – Changes the pitch without changing speed

def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
    """
    Shift pitch using resampling technique.
    
    How it works:
    1. To raise pitch: Speed up audio, then slow it back down
    2. To lower pitch: Slow down audio, then speed it back up
    
    The math: factor = 2^(semitones/12)
    - 12 semitones = 1 octave = 2x frequency
    - 1 semitone ≈ 1.059x frequency
    """
    factor = 2 ** (semitones / 12)
    # ... resampling logic

insert_pauses() – Adds silence between segments

def insert_pauses(audio_segments: list, pause_duration_ms: int, sample_rate: int):
    """
    Insert silence between audio segments.
    
    pause_duration_ms=300 at 24000Hz = 7200 samples of zeros
    """
    pause_samples = int(sample_rate * pause_duration_ms / 1000)
    silence = np.zeros(pause_samples, dtype=np.float32)
    # ... concatenation logic
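
The elided concatenation logic could look like this sketch (a plausible reconstruction, not necessarily the exact code in app.py):

import numpy as np

def insert_pauses(audio_segments: list, pause_duration_ms: int, sample_rate: int) -> np.ndarray:
    """Join segments with a fixed-length silence between them."""
    pause_samples = int(sample_rate * pause_duration_ms / 1000)
    silence = np.zeros(pause_samples, dtype=np.float32)
    pieces = []
    for i, segment in enumerate(audio_segments):
        if i > 0:
            pieces.append(silence)  # silence before every segment after the first
        pieces.append(segment)
    return np.concatenate(pieces)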

normalize_audio() – Ensures consistent volume

def normalize_audio(audio: np.ndarray, target_db: float = -3.0):
    """
    Normalize to target dB level.
    
    Why -3dB? Leaves headroom to prevent clipping while
    maintaining good volume.
    
    Formula: gain = 10^(target_db/20) / peak_amplitude
    """
    ...  # full implementation shown in Section 9.4
Section 5: TTS Engine Class (Lines 280-410)

class KokoroTTSEngine:
    """
    Wrapper class for Kokoro TTS with additional processing capabilities.
    """
    
    def __init__(self):
        # Initialize pipelines for both accents
        self.pipelines = {
            'a': KPipeline(lang_code='a'),  # American English
            'b': KPipeline(lang_code='b'),  # British English
        }
        
        # Add custom pronunciation for "Kokoro"
        self.pipelines['a'].g2p.lexicon.golds['kokoro'] = 'kˈOkəɹO'

Why two pipelines?

  • American and British English have different phoneme sets
  • Different pronunciations: "schedule" = /ˈskedʒuːl/ (US) vs /ˈʃedjuːl/ (UK)
  • The voice ID's first letter determines which pipeline to use

Section 6: Gradio Interface (Lines 550-754)

with gr.Blocks(
    title="Kokoro TTS - Academic Text-to-Speech",
    theme=gr.themes.Soft(),
    css="..."  # Custom styling
) as demo:
    
    # Header
    gr.Markdown("# πŸŽ™οΈ Kokoro TTS...")
    
    # Input controls
    text_input = gr.Textbox(label="Text to Synthesize")
    voice_dropdown = gr.Dropdown(choices=..., label="Voice")
    
    # Output
    audio_output = gr.Audio(label="Generated Audio")
    
    # Event handler
    generate_btn.click(
        fn=generate_speech,
        inputs=[text_input, voice_dropdown, ...],
        outputs=[audio_output]
    )

5.2 requirements.txt – Python Dependencies

kokoro>=0.9.4          # The TTS model and pipeline
soundfile>=0.12.1      # Audio file reading/writing
numpy>=1.24.0          # Numerical array operations
gradio==5.50.0         # Web interface framework
torch>=2.0.0           # Deep learning framework
torchaudio>=2.0.0      # Audio processing for PyTorch
misaki[en]>=0.9.0      # Grapheme-to-Phoneme conversion

Why these specific versions?

  • gradio==5.50.0: Pinned to avoid breaking changes in Gradio 6.x
  • kokoro>=0.9.4: Requires Python 3.10-3.12 (not 3.13!)
  • misaki[en]: The [en] installs English language support

5.3 packages.txt – System Dependencies

espeak-ng      # Fallback phonemizer for out-of-vocabulary words
ffmpeg         # Audio encoding/decoding
libsndfile1    # C library for reading/writing audio files

Important: the actual file must contain ONLY the package names, one per line. The inline comments shown above are for explanation here and must NOT appear in packages.txt!

5.4 README.md – Documentation + Configuration

The README serves two purposes:

  1. YAML Frontmatter (lines 1-19): Configures Hugging Face Spaces
---
title: Kokoro TTS - Academic Text-to-Speech
emoji: 🎙️
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10"    # Critical! Kokoro needs Python <3.13
suggested_hardware: cpu-basic
---
  2. Documentation (rest of file): User guide and technical details

6. Dependencies & Libraries Explained

6.1 Kokoro (kokoro package)

What it is: The main TTS library that wraps the Kokoro-82M model.

Key classes:

from kokoro import KPipeline, KModel

# KPipeline: High-level interface (recommended)
pipeline = KPipeline(lang_code='a')  # 'a'=American, 'b'=British

# Generate speech
text = "Hello, world!"
for graphemes, phonemes, audio in pipeline(text, voice='af_heart', speed=1.0):
    print(graphemes)   # original text chunk
    print(phonemes)    # IPA representation
    # audio: the audio samples (torch tensor or numpy array)

# KModel: Low-level model access
model = KModel().to('cpu').eval()
audio = model(phoneme_tokens, voice_embedding, speed)

Internal workflow:

KPipeline
    │
    ├─▶ Misaki G2P ──▶ Text to phonemes
    │
    ├─▶ Voice Loader ──▶ Load speaker embedding
    │
    └─▶ KModel ──▶ Generate audio

6.2 Misaki (misaki package)

What it is: Grapheme-to-Phoneme (G2P) library designed for Kokoro.

How it works:

from misaki import en

g2p = en.G2P(trf=False, british=False, fallback=None)
text = "Hello, world!"
phonemes, tokens = g2p(text)
# phonemes: "həlˈO, wˈɜɹld!"
# tokens: list of token IDs for the model

G2P Strategy (Hybrid approach):

Input Text
    │
    ▼
┌─────────────────────────────────────┐
│  1. Dictionary Lookup (Gold/Silver) │ ◄── Known words
└─────────────────────────────────────┘
    │ Unknown word?
    ▼
┌─────────────────────────────────────┐
│  2. Rule-based Fallback             │ ◄── espeak-ng
└─────────────────────────────────────┘
    │ Still unknown?
    ▼
┌─────────────────────────────────────┐
│  3. Neural Network Fallback         │ ◄── BART-based model
└─────────────────────────────────────┘
    │
    ▼
Phoneme Output

Custom pronunciations:

# Use Markdown-style syntax for custom pronunciation
text = "[Kokoro](/kˈOkΙ™ΙΉO/) is a TTS model."
# The /slashes/ contain IPA phonemes

Phoneme inventory:

Misaki uses 49 phonemes for English:

  • 41 shared between US and UK
  • 4 American-only (like the "r" in "car")
  • 4 British-only (like the "ɒ" in "lot")

6.3 PyTorch (torch package)

What it is: The deep learning framework that runs the neural network.

Role in this project:

  • Loads model weights (.pth files)
  • Runs forward passes through the network
  • Handles tensor operations on CPU/GPU
import torch

# Model inference
with torch.no_grad():  # Disable gradient computation (faster)
    audio_tensor = model(phonemes, voice_embedding, speed)
    audio_numpy = audio_tensor.numpy()  # Convert to numpy for playback

6.4 Gradio (gradio package)

What it is: A Python library for building web interfaces for ML models.

Key concepts:

import gradio as gr

# Components (UI elements)
text_input = gr.Textbox(label="Input")
slider = gr.Slider(minimum=0, maximum=1)
audio_output = gr.Audio()

# Blocks (layout container)
with gr.Blocks() as demo:
    with gr.Row():      # Horizontal layout
        with gr.Column():  # Vertical layout
            # Components here
    
    # Event handlers
    button.click(
        fn=my_function,     # Python function to call
        inputs=[text_input],  # Input components
        outputs=[audio_output]  # Output components
    )

# Launch
demo.launch()

Why Gradio?

  • No frontend code needed (HTML/CSS/JS)
  • Automatic API generation
  • Easy deployment to Hugging Face Spaces
  • Built-in audio player with download

6.5 NumPy (numpy package)

What it is: Fundamental library for numerical computing in Python.

Role in audio processing:

import numpy as np

# Audio is represented as a 1D array of floats
audio = np.array([0.0, 0.1, 0.2, -0.1, ...], dtype=np.float32)

# Sample rate: 24000 Hz means 24000 samples = 1 second
# 1 minute of audio = 24000 * 60 = 1,440,000 samples

# Creating silence
silence = np.zeros(24000, dtype=np.float32)  # 1 second of silence

# Concatenating audio
combined = np.concatenate([audio1, silence, audio2])

# Normalization
peak = np.max(np.abs(audio))
normalized = audio / peak * 0.9  # Scale to 90% of max

6.6 SoundFile (soundfile package)

What it is: Library for reading and writing audio files.

import soundfile as sf

# Write audio to file
sf.write('output.wav', audio_array, samplerate=24000)

# Read audio from file
audio, samplerate = sf.read('input.wav')

Supported formats: WAV, FLAC, OGG, and more.


7. The TTS Pipeline: Step-by-Step

Let's trace what happens when you click "Generate Speech":

Step 1: Text Input Received

text = "Hello, world! This is Kokoro speaking."

Step 2: Text Preprocessing

def preprocess_text(text):
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text.strip())
    
    # Expand abbreviations
    text = text.replace("Dr.", "Doctor")
    text = text.replace("Mr.", "Mister")
    # etc.
    
    return text

# Result: "Hello, world! This is Kokoro speaking."

Step 3: Pipeline Selection

voice = "af_heart"  # American female
lang_code = voice[0]  # 'a' = American

pipeline = pipelines['a']  # American English pipeline

Step 4: Grapheme-to-Phoneme Conversion

# Inside the pipeline:
phonemes = pipeline.g2p("Hello, world!")
# Result: "həlˈO, wˈɜɹld!"

What happens inside G2P:

"Hello" 
    β†’ Lookup in dictionary
    β†’ Found: "hΙ™lˈO"

"world"
    β†’ Lookup in dictionary  
    β†’ Found: "wˈɜɹld"

Punctuation preserved: ", !"

Step 5: Tokenization

# Phonemes converted to token IDs
tokens = [50, 157, 43, 135, ...]  # Integer IDs

# Each phoneme has a unique ID (defined in config.json)
# Maximum context: 510 tokens

Step 6: Voice Embedding Loading

# Load the voice pack
voice_pack = pipeline.load_voice("af_heart")
# Result: Tensor of shape (512, 1, 256)

# Select embedding based on token count
ref_s = voice_pack[len(tokens) - 1]

Step 7: Neural Network Forward Pass

# The actual synthesis
audio = model(
    tokens,        # What to say (phoneme IDs)
    ref_s,         # How to say it (voice embedding)
    speed          # How fast (1.0 = normal)
)
# Result: Tensor of audio samples

Inside the model:

Tokens + Voice Embedding
         │
         ▼
┌─────────────────────┐
│ Transformer Layers  │  ← Self-attention, style modeling
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ Duration Predictor  │  ← How long each sound lasts
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ Mel-Spectrogram     │  ← Intermediate representation
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ ISTFTNet Vocoder    │  ← Convert to waveform
└─────────────────────┘
         │
         ▼
    Audio Waveform

Step 8: Audio Post-Processing

# Combine segments
audio_segments = [seg1, seg2, seg3]
combined = insert_pauses(audio_segments, pause_ms=300, sample_rate=24000)

# Apply pitch shift if requested
if pitch_shift != 0:
    combined = pitch_shift_audio(combined, 24000, pitch_shift)

# Normalize volume
combined = normalize_audio(combined, target_db=-3.0)

Step 9: Output

return (24000, combined)  # (sample_rate, audio_array)
# Gradio displays this in the Audio component
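
Putting the nine steps together, a minimal standalone script (a sketch using only the public KPipeline interface, independent of app.py) looks like this:

import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')   # American English

segments = []
for _, phonemes, audio in pipeline("Hello, world! This is Kokoro speaking.",
                                   voice='af_heart', speed=1.0):
    if audio is not None:
        # Convert a torch tensor to numpy if necessary
        segments.append(audio.numpy() if hasattr(audio, 'numpy') else audio)

audio = np.concatenate(segments)
sf.write('hello.wav', audio, samplerate=24000)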

8. Code Walkthrough

8.1 The generate_speech() Function

This is the main callback function that Gradio calls:

def generate_speech(
    text: str,           # User's input text
    voice: str,          # Voice ID like "af_heart"
    style: str,          # Style preset like "dramatic"
    speed: float,        # Speed multiplier
    pitch: float,        # Pitch shift in semitones
    pause: int,          # Pause duration in ms
    use_style_defaults: bool,  # Use preset values?
) -> Tuple[int, np.ndarray]:
    """
    Main generation function for Gradio interface.
    
    Returns:
        Tuple of (sample_rate, audio_array) for Gradio Audio component
    """
    
    # Validation
    if not text.strip():
        gr.Warning("Please enter some text to synthesize.")
        return None
    
    try:
        if use_style_defaults:
            # Use style preset parameters
            sample_rate, audio = tts_engine.generate_with_style(
                text=text,
                voice=voice,
                style_preset=style,
            )
        else:
            # Use manual control parameters
            sample_rate, audio = tts_engine.generate(
                text=text,
                voice=voice,
                speed=speed,
                pitch_shift=pitch,
                pause_between_sentences_ms=pause,
            )
        
        return (sample_rate, audio)
    
    except Exception as e:
        # gr.Error must be raised for Gradio to display it in the UI
        raise gr.Error(f"Generation failed: {str(e)}")

8.2 The KokoroTTSEngine.generate() Method

def generate(
    self,
    text: str,
    voice: str = "af_heart",
    speed: float = 1.0,
    pitch_shift: float = 0.0,
    pause_between_sentences_ms: int = 300,
) -> Tuple[int, np.ndarray]:
    """
    Generate speech from text with full parameter control.
    """
    
    # 1. Preprocess and validate
    text = preprocess_text(text.strip()[:MAX_CHAR_LIMIT])
    if not text:
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    # 2. Clamp parameters to valid ranges
    speed = max(0.5, min(2.0, speed))
    pitch_shift = max(-5, min(5, pitch_shift))
    
    # 3. Select pipeline based on voice accent
    lang_code = voice[0] if voice[0] in self.pipelines else 'a'
    pipeline = self.pipelines[lang_code]
    
    # 4. Generate audio segments
    audio_segments = []
    try:
        # The pipeline yields (graphemes, phonemes, audio) tuples
        for _, phonemes, audio in pipeline(text, voice=voice, speed=speed):
            if audio is not None:
                # Convert PyTorch tensor to numpy if needed
                audio_np = audio.numpy() if hasattr(audio, 'numpy') else audio
                audio_segments.append(audio_np)
    except Exception as e:
        print(f"Generation error: {e}")
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    if not audio_segments:
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    # 5. Post-process: combine with pauses
    combined_audio = insert_pauses(
        audio_segments, 
        pause_between_sentences_ms, 
        SAMPLE_RATE
    )
    
    # 6. Post-process: apply pitch shift
    if pitch_shift != 0:
        combined_audio = pitch_shift_audio(
            combined_audio, 
            SAMPLE_RATE, 
            pitch_shift
        )
    
    # 7. Post-process: normalize volume
    combined_audio = normalize_audio(combined_audio)
    
    return SAMPLE_RATE, combined_audio
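
A typical call against this wrapper (a usage sketch; KokoroTTSEngine is the class defined above in app.py):

import soundfile as sf

engine = KokoroTTSEngine()
sr, audio = engine.generate(
    "Testing, one two three.",
    voice="af_heart",
    speed=1.1,
    pitch_shift=-1.0,
    pause_between_sentences_ms=250,
)
sf.write("test.wav", audio, samplerate=sr)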

8.3 Pitch Shifting Explained

def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
    """
    Shift the pitch of audio by a given number of semitones.
    
    THEORY:
    -------
    Pitch and time are linked. If you play audio faster:
    - Duration decreases
    - Pitch increases (chipmunk effect)
    
    To change pitch WITHOUT changing duration:
    1. Resample to change pitch (also changes duration)
    2. Resample again to restore original duration
    
    MATH:
    -----
    - 1 octave = 12 semitones = 2x frequency
    - factor = 2^(semitones/12)
    - +1 semitone = 2^(1/12) ≈ 1.059x frequency
    - -1 semitone = 2^(-1/12) ≈ 0.944x frequency
    """
    
    if semitones == 0:
        return audio  # No change needed
    
    # Calculate the pitch shift factor
    factor = 2 ** (semitones / 12)
    
    original_length = len(audio)
    
    # Step 1: Resample to shift pitch
    # If factor > 1 (raise pitch), we resample to FEWER samples
    # If factor < 1 (lower pitch), we resample to MORE samples
    new_length = int(original_length / factor)
    
    # Create indices for interpolation
    indices = np.linspace(0, original_length - 1, new_length)
    
    # Linear interpolation (simple but effective)
    shifted = np.interp(indices, np.arange(original_length), audio)
    
    # Step 2: Resample back to original length
    # This restores the original duration
    final_indices = np.linspace(0, len(shifted) - 1, original_length)
    result = np.interp(final_indices, np.arange(len(shifted)), shifted)
    
    return result.astype(np.float32)

Visual explanation:

Original audio (1 second, 24,000 samples):
[████████████████████████████████████████████████]

Raise pitch by 2 semitones (factor ≈ 1.122):
Step 1 - Resample down to ~21,381 samples (shorter, higher-pitched):
[██████████████████████████████████████████]

Step 2 - Interpolate back up to 24,000 samples:
[████████████████████████████████████████████████]
 └── Same duration, higher pitch!

Lower pitch by 2 semitones (factor ≈ 0.891):
Step 1 - Resample up to ~26,939 samples (longer, lower-pitched):
[██████████████████████████████████████████████████████]

Step 2 - Interpolate back down to 24,000 samples:
[████████████████████████████████████████████████]
 └── Same duration, lower pitch!
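
Note that double linear resampling is a deliberately simple, dependency-free approximation and loses some high-frequency detail. If an extra dependency is acceptable (librosa is not used by this project), a phase-vocoder pitch shifter gives cleaner results:

import librosa

# Raise pitch by 2 semitones while preserving duration (n_steps is in semitones)
shifted = librosa.effects.pitch_shift(audio, sr=24000, n_steps=2.0)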

9. Audio Processing Concepts

9.1 Digital Audio Basics

What is digital audio?

Sound is a continuous wave of air pressure. Digital audio represents this as discrete samples taken at regular intervals.

Analog Sound Wave:
    β•±β•²    β•±β•²    β•±β•²
   β•±  β•²  β•±  β•²  β•±  β•²
──╱────╲╱────╲╱────╲──
  
Digital Samples (at regular intervals):
  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚
  ●  ●     ●     ●     ●     ●
     ●        ●        ●
        ●        ●        ●

Key parameters:

| Parameter | Definition | Kokoro Value |
|---|---|---|
| Sample Rate | Samples per second | 24,000 Hz |
| Bit Depth | Precision per sample | 32-bit float |
| Channels | Mono vs stereo | Mono (1 channel) |

Nyquist Theorem:

  • To capture frequency F, sample at ≥2F
  • Human hearing: up to ~20kHz
  • 24kHz sample rate captures up to 12kHz (adequate for speech)

9.2 Audio as NumPy Arrays

import numpy as np

# Audio is a 1D array of floats
audio = np.array([0.0, 0.1, 0.15, 0.1, 0.0, -0.1, -0.15, -0.1, 0.0, ...])

# Value range: -1.0 to +1.0 (normalized)
# 0.0 = silence
# +1.0 = maximum positive pressure
# -1.0 = maximum negative pressure

# Duration calculation
sample_rate = 24000  # Hz
num_samples = len(audio)
duration_seconds = num_samples / sample_rate

# Example: 48000 samples at 24kHz = 2 seconds

9.3 Sample Rate Explained

Sample Rate = 24,000 Hz means:
- 24,000 measurements per second
- Each sample is 1/24000 = 0.0000417 seconds apart

Timeline:
0s              1s              2s
|───────────────|───────────────|
24000 samples   24000 samples   ...

Higher sample rate:
+ Better frequency reproduction
- Larger file size
- More processing required

Lower sample rate:
+ Smaller files
+ Faster processing
- Possible quality loss

9.4 Audio Normalization

def normalize_audio(audio: np.ndarray, target_db: float = -3.0) -> np.ndarray:
    """
    Why normalize?
    1. Consistent volume across different generations
    2. Prevent clipping (distortion when values exceed ±1.0)
    3. Optimal playback volume
    
    Why -3dB (not 0dB)?
    - Leaves "headroom" for peaks
    - Prevents distortion on some playback systems
    - Industry standard practice
    """
    
    # Find the peak (maximum absolute value)
    peak = np.max(np.abs(audio))
    
    if peak == 0:
        return audio  # Silent audio, nothing to normalize
    
    # Convert dB to amplitude
    # dB = 20 * log10(amplitude)
    # amplitude = 10^(dB/20)
    target_amplitude = 10 ** (target_db / 20)
    # -3dB → 10^(-3/20) ≈ 0.708
    
    # Calculate required gain
    gain = target_amplitude / peak
    
    # Apply gain
    return (audio * gain).astype(np.float32)

Decibels (dB) explained:

dB Scale (relative to maximum):

 0 dB ─────── Maximum (1.0 amplitude)
-3 dB ─────── 0.708 amplitude (our target)
-6 dB ─────── 0.5 amplitude
-12 dB ────── 0.25 amplitude
-20 dB ────── 0.1 amplitude
-40 dB ────── 0.01 amplitude
-60 dB ────── 0.001 amplitude (nearly silent)
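
The two conversions used above are handy as helpers; a minimal sketch:

import math

def db_to_amplitude(db: float) -> float:
    """dB relative to full scale -> linear amplitude."""
    return 10 ** (db / 20)

def amplitude_to_db(amplitude: float) -> float:
    """Linear amplitude -> dB relative to full scale."""
    return 20 * math.log10(amplitude)

print(round(db_to_amplitude(-3.0), 3))  # 0.708
print(round(amplitude_to_db(0.5), 1))   # -6.0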

9.5 Silence and Pauses

def create_silence(duration_ms: int, sample_rate: int) -> np.ndarray:
    """
    Create a silent audio segment.
    
    Silence = array of zeros
    """
    num_samples = int(sample_rate * duration_ms / 1000)
    return np.zeros(num_samples, dtype=np.float32)

# 300ms pause at 24kHz
pause = create_silence(300, 24000)
# Result: array of 7200 zeros

10. Gradio Interface Explained

10.1 Component Types Used

# TEXT INPUT
text_input = gr.Textbox(
    label="πŸ“ Text to Synthesize",
    placeholder="Enter your text here...",
    lines=6,        # Height in lines
    max_lines=15,   # Max expandable height
    info="Maximum 5000 characters"  # Helper text
)

# DROPDOWN SELECTION
voice_dropdown = gr.Dropdown(
    choices=[("Display Name", "value"), ...],  # (label, value) pairs
    value="af_heart",   # Default selection
    label="🎭 Voice",
    info="Select a voice"
)

# SLIDER
speed_slider = gr.Slider(
    minimum=0.5,    # Min value
    maximum=2.0,    # Max value
    value=1.0,      # Default
    step=0.05,      # Increment
    label="πŸƒ Speed"
)

# CHECKBOX
use_defaults = gr.Checkbox(
    label="Use Style Preset Defaults",
    value=True,     # Default checked
    info="When checked, style preset values override manual controls"
)

# AUDIO OUTPUT
audio_output = gr.Audio(
    label="πŸ”Š Generated Audio",
    type="numpy",       # Expects (sample_rate, numpy_array)
    interactive=False,  # User can't upload
    autoplay=True       # Auto-play when generated
)

# MARKDOWN
gr.Markdown("# Title")  # Rendered as HTML

10.2 Layout System

with gr.Blocks() as demo:
    
    # Row: horizontal layout
    with gr.Row():
        left = gr.Textbox(label="Left")
        right = gr.Textbox(label="Right")
    
    # Column: vertical layout (scale controls relative width)
    with gr.Column(scale=1):
        top = gr.Textbox(label="Top")
        bottom = gr.Textbox(label="Bottom")
    
    # Accordion: collapsible section
    with gr.Accordion("Advanced Options", open=False):
        gr.Slider(label="Hidden until expanded")
    
    # Tabs: tabbed interface
    with gr.Tab("Tab 1"):
        gr.Markdown("Content of tab 1")
    with gr.Tab("Tab 2"):
        gr.Markdown("Content of tab 2")

10.3 Event Handling

# Button click event
generate_btn.click(
    fn=generate_speech,           # Function to call
    inputs=[text, voice, speed],  # Input components
    outputs=[audio_output]        # Output components
)

# Dropdown change event
style_dropdown.change(
    fn=update_style_info,
    inputs=[style_dropdown],
    outputs=[info_markdown]
)

# Chained events (update multiple things)
style_dropdown.change(
    fn=update_controls,
    inputs=[style_dropdown, use_defaults],
    outputs=[speed_slider, pitch_slider, pause_slider]
)
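
A handler wired to several outputs returns one value per output component. Here is a sketch of what update_controls might look like (hypothetical body; the preset fields and the 300 ms base pause match the defaults used elsewhere in this guide):

def update_controls(style_key: str, use_defaults: bool):
    preset = STYLE_PRESETS[style_key]
    if not use_defaults:
        # gr.update() with no arguments leaves a component unchanged
        return gr.update(), gr.update(), gr.update()
    return (
        gr.update(value=preset.speed),
        gr.update(value=preset.pitch_shift),
        gr.update(value=int(300 * preset.pause_multiplier)),
    )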

10.4 CSS Customization

with gr.Blocks(
    css="""
        /* Custom class styling */
        .main-title {
            text-align: center;
            margin-bottom: 1rem;
        }
        
        /* Use CSS variables for theme compatibility */
        .info-box {
            border: 1px solid var(--border-color-primary);
            color: var(--body-text-color);
        }
    """
) as demo:
    gr.Markdown("...", elem_classes=["main-title"])

11. Deployment on Hugging Face Spaces

11.1 What is Hugging Face Spaces?

Hugging Face Spaces is a free hosting platform for ML demos:

  • Free CPU instances (2 vCPU, 16GB RAM)
  • GPU instances available (paid)
  • Git-based deployment
  • Automatic dependency installation

11.2 Configuration via README.md

The YAML frontmatter in README.md configures your Space:

---
# Required
title: Kokoro TTS                    # Display name
sdk: gradio                          # Framework (gradio/streamlit/static)
app_file: app.py                     # Entry point

# Recommended
sdk_version: 5.50.0                  # Pin Gradio version
python_version: "3.10"               # Pin Python version (CRITICAL for Kokoro)

# Optional
emoji: 🎙️                            # Favicon
colorFrom: blue                      # Gradient start
colorTo: purple                      # Gradient end
pinned: false                        # Pin to profile?
license: apache-2.0                  # License
suggested_hardware: cpu-basic        # Default hardware
short_description: TTS with 28 voices
tags:
  - text-to-speech
  - tts
---

11.3 Build Process

When you push to the Space repository:

1. HF detects changes
   │
   ▼
2. Reads README.md frontmatter
   │
   ▼
3. Creates Docker container
   │
   ├─▶ Base image: python:3.10
   │
   ├─▶ Installs packages.txt (apt-get)
   │   └── espeak-ng, ffmpeg, libsndfile1
   │
   └─▶ Installs requirements.txt (pip)
       └── kokoro, gradio, torch, etc.
   │
   ▼
4. Runs: python app.py
   │
   ▼
5. Exposes port 7860
   │
   ▼
6. Space is live!

11.4 Common Deployment Issues

| Issue | Cause | Solution |
|---|---|---|
| Build fails on packages.txt | Comments in file | Remove all comments |
| Kokoro not found | Python 3.13 | Set python_version: "3.10" |
| Gradio API error | Gradio 6.x breaking changes | Pin gradio==5.50.0 |
| Out of memory | Model too large | Use CPU basic, optimize code |
| Timeout on load | Slow model download | Add loading indicator |

11.5 Monitoring Your Space

  • Logs: Click the "Logs" tab to see stdout/stderr
  • Factory Rebuild: Settings → Factory Reboot (clears cache)
  • Container Restart: Settings → Restart Space

12. Troubleshooting & Common Issues

12.1 "Kokoro not found" Error

ERROR: Could not find a version that satisfies the requirement kokoro>=0.9.4

Cause: Kokoro requires Python 3.10-3.12, but HF Spaces defaults to 3.13

Solution: Add to README.md frontmatter:

python_version: "3.10"

12.2 Gradio "unexpected keyword argument" Error

TypeError: Blocks.launch() got an unexpected keyword argument 'show_api'

Cause: Gradio 6.x removed/moved several parameters

Solution: Pin Gradio version:

# requirements.txt
gradio==5.50.0

12.3 "Unable to locate package" Error

E: Unable to locate package # System dependencies

Cause: Comments in packages.txt

Solution: Remove ALL comments so the file contains only package names:

espeak-ng
ffmpeg
libsndfile1

12.4 Audio Not Playing

Possible causes:

  1. Audio array is empty (check generation succeeded)
  2. Wrong return format (must be (sample_rate, array))
  3. Sample rate mismatch

Debug:

print(f"Audio shape: {audio.shape}")
print(f"Audio range: [{audio.min()}, {audio.max()}]")
print(f"Sample rate: {sample_rate}")

12.5 Model Loading Timeout

Cause: First run downloads ~350MB model

Solution: Add loading indicator or pre-cache:

print("Loading Kokoro TTS Engine...")  # Shows in logs
pipeline = KPipeline(lang_code='a')    # Downloads model
print("Ready!")

13. Further Learning Resources

Official Documentation

  • Kokoro-82M model card: https://huggingface.co/hexgrad/Kokoro-82M
  • kokoro Python library: https://github.com/hexgrad/kokoro
  • Misaki G2P library: https://github.com/hexgrad/misaki
  • Gradio documentation: https://www.gradio.app/docs
  • Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces

Research Papers

| Paper | Topic |
|---|---|
| StyleTTS 2 (arXiv:2306.07691) | Core architecture |
| iSTFTNet (arXiv:2203.02395) | Vocoder |
| G2P blog | Grapheme-to-Phoneme |

Video Tutorials

Search for:

  • "Kokoro TTS tutorial"
  • "Gradio machine learning app"
  • "Hugging Face Spaces deployment"

Related Projects

| Project | Description |
|---|---|
| Coqui TTS | Open-source TTS library |
| Bark | Transformer-based TTS |
| VITS | Fast end-to-end TTS |
| Piper | Lightweight local TTS |

14. Glossary of Terms

| Term | Definition |
|---|---|
| TTS | Text-to-Speech: converting written text to spoken audio |
| G2P | Grapheme-to-Phoneme: converting letters to sounds |
| Grapheme | Written unit (letter or character) |
| Phoneme | Sound unit in a language |
| IPA | International Phonetic Alphabet: standard phoneme notation |
| Vocoder | Voice encoder: converts features to an audio waveform |
| Mel-spectrogram | Visual representation of audio frequencies over time |
| STFT | Short-Time Fourier Transform: converts audio to a spectrogram |
| iSTFT | Inverse STFT: converts a spectrogram back to audio |
| Embedding | Dense vector representation (e.g., voice identity) |
| Sample Rate | Audio samples per second (Hz) |
| Bit Depth | Precision of each audio sample |
| Normalization | Adjusting audio volume to a target level |
| Decibels (dB) | Logarithmic unit for audio levels |
| Transformer | Neural network architecture using attention |
| Inference | Running a trained model to make predictions |
| Latent Space | Compressed representation learned by a model |
| Fine-tuning | Adapting a pre-trained model to new data |
| API | Application Programming Interface |
| SDK | Software Development Kit |

Conclusion

Congratulations on completing this comprehensive guide! You now understand:

✅ How modern TTS systems work (from text to audio)
✅ The Kokoro-82M architecture (StyleTTS2 + ISTFTNet)
✅ The complete pipeline (G2P → Synthesis → Post-processing)
✅ Every file in the project and its purpose
✅ Audio processing fundamentals (sample rate, normalization, pitch shifting)
✅ Gradio interface development
✅ Hugging Face Spaces deployment

Next Steps

  1. Experiment: Try different voices and parameters
  2. Extend: Add new features (voice mixing, SSML support)
  3. Optimize: Profile and improve performance
  4. Learn: Dive into the StyleTTS2 paper
  5. Build: Create your own TTS applications!

Document created by Yash Chowdhary
For the Kokoro TTS Academic Project