🎙️ Kokoro TTS: Complete Technical Guide & Learning Documentation
Created by: Yash Chowdhary
Document Version: 1.0
Last Updated: February 2026
Table of Contents
- Introduction
- Project Architecture Overview
- Understanding Text-to-Speech (TTS)
- The Kokoro-82M Model Deep Dive
- File-by-File Breakdown
- Dependencies & Libraries Explained
- The TTS Pipeline: Step-by-Step
- Code Walkthrough
- Audio Processing Concepts
- Gradio Interface Explained
- Deployment on Hugging Face Spaces
- Troubleshooting & Common Issues
- Further Learning Resources
- Glossary of Terms
1. Introduction
What is This Project?
This is an academic Text-to-Speech (TTS) application that converts written text into natural-sounding human speech. It's built using:
- Kokoro-82M: A state-of-the-art, lightweight TTS model
- Gradio: A Python library for building web interfaces
- Hugging Face Spaces: Free cloud hosting for ML applications
Why Kokoro?
| Feature | Kokoro-82M | Traditional Large Models |
|---|---|---|
| Parameters | 82 million | 1-3 billion |
| Model Size | ~330 MB | 5-15 GB |
| Quality | Near state-of-the-art | State-of-the-art |
| Speed (CPU) | 3-11× real-time | 0.1-0.5× real-time |
| License | Apache 2.0 (Free) | Often proprietary |
Kokoro proves that smaller models can achieve remarkable quality when properly designed.
Project Goals
- Learn how modern TTS systems work
- Understand the complete pipeline from text to audio
- Build a functional, deployable application
- Demonstrate practical ML engineering skills
2. Project Architecture Overview
High-Level System Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│                         USER INTERFACE (Gradio)                         │
│  ┌──────────────┬──────────────┬──────────────┬──────────────────────┐  │
│  │  Text Input  │ Voice Select │ Style Preset │  Advanced Controls   │  │
│  │              │  (28 voices) │  (7 styles)  │  Speed/Pitch/Pause   │  │
│  └──────────────┴──────────────┴──────────────┴──────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │ User clicks "Generate"
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           TEXT PREPROCESSING                            │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ 1. Clean whitespace and normalize                                 │  │
│  │ 2. Expand abbreviations (Dr. → Doctor, etc.)                      │  │
│  │ 3. Enforce character limits (max 5000 chars)                      │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      KOKORO TTS ENGINE (KPipeline)                      │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ STAGE 1: Grapheme-to-Phoneme (G2P) via Misaki                     │  │
│  │          "Hello world" → "həlˈO wˈɜɹld"                           │  │
│  ├───────────────────────────────────────────────────────────────────┤  │
│  │ STAGE 2: Voice Pack Loading                                       │  │
│  │          Load speaker embedding (e.g., af_heart.pt → 523KB tensor)│  │
│  ├───────────────────────────────────────────────────────────────────┤  │
│  │ STAGE 3: Neural Audio Synthesis                                   │  │
│  │          StyleTTS2 Decoder + ISTFTNet Vocoder → Raw Audio Waveform│  │
│  └───────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         AUDIO POST-PROCESSING                           │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ 1. Combine audio segments                                         │  │
│  │ 2. Insert pauses between sentences                                │  │
│  │ 3. Apply pitch shift (if requested)                               │  │
│  │ 4. Normalize volume to -3dB peak                                  │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              AUDIO OUTPUT                               │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ Format: 32-bit float WAV @ 24,000 Hz sample rate                  │  │
│  │ Playback in browser + Download capability                         │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
Data Flow Summary
Text (string)
      ↓
Phonemes (IPA symbols)
      ↓
Token IDs (integers)
      ↓
Neural Network Processing
      ↓
Audio Waveform (numpy array)
      ↓
Post-processed Audio (normalized, with pauses)
      ↓
Playable Audio File
3. Understanding Text-to-Speech (TTS)
What is TTS?
Text-to-Speech is the technology that converts written text into spoken audio. Modern TTS systems use deep learning to produce remarkably natural-sounding speech.
The Evolution of TTS
| Generation | Era | Technology | Example |
|---|---|---|---|
| 1st | 1960s-1980s | Rule-based synthesis | DECtalk |
| 2nd | 1990s-2000s | Concatenative (splice recordings) | AT&T Natural Voices |
| 3rd | 2010s | Statistical parametric (HMM) | Festival |
| 4th | 2016+ | Neural networks (Deep Learning) | Tacotron, WaveNet |
| 5th | 2023+ | Transformer-based | Kokoro, XTTS, Bark |
Key Concepts in Modern TTS
3.1 Graphemes vs Phonemes
Graphemes are the written letters/characters:
"Hello" = H + e + l + l + o (5 graphemes)
Phonemes are the sound units:
"Hello" = /h/ + /Ι/ + /l/ + /oΚ/ (4 phonemes)
Why phonemes matter: English spelling is inconsistent!
- "though", "through", "thought", "tough" β all different sounds for "ough"
- The model needs consistent sound representations, not arbitrary spellings
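You can see this directly with the Misaki G2P library introduced in Section 6.2. A minimal sketch (the exact phoneme strings depend on your Misaki version and dictionary):
from misaki import en

# Sketch: phonemize the four "ough" words with Misaki G2P (see Section 6.2).
# Exact output strings vary with the Misaki version; the point is that the
# same spelling maps to four different sound patterns.
g2p = en.G2P(trf=False, british=False, fallback=None)

for word in ["though", "through", "thought", "tough"]:
    phonemes, _tokens = g2p(word)
    print(f"{word:>8} -> {phonemes}")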
3.2 The TTS Pipeline (Traditional)
┌──────────┐    ┌─────────────┐    ┌───────────┐    ┌─────────┐
│   Text   │───▶│    Text     │───▶│ Acoustic  │───▶│ Vocoder │───▶ Audio
│          │    │  Analysis   │    │   Model   │    │         │
└──────────┘    └─────────────┘    └───────────┘    └─────────┘
                      │                  │               │
                      ▼                  ▼               ▼
               - G2P conversion   - Mel spectrograms  - Waveform from
               - Tokenization     - Duration            spectrogram
               - Normalization    - Pitch/prosody       to audio
3.3 Kokoro's Innovation: Decoder-Only Architecture
Traditional TTS uses a two-stage approach:
- Encoder: Text → Hidden representation
- Decoder: Hidden representation → Audio
Kokoro simplifies this:
- Decoder Only: Phonemes → Audio (directly!)
This eliminates computational overhead and reduces model size.
4. The Kokoro-82M Model Deep Dive
4.1 Model Specifications
| Attribute | Value |
|---|---|
| Full Name | Kokoro-82M v1.0 |
| Parameters | 82 million |
| Architecture | StyleTTS2 + ISTFTNet (Decoder-only) |
| Input | Phoneme tokens (up to 510 tokens) |
| Output | 24kHz audio waveform |
| Voice Packs | 54 voices across 8 languages |
| Training Data | <100 hours (v0.19), a few hundred hours (v1.0) |
| Training Cost | ~$1000 total (1000 A100 GPU hours) |
| License | Apache 2.0 |
4.2 Architecture Components
StyleTTS2 (The Brain)
StyleTTS2 is the foundation architecture, published in the paper:
"StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"
(Li et al., 2023) - arXiv:2306.07691
Key innovations:
- Style as latent variables: Speech style (emotion, prosody) is modeled as random variables
- Adversarial training: Uses discriminators trained on real speech to improve naturalness
- No reference audio needed: Can generate appropriate styles from text alone
ISTFTNet (The Voice)
ISTFTNet is the vocoder component, from the paper:
"iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform"
(Kaneko et al., 2022) - arXiv:2203.02395
Key innovations:
- Direct waveform generation: Uses inverse Short-Time Fourier Transform
- Lightweight: Much smaller than GAN-based vocoders like HiFi-GAN
- Fast inference: Optimized for real-time synthesis
How They Work Together
┌──────────────────────────────────────────────────────────────────┐
│                      KOKORO-82M ARCHITECTURE                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Phoneme Tokens ──▶ ┌──────────────────────────────────────┐     │
│  (from Misaki)      │        StyleTTS2 Transformer         │     │
│                     │        ─────────────────────         │     │
│                     │  • Self-attention layers             │     │
│  Voice Embedding ──▶│  • Style conditioning                │     │
│  (speaker identity) │  • Duration prediction               │     │
│                     │  • Prosody modeling                  │     │
│                     └──────────────────┬───────────────────┘     │
│                                        │                         │
│                                        ▼                         │
│                     ┌──────────────────────────────────────┐     │
│                     │          ISTFTNet Vocoder            │     │
│                     │          ────────────────            │     │
│                     │  • Mel-spectrogram generation        │     │
│                     │  • Inverse STFT                      │     │
│                     │  • Waveform synthesis                │     │
│                     └──────────────────┬───────────────────┘     │
│                                        │                         │
│                                        ▼                         │
│                                 Audio Waveform                   │
│                             (24kHz, 32-bit float)                │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
4.3 Voice Packs Explained
Each voice is stored as a voice embedding (also called "speaker embedding"):
- File format: .pt (PyTorch tensor)
- Size: ~523 KB per voice
- Content: A 256-dimensional vector that captures speaker identity
# How voice packs work internally
voice_embedding = load("af_heart.pt") # Shape: (512, 1, 256)
# This embedding tells the model HOW to speak, not WHAT to speak
The naming convention:
af_heart
││ └───── voice name
│└─────── gender (f=female, m=male)
└──────── accent (a=American, b=British)
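A tiny helper (illustrative only, not part of app.py) can decode any voice ID according to this convention:
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "Female", "m": "Male"}

def describe_voice(voice_id: str) -> str:
    """Decode a Kokoro voice ID like 'af_heart' per the naming convention."""
    accent, gender = voice_id[0], voice_id[1]
    name = voice_id.split("_", 1)[1]
    return f"{name.title()}: {GENDERS[gender]}, {ACCENTS[accent]} accent"

print(describe_voice("af_heart"))   # Heart: Female, American accent
print(describe_voice("bm_george"))  # George: Male, British accent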
4.4 Why 82M Parameters is Enough
Traditional wisdom: "bigger models = better quality"
Kokoro challenges this by:
- Efficient architecture: Decoder-only removes encoder overhead
- Phoneme input: G2P preprocessing reduces model's job (doesn't need to learn spelling)
- Quality training data: Small but high-quality dataset beats large noisy datasets
- Focused scope: Optimized for TTS only, not multi-task
Comparison:
| Model | Parameters | Quality Ranking |
|---|---|---|
| Kokoro-82M | 82M | #1 (TTS Arena) |
| XTTS | 467M | #2-3 |
| MetaVoice | 1.2B | #3-4 |
| Bark | 1B+ | #4-5 |
5. File-by-File Breakdown
Project Structure
kokoro-tts-app/
├── app.py              # Main application (754 lines)
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies
├── README.md           # Documentation + HF Space config
├── examples.py         # Standalone usage examples
└── .gitignore          # Git ignore rules
5.1 app.py – The Main Application
This is the heart of the project. Let's break it down by sections:
Section 1: Imports and Configuration (Lines 1-170)
"""
Kokoro TTS - Academic Text-to-Speech Application
================================================
Created by: Yash Chowdhary
"""
import gradio as gr # Web interface
import numpy as np # Numerical operations
import soundfile as sf # Audio file I/O
import re # Regular expressions
from typing import Optional, Tuple # Type hints
from dataclasses import dataclass # Data structures
from kokoro import KPipeline # The TTS engine
What each import does:
| Import | Purpose |
|---|---|
| gradio | Creates the web UI (buttons, sliders, audio player) |
| numpy | Handles audio as numerical arrays |
| soundfile | Reads/writes audio files |
| re | Pattern matching for text preprocessing |
| typing | Adds type hints for better code documentation |
| dataclasses | Creates clean data structures (like StylePreset) |
| kokoro.KPipeline | The actual TTS engine |
Section 2: Voice Catalog (Lines 38-74)
VOICE_CATALOG = {
# voice_id -> (display_name, gender, accent, quality_grade, description)
"af_heart": ("Heart β€οΈ", "Female", "American", "A", "Premium quality, warm and natural"),
"af_bella": ("Bella π₯", "Female", "American", "A-", "Clear and expressive"),
# ... 26 more voices
}
This dictionary maps voice IDs to their metadata. The quality grades (A, B, C, D) are from the official Kokoro documentation and reflect training data quality.
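Because the catalog is a plain dictionary, derived views are one-liners. For example, this sketch (assuming the tuple layout shown above) builds (label, value) dropdown choices for grade-A female voices:
# Sketch: build (display_name, voice_id) dropdown choices for grade-A
# female voices, using the (name, gender, accent, grade, description)
# tuple layout shown above.
female_a_voices = [
    (meta[0], voice_id)
    for voice_id, meta in VOICE_CATALOG.items()
    if meta[1] == "Female" and meta[3].startswith("A")
]
# e.g. [("Heart ❤️", "af_heart"), ("Bella 🔥", "af_bella"), ...]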
Section 3: Style Presets (Lines 77-145)
@dataclass
class StylePreset:
"""Defines a style preset with associated audio parameters."""
name: str
description: str
speed: float # 0.5 to 2.0
pitch_shift: float # semitones (-5 to +5)
pause_multiplier: float
recommended_voices: list
STYLE_PRESETS = {
"dramatic": StylePreset(
name="Dramatic / Horror",
description="Slower, deeper voice for suspenseful content",
speed=0.85,
pitch_shift=-2, # Lower pitch = deeper voice
pause_multiplier=1.5, # Longer pauses = more tension
recommended_voices=["am_fenrir", "am_onyx", "bm_george"]
),
# ... more presets
}
Why use @dataclass?
- Automatically generates __init__, __repr__, and other methods
- Cleaner than regular classes for data containers
- Every field is declared with a type hint (documented, though not enforced at runtime)
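In practice this means preset values read as plain attributes (field names from the StylePreset definition above):
# Reading preset values as attributes
preset = STYLE_PRESETS["dramatic"]
print(preset.name)         # "Dramatic / Horror"
print(preset.speed)        # 0.85
print(preset.pitch_shift)  # -2
print(preset)              # auto-generated __repr__ shows every field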
Section 4: Audio Processing Functions (Lines 150-275)
These are utility functions for manipulating audio:
pitch_shift_audio() – Changes the pitch without changing speed
def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
"""
Shift pitch using resampling technique.
How it works:
1. To raise pitch: Speed up audio, then slow it back down
2. To lower pitch: Slow down audio, then speed it back up
The math: factor = 2^(semitones/12)
- 12 semitones = 1 octave = 2x frequency
- 1 semitone ≈ 1.059x frequency
"""
factor = 2 ** (semitones / 12)
# ... resampling logic
insert_pauses() – Adds silence between segments
def insert_pauses(audio_segments: list, pause_duration_ms: int, sample_rate: int):
"""
Insert silence between audio segments.
pause_duration_ms=300 at 24000Hz = 7200 samples of zeros
"""
pause_samples = int(sample_rate * pause_duration_ms / 1000)
silence = np.zeros(pause_samples, dtype=np.float32)
# ... concatenation logic
normalize_audio() – Ensures consistent volume
def normalize_audio(audio: np.ndarray, target_db: float = -3.0):
"""
Normalize to target dB level.
Why -3dB? Leaves headroom to prevent clipping while
maintaining good volume.
Formula: gain = 10^(target_db/20) / peak_amplitude
"""
Section 5: TTS Engine Class (Lines 280-410)
class KokoroTTSEngine:
"""
Wrapper class for Kokoro TTS with additional processing capabilities.
"""
def __init__(self):
# Initialize pipelines for both accents
self.pipelines = {
'a': KPipeline(lang_code='a'), # American English
'b': KPipeline(lang_code='b'), # British English
}
# Add custom pronunciation for "Kokoro"
self.pipelines['a'].g2p.lexicon.golds['kokoro'] = 'kˈOkəɹO'
Why two pipelines?
- American and British English have different phoneme sets
- Different pronunciations: "schedule" = /ˈskedʒuːl/ (US) vs /ˈʃedjuːl/ (UK)
- The voice ID's first letter determines which pipeline to use
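You can observe the accent difference directly with Misaki's british flag, as in this sketch (exact phoneme output depends on the Misaki version):
from misaki import en

# Sketch: compare US vs UK phonemization of "schedule" via the `british` flag.
g2p_us = en.G2P(trf=False, british=False, fallback=None)
g2p_uk = en.G2P(trf=False, british=True, fallback=None)

for label, g2p in [("US", g2p_us), ("UK", g2p_uk)]:
    phonemes, _tokens = g2p("schedule")
    print(f"{label}: {phonemes}")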
Section 6: Gradio Interface (Lines 550-754)
with gr.Blocks(
title="Kokoro TTS - Academic Text-to-Speech",
theme=gr.themes.Soft(),
css="..." # Custom styling
) as demo:
# Header
gr.Markdown("# ποΈ Kokoro TTS...")
# Input controls
text_input = gr.Textbox(label="Text to Synthesize")
voice_dropdown = gr.Dropdown(choices=..., label="Voice")
# Output
audio_output = gr.Audio(label="Generated Audio")
# Event handler
generate_btn.click(
fn=generate_speech,
inputs=[text_input, voice_dropdown, ...],
outputs=[audio_output]
)
5.2 requirements.txt – Python Dependencies
kokoro>=0.9.4 # The TTS model and pipeline
soundfile>=0.12.1 # Audio file reading/writing
numpy>=1.24.0 # Numerical array operations
gradio==5.50.0 # Web interface framework
torch>=2.0.0 # Deep learning framework
torchaudio>=2.0.0 # Audio processing for PyTorch
misaki[en]>=0.9.0 # Grapheme-to-Phoneme conversion
Why these specific versions?
- gradio==5.50.0: Pinned to avoid breaking changes in Gradio 6.x
- kokoro>=0.9.4: Requires Python 3.10-3.12 (not 3.13!)
- misaki[en]: The [en] extra installs English language support
5.3 packages.txt – System Dependencies
espeak-ng # Fallback phonemizer for out-of-vocabulary words
ffmpeg # Audio encoding/decoding
libsndfile1 # C library for reading/writing audio files
Important: This file must contain ONLY package names, no comments!
5.4 README.md – Documentation + Configuration
The README serves two purposes:
- YAML Frontmatter (lines 1-19): Configures Hugging Face Spaces
---
title: Kokoro TTS - Academic Text-to-Speech
emoji: 🎙️
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10" # Critical! Kokoro needs Python <3.13
suggested_hardware: cpu-basic
---
- Documentation (rest of file): User guide and technical details
6. Dependencies & Libraries Explained
6.1 Kokoro (kokoro package)
What it is: The main TTS library that wraps the Kokoro-82M model.
Key classes:
from kokoro import KPipeline, KModel
# KPipeline: High-level interface (recommended)
pipeline = KPipeline(lang_code='a') # 'a'=American, 'b'=British
# Generate speech
for graphemes, phonemes, audio in pipeline(text, voice='af_heart', speed=1.0):
# graphemes: original text chunk
# phonemes: IPA representation
# audio: numpy array of audio samples
# KModel: Low-level model access
model = KModel().to('cpu').eval()
audio = model(phoneme_tokens, voice_embedding, speed)
Internal workflow:
KPipeline
    │
    ├──▶ Misaki G2P ──▶ Text to Phonemes
    │
    ├──▶ Voice Loader ──▶ Load speaker embedding
    │
    └──▶ KModel ──▶ Generate audio
6.2 Misaki (misaki package)
What it is: Grapheme-to-Phoneme (G2P) library designed for Kokoro.
How it works:
from misaki import en
g2p = en.G2P(trf=False, british=False, fallback=None)
text = "Hello, world!"
phonemes, tokens = g2p(text)
# phonemes: "həlˈO, wˈɜɹld!"
# tokens: list of token IDs for the model
G2P Strategy (Hybrid approach):
Input Text
    │
    ▼
┌──────────────────────────────────────┐
│ 1. Dictionary Lookup (Gold/Silver)   │ ◀── Known words
└──────────────────────────────────────┘
    │ Unknown word?
    ▼
┌──────────────────────────────────────┐
│ 2. Rule-based Fallback               │ ◀── espeak-ng
└──────────────────────────────────────┘
    │ Still unknown?
    ▼
┌──────────────────────────────────────┐
│ 3. Neural Network Fallback           │ ◀── BART-based model
└──────────────────────────────────────┘
    │
    ▼
Phoneme Output
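Conceptually, the cascade behaves like this sketch. The function and dictionary names here are hypothetical, not Misaki's real internals; they only mirror the three stages above:
# Hypothetical sketch of the three-stage lookup cascade (NOT Misaki's
# real internals; names are invented for illustration).
def phonemize_word(word, gold_dict, silver_dict, espeak_fallback, neural_fallback):
    for table in (gold_dict, silver_dict):   # 1. dictionary lookup
        if word.lower() in table:
            return table[word.lower()]
    phonemes = espeak_fallback(word)         # 2. rule-based fallback (espeak-ng)
    if phonemes is not None:
        return phonemes
    return neural_fallback(word)             # 3. neural fallback (BART-based)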
Custom pronunciations:
# Use Markdown-style syntax for custom pronunciation
text = "[Kokoro](/kΛOkΙΙΉO/) is a TTS model."
# The /slashes/ contain IPA phonemes
Phoneme inventory:
Misaki uses 49 phonemes for English:
- 41 shared between US and UK
- 4 American-only (like the "r" in "car")
- 4 British-only (like the "ɒ" in "lot")
6.3 PyTorch (torch package)
What it is: The deep learning framework that runs the neural network.
Role in this project:
- Loads model weights (.pth files)
- Runs forward passes through the network
- Handles tensor operations on CPU/GPU
import torch
# Model inference
with torch.no_grad(): # Disable gradient computation (faster)
audio_tensor = model(phonemes, voice_embedding, speed)
audio_numpy = audio_tensor.numpy() # Convert to numpy for playback
6.4 Gradio (gradio package)
What it is: A Python library for building web interfaces for ML models.
Key concepts:
import gradio as gr
# Components (UI elements)
text_input = gr.Textbox(label="Input")
slider = gr.Slider(minimum=0, maximum=1)
audio_output = gr.Audio()
# Blocks (layout container)
with gr.Blocks() as demo:
with gr.Row(): # Horizontal layout
with gr.Column(): # Vertical layout
# Components here
# Event handlers
button.click(
fn=my_function, # Python function to call
inputs=[text_input], # Input components
outputs=[audio_output] # Output components
)
# Launch
demo.launch()
Why Gradio?
- No frontend code needed (HTML/CSS/JS)
- Automatic API generation
- Easy deployment to Hugging Face Spaces
- Built-in audio player with download
6.5 NumPy (numpy package)
What it is: Fundamental library for numerical computing in Python.
Role in audio processing:
import numpy as np
# Audio is represented as a 1D array of floats
audio = np.array([0.0, 0.1, 0.2, -0.1, ...], dtype=np.float32)
# Sample rate: 24000 Hz means 24000 samples = 1 second
# 1 minute of audio = 24000 * 60 = 1,440,000 samples
# Creating silence
silence = np.zeros(24000, dtype=np.float32) # 1 second of silence
# Concatenating audio
combined = np.concatenate([audio1, silence, audio2])
# Normalization
peak = np.max(np.abs(audio))
normalized = audio / peak * 0.9 # Scale to 90% of max
6.6 SoundFile (soundfile package)
What it is: Library for reading and writing audio files.
import soundfile as sf
# Write audio to file
sf.write('output.wav', audio_array, samplerate=24000)
# Read audio from file
audio, samplerate = sf.read('input.wav')
Supported formats: WAV, FLAC, OGG, and more.
7. The TTS Pipeline: Step-by-Step
Let's trace what happens when you click "Generate Speech":
Step 1: Text Input Received
text = "Hello, world! This is Kokoro speaking."
Step 2: Text Preprocessing
def preprocess_text(text):
# Clean whitespace
text = re.sub(r'\s+', ' ', text.strip())
# Expand abbreviations
text = text.replace("Dr.", "Doctor")
text = text.replace("Mr.", "Mister")
# etc.
return text
# Result: "Hello, world! This is Kokoro speaking."
Step 3: Pipeline Selection
voice = "af_heart" # American female
lang_code = voice[0] # 'a' = American
pipeline = pipelines['a'] # American English pipeline
Step 4: Grapheme-to-Phoneme Conversion
# Inside the pipeline:
phonemes = pipeline.g2p("Hello, world!")
# Result: "hΙlΛO, wΛΙΙΉld!"
What happens inside G2P:
"Hello"
β Lookup in dictionary
β Found: "hΙlΛO"
"world"
β Lookup in dictionary
β Found: "wΛΙΙΉld"
Punctuation preserved: ", !"
Step 5: Tokenization
# Phonemes converted to token IDs
tokens = [50, 157, 43, 135, ...] # Integer IDs
# Each phoneme has a unique ID (defined in config.json)
# Maximum context: 510 tokens
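A toy sketch of the mapping (the IDs below are invented for illustration; the real table lives in the model's config.json):
# Toy sketch of phoneme tokenization; real IDs come from config.json.
VOCAB = {"h": 50, "ə": 157, "l": 43, "ˈ": 135}   # made-up IDs

def tokenize(phonemes: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[p] for p in phonemes if p in vocab]

tokens = tokenize("həlˈ", VOCAB)  # -> [50, 157, 43, 135]
assert len(tokens) <= 510         # the model's context limit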
Step 6: Voice Embedding Loading
# Load the voice pack
voice_pack = pipeline.load_voice("af_heart")
# Result: Tensor of shape (512, 1, 256)
# Select embedding based on token count
ref_s = voice_pack[len(tokens) - 1]
Step 7: Neural Network Forward Pass
# The actual synthesis
audio = model(
tokens, # What to say (phoneme IDs)
ref_s, # How to say it (voice embedding)
speed # How fast (1.0 = normal)
)
# Result: Tensor of audio samples
Inside the model:
Tokens + Voice Embedding
          │
          ▼
┌─────────────────────┐
│ Transformer Layers  │ ◀─ Self-attention, style modeling
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Duration Predictor  │ ◀─ How long each sound lasts
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│  Mel-Spectrogram    │ ◀─ Intermediate representation
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│  ISTFTNet Vocoder   │ ◀─ Convert to waveform
└─────────────────────┘
          │
          ▼
   Audio Waveform
Step 8: Audio Post-Processing
# Combine segments
audio_segments = [seg1, seg2, seg3]
combined = insert_pauses(audio_segments, pause_duration_ms=300, sample_rate=24000)
# Apply pitch shift if requested
if pitch_shift != 0:
combined = pitch_shift_audio(combined, 24000, pitch_shift)
# Normalize volume
combined = normalize_audio(combined, target_db=-3.0)
Step 9: Output
return (24000, combined) # (sample_rate, audio_array)
# Gradio displays this in the Audio component
8. Code Walkthrough
8.1 The generate_speech() Function
This is the main callback function that Gradio calls:
def generate_speech(
text: str, # User's input text
voice: str, # Voice ID like "af_heart"
style: str, # Style preset like "dramatic"
speed: float, # Speed multiplier
pitch: float, # Pitch shift in semitones
pause: int, # Pause duration in ms
use_style_defaults: bool, # Use preset values?
) -> Tuple[int, np.ndarray]:
"""
Main generation function for Gradio interface.
Returns:
Tuple of (sample_rate, audio_array) for Gradio Audio component
"""
# Validation
if not text.strip():
gr.Warning("Please enter some text to synthesize.")
return None
try:
if use_style_defaults:
# Use style preset parameters
sample_rate, audio = tts_engine.generate_with_style(
text=text,
voice=voice,
style_preset=style,
)
else:
# Use manual control parameters
sample_rate, audio = tts_engine.generate(
text=text,
voice=voice,
speed=speed,
pitch_shift=pitch,
pause_between_sentences_ms=pause,
)
return (sample_rate, audio)
    except Exception as e:
        # gr.Error must be *raised* for Gradio to display it
        raise gr.Error(f"Generation failed: {str(e)}")
8.2 The KokoroTTSEngine.generate() Method
def generate(
self,
text: str,
voice: str = "af_heart",
speed: float = 1.0,
pitch_shift: float = 0.0,
pause_between_sentences_ms: int = 300,
) -> Tuple[int, np.ndarray]:
"""
Generate speech from text with full parameter control.
"""
# 1. Preprocess and validate
text = preprocess_text(text.strip()[:MAX_CHAR_LIMIT])
if not text:
return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
# 2. Clamp parameters to valid ranges
speed = max(0.5, min(2.0, speed))
pitch_shift = max(-5, min(5, pitch_shift))
# 3. Select pipeline based on voice accent
lang_code = voice[0] if voice[0] in self.pipelines else 'a'
pipeline = self.pipelines[lang_code]
# 4. Generate audio segments
audio_segments = []
try:
# The pipeline yields (graphemes, phonemes, audio) tuples
for _, phonemes, audio in pipeline(text, voice=voice, speed=speed):
if audio is not None:
# Convert PyTorch tensor to numpy if needed
audio_np = audio.numpy() if hasattr(audio, 'numpy') else audio
audio_segments.append(audio_np)
except Exception as e:
print(f"Generation error: {e}")
return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
if not audio_segments:
return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
# 5. Post-process: combine with pauses
combined_audio = insert_pauses(
audio_segments,
pause_between_sentences_ms,
SAMPLE_RATE
)
# 6. Post-process: apply pitch shift
if pitch_shift != 0:
combined_audio = pitch_shift_audio(
combined_audio,
SAMPLE_RATE,
pitch_shift
)
# 7. Post-process: normalize volume
combined_audio = normalize_audio(combined_audio)
return SAMPLE_RATE, combined_audio
8.3 Pitch Shifting Explained
def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
"""
Shift the pitch of audio by a given number of semitones.
THEORY:
-------
Pitch and time are linked. If you play audio faster:
- Duration decreases
- Pitch increases (chipmunk effect)
To change pitch WITHOUT changing duration:
1. Resample to change pitch (also changes duration)
2. Resample again to restore original duration
MATH:
-----
- 1 octave = 12 semitones = 2x frequency
- factor = 2^(semitones/12)
- +1 semitone = 2^(1/12) ≈ 1.059x frequency
- -1 semitone = 2^(-1/12) ≈ 0.944x frequency
"""
if semitones == 0:
return audio # No change needed
# Calculate the pitch shift factor
factor = 2 ** (semitones / 12)
original_length = len(audio)
# Step 1: Resample to shift pitch
    # If factor > 1 (raise pitch), we need FEWER samples (shorter = faster = higher)
    # If factor < 1 (lower pitch), we need MORE samples (longer = slower = lower)
new_length = int(original_length / factor)
# Create indices for interpolation
indices = np.linspace(0, original_length - 1, new_length)
# Linear interpolation (simple but effective)
shifted = np.interp(indices, np.arange(original_length), audio)
# Step 2: Resample back to original length
# This restores the original duration
final_indices = np.linspace(0, len(shifted) - 1, original_length)
result = np.interp(final_indices, np.arange(len(shifted)), shifted)
return result.astype(np.float32)
Visual explanation:
Original audio (1 second, 24000 samples):
[████████████████████████████████████████████████]

Raise pitch by 2 semitones (factor = 1.122):
Step 1 - Compress to 21384 samples (shorter, plays faster/higher):
[██████████████████████████████████████████]
Step 2 - Stretch back to 24000 samples:
[████████████████████████████████████████████████]
    └── Same duration, higher pitch!

Lower pitch by 2 semitones (factor = 0.891):
Step 1 - Stretch to 26935 samples (longer, plays slower/lower):
[██████████████████████████████████████████████████████]
Step 2 - Compress back to 24000 samples:
[████████████████████████████████████████████████]
    └── Same duration, lower pitch!
9. Audio Processing Concepts
9.1 Digital Audio Basics
What is digital audio?
Sound is a continuous wave of air pressure. Digital audio represents this as discrete samples taken at regular intervals.
Analog Sound Wave:
        ╱╲        ╱╲        ╱╲
       ╱  ╲      ╱  ╲      ╱  ╲
  ────╱────╲────╱────╲────╱────╲────

Digital Samples (at regular intervals):
      •         •         •
   •     •   •     •   •     •
  •       • •       • •       •
Key parameters:
| Parameter | Definition | Kokoro Value |
|---|---|---|
| Sample Rate | Samples per second | 24,000 Hz |
| Bit Depth | Precision per sample | 32-bit float |
| Channels | Mono vs Stereo | Mono (1 channel) |
Nyquist Theorem:
- To capture frequency F, sample at ≥ 2F
- Human hearing: up to ~20kHz
- 24kHz sample rate captures up to 12kHz (adequate for speech)
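The arithmetic as a quick check:
# Nyquist check: highest frequency representable at Kokoro's sample rate.
sample_rate = 24_000           # Hz
nyquist = sample_rate / 2      # 12,000 Hz
print(f"Max representable frequency: {nyquist:.0f} Hz")
# Speech energy sits mostly well below this, so 24 kHz is adequate.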
9.2 Audio as NumPy Arrays
import numpy as np
# Audio is a 1D array of floats
audio = np.array([0.0, 0.1, 0.15, 0.1, 0.0, -0.1, -0.15, -0.1, 0.0, ...])
# Value range: -1.0 to +1.0 (normalized)
# 0.0 = silence
# +1.0 = maximum positive pressure
# -1.0 = maximum negative pressure
# Duration calculation
sample_rate = 24000 # Hz
num_samples = len(audio)
duration_seconds = num_samples / sample_rate
# Example: 48000 samples at 24kHz = 2 seconds
9.3 Sample Rate Explained
Sample Rate = 24,000 Hz means:
- 24,000 measurements per second
- Each sample is 1/24000 = 0.0000417 seconds apart
Timeline:
0s              1s              2s
|───────────────|───────────────|
  24000 samples   24000 samples  ...
Higher sample rate:
  + Better frequency reproduction
  - Larger file size
  - More processing required
Lower sample rate:
  + Smaller files
  + Faster processing
  - Possible quality loss
9.4 Audio Normalization
def normalize_audio(audio: np.ndarray, target_db: float = -3.0) -> np.ndarray:
"""
Why normalize?
1. Consistent volume across different generations
2. Prevent clipping (distortion when values exceed Β±1.0)
3. Optimal playback volume
Why -3dB (not 0dB)?
- Leaves "headroom" for peaks
- Prevents distortion on some playback systems
- Industry standard practice
"""
# Find the peak (maximum absolute value)
peak = np.max(np.abs(audio))
if peak == 0:
return audio # Silent audio, nothing to normalize
# Convert dB to amplitude
# dB = 20 * log10(amplitude)
# amplitude = 10^(dB/20)
target_amplitude = 10 ** (target_db / 20)
    # -3dB ≈ 10^(-3/20) ≈ 0.708
# Calculate required gain
gain = target_amplitude / peak
# Apply gain
return (audio * gain).astype(np.float32)
Decibels (dB) explained:
dB Scale (relative to maximum):
   0 dB ──────── Maximum (1.0 amplitude)
  -3 dB ──────── 0.708 amplitude (our target)
  -6 dB ──────── 0.5 amplitude
 -12 dB ──────── 0.25 amplitude
 -20 dB ──────── 0.1 amplitude
 -40 dB ──────── 0.01 amplitude
 -60 dB ──────── 0.001 amplitude (nearly silent)
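A quick helper reproduces the table above (the same dB-to-amplitude conversion used in normalize_audio()):
# Convert dB (relative to full scale) to linear amplitude: 10^(dB/20)
def db_to_amplitude(db: float) -> float:
    return 10 ** (db / 20)

for db in (0, -3, -6, -12, -20, -40, -60):
    print(f"{db:>4} dB -> {db_to_amplitude(db):.3f}")
# 0 dB -> 1.000, -3 dB -> 0.708, -6 dB -> 0.501, -20 dB -> 0.100, ...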
9.5 Silence and Pauses
def create_silence(duration_ms: int, sample_rate: int) -> np.ndarray:
"""
Create a silent audio segment.
Silence = array of zeros
"""
num_samples = int(sample_rate * duration_ms / 1000)
return np.zeros(num_samples, dtype=np.float32)
# 300ms pause at 24kHz
pause = create_silence(300, 24000)
# Result: array of 7200 zeros
10. Gradio Interface Explained
10.1 Component Types Used
# TEXT INPUT
text_input = gr.Textbox(
label="π Text to Synthesize",
placeholder="Enter your text here...",
lines=6, # Height in lines
max_lines=15, # Max expandable height
info="Maximum 5000 characters" # Helper text
)
# DROPDOWN SELECTION
voice_dropdown = gr.Dropdown(
choices=[("Display Name", "value"), ...], # (label, value) pairs
value="af_heart", # Default selection
label="π Voice",
info="Select a voice"
)
# SLIDER
speed_slider = gr.Slider(
minimum=0.5, # Min value
maximum=2.0, # Max value
value=1.0, # Default
step=0.05, # Increment
label="π Speed"
)
# CHECKBOX
use_defaults = gr.Checkbox(
label="Use Style Preset Defaults",
value=True, # Default checked
info="When checked, style preset values override manual controls"
)
# AUDIO OUTPUT
audio_output = gr.Audio(
label="π Generated Audio",
type="numpy", # Expects (sample_rate, numpy_array)
interactive=False, # User can't upload
autoplay=True # Auto-play when generated
)
# MARKDOWN
gr.Markdown("# Title") # Rendered as HTML
10.2 Layout System
with gr.Blocks() as demo:
# Row: Horizontal layout
with gr.Row():
component1 # Left
component2 # Right
# Column: Vertical layout
with gr.Column(scale=1): # scale controls relative width
component3 # Top
component4 # Bottom
# Accordion: Collapsible section
with gr.Accordion("Advanced Options", open=False):
component5
component6
# Tabs: Tabbed interface
with gr.Tab("Tab 1"):
component7
with gr.Tab("Tab 2"):
component8
10.3 Event Handling
# Button click event
generate_btn.click(
fn=generate_speech, # Function to call
inputs=[text, voice, speed], # Input components
outputs=[audio_output] # Output components
)
# Dropdown change event
style_dropdown.change(
fn=update_style_info,
inputs=[style_dropdown],
outputs=[info_markdown]
)
# Chained events (update multiple things)
style_dropdown.change(
fn=update_controls,
inputs=[style_dropdown, use_defaults],
outputs=[speed_slider, pitch_slider, pause_slider]
)
10.4 CSS Customization
with gr.Blocks(
css="""
/* Custom class styling */
.main-title {
text-align: center;
margin-bottom: 1rem;
}
/* Use CSS variables for theme compatibility */
.info-box {
border: 1px solid var(--border-color-primary);
color: var(--body-text-color);
}
"""
) as demo:
gr.Markdown("...", elem_classes=["main-title"])
11. Deployment on Hugging Face Spaces
11.1 What is Hugging Face Spaces?
Hugging Face Spaces is a free hosting platform for ML demos:
- Free CPU instances (2 vCPU, 16GB RAM)
- GPU instances available (paid)
- Git-based deployment
- Automatic dependency installation
11.2 Configuration via README.md
The YAML frontmatter in README.md configures your Space:
---
# Required
title: Kokoro TTS # Display name
sdk: gradio # Framework (gradio/streamlit/static)
app_file: app.py # Entry point
# Recommended
sdk_version: 5.50.0 # Pin Gradio version
python_version: "3.10" # Pin Python version (CRITICAL for Kokoro)
# Optional
emoji: 🎙️ # Favicon
colorFrom: blue # Gradient start
colorTo: purple # Gradient end
pinned: false # Pin to profile?
license: apache-2.0 # License
suggested_hardware: cpu-basic # Default hardware
short_description: TTS with 28 voices
tags:
- text-to-speech
- tts
---
11.3 Build Process
When you push to the Space repository:
1. HF detects changes
       │
       ▼
2. Reads README.md frontmatter
       │
       ▼
3. Creates Docker container
       │
       ├──▶ Base image: python:3.10
       │
       ├──▶ Installs packages.txt (apt-get)
       │        └── espeak-ng, ffmpeg, libsndfile1
       │
       └──▶ Installs requirements.txt (pip)
                └── kokoro, gradio, torch, etc.
       │
       ▼
4. Runs: python app.py
       │
       ▼
5. Exposes port 7860
       │
       ▼
6. Space is live!
11.4 Common Deployment Issues
| Issue | Cause | Solution |
|---|---|---|
| Build fails on packages.txt | Comments in file | Remove all comments |
| Kokoro not found | Python 3.13 | Set python_version: "3.10" |
| Gradio API error | Gradio 6.x breaking changes | Pin gradio==5.50.0 |
| Out of memory | Model too large | Use CPU basic, optimize code |
| Timeout on load | Slow model download | Add loading indicator |
11.5 Monitoring Your Space
- Logs: Click "Logs" tab to see stdout/stderr
- Factory Rebuild: Settings → Factory Reboot (clears cache)
- Container Restart: Settings → Restart Space
12. Troubleshooting & Common Issues
12.1 "Kokoro not found" Error
ERROR: Could not find a version that satisfies the requirement kokoro>=0.9.4
Cause: Kokoro requires Python 3.10-3.12, but HF Spaces defaults to 3.13
Solution: Add to README.md frontmatter:
python_version: "3.10"
12.2 Gradio "unexpected keyword argument" Error
TypeError: Blocks.launch() got an unexpected keyword argument 'show_api'
Cause: Gradio 6.x removed/moved several parameters
Solution: Pin Gradio version:
# requirements.txt
gradio==5.50.0
12.3 "Unable to locate package" Error
E: Unable to locate package # System dependencies
Cause: Comments in packages.txt
Solution: Remove ALL comments so packages.txt contains only the package names:
espeak-ng
ffmpeg
libsndfile1
12.4 Audio Not Playing
Possible causes:
- Audio array is empty (check generation succeeded)
- Wrong return format (must be (sample_rate, array))
- Sample rate mismatch
Debug:
print(f"Audio shape: {audio.shape}")
print(f"Audio range: [{audio.min()}, {audio.max()}]")
print(f"Sample rate: {sample_rate}")
12.5 Model Loading Timeout
Cause: First run downloads ~350MB model
Solution: Add loading indicator or pre-cache:
print("Loading Kokoro TTS Engine...") # Shows in logs
pipeline = KPipeline(lang_code='a') # Downloads model
print("Ready!")
13. Further Learning Resources
Official Documentation
| Resource | URL |
|---|---|
| Kokoro Model Card | https://huggingface.co/hexgrad/Kokoro-82M |
| Misaki G2P | https://github.com/hexgrad/misaki |
| Gradio Docs | https://gradio.app/docs |
| HF Spaces Docs | https://huggingface.co/docs/hub/spaces |
Research Papers
| Paper | Topic |
|---|---|
| StyleTTS 2 | Core architecture |
| ISTFTNet | Vocoder |
| G2P Blog | Grapheme-to-Phoneme |
Video Tutorials
Search for:
- "Kokoro TTS tutorial"
- "Gradio machine learning app"
- "Hugging Face Spaces deployment"
Related Projects
| Project | Description |
|---|---|
| Coqui TTS | Open-source TTS library |
| Bark | Transformer-based TTS |
| VITS | Fast end-to-end TTS |
| Piper | Lightweight local TTS |
14. Glossary of Terms
| Term | Definition |
|---|---|
| TTS | Text-to-Speech: Converting written text to spoken audio |
| G2P | Grapheme-to-Phoneme: Converting letters to sounds |
| Grapheme | Written unit (letter or character) |
| Phoneme | Sound unit in a language |
| IPA | International Phonetic Alphabet: Standard phoneme notation |
| Vocoder | Voice encoder: Converts features to audio waveform |
| Mel-spectrogram | Visual representation of audio frequencies over time |
| STFT | Short-Time Fourier Transform: Converts audio to spectrogram |
| iSTFT | Inverse STFT: Converts spectrogram back to audio |
| Embedding | Dense vector representation (e.g., voice identity) |
| Sample Rate | Audio samples per second (Hz) |
| Bit Depth | Precision of each audio sample |
| Normalization | Adjusting audio volume to target level |
| Decibels (dB) | Logarithmic unit for audio levels |
| Transformer | Neural network architecture using attention |
| Inference | Running a trained model to make predictions |
| Latent Space | Compressed representation learned by a model |
| Fine-tuning | Adapting a pre-trained model to new data |
| API | Application Programming Interface |
| SDK | Software Development Kit |
Conclusion
Congratulations on completing this comprehensive guide! You now understand:
✅ How modern TTS systems work (from text to audio)
✅ The Kokoro-82M architecture (StyleTTS2 + ISTFTNet)
✅ The complete pipeline (G2P → Synthesis → Post-processing)
✅ Every file in the project and its purpose
✅ Audio processing fundamentals (sample rate, normalization, pitch shifting)
✅ Gradio interface development
✅ Hugging Face Spaces deployment
Next Steps
- Experiment: Try different voices and parameters
- Extend: Add new features (voice mixing, SSML support)
- Optimize: Profile and improve performance
- Learn: Dive into the StyleTTS2 paper
- Build: Create your own TTS applications!
Document created by Yash Chowdhary
For the Kokoro TTS Academic Project