
📚 Kokoro TTS: Complete Technical Guide & Learning Documentation

Created by: Yash Chowdhary
Document Version: 1.0
Last Updated: February 2026


Table of Contents

  1. Introduction
  2. Project Architecture Overview
  3. Understanding Text-to-Speech (TTS)
  4. The Kokoro-82M Model Deep Dive
  5. File-by-File Breakdown
  6. Dependencies & Libraries Explained
  7. The TTS Pipeline: Step-by-Step
  8. Code Walkthrough
  9. Audio Processing Concepts
  10. Gradio Interface Explained
  11. Deployment on Hugging Face Spaces
  12. Troubleshooting & Common Issues
  13. Further Learning Resources
  14. Glossary of Terms

1. Introduction

What is This Project?

This is an academic Text-to-Speech (TTS) application that converts written text into natural-sounding human speech. It's built using:

  • Kokoro-82M: A state-of-the-art, lightweight TTS model
  • Gradio: A Python library for building web interfaces
  • Hugging Face Spaces: Free cloud hosting for ML applications

Why Kokoro?

| Feature | Kokoro-82M | Traditional Large Models |
|---|---|---|
| Parameters | 82 million | 1-3 billion |
| Model Size | ~330 MB | 5-15 GB |
| Quality | Near state-of-the-art | State-of-the-art |
| Speed (CPU) | 3-11× real-time | 0.1-0.5× real-time |
| License | Apache 2.0 (Free) | Often proprietary |

Kokoro proves that smaller models can achieve remarkable quality when properly designed.

Project Goals

  1. Learn how modern TTS systems work
  2. Understand the complete pipeline from text to audio
  3. Build a functional, deployable application
  4. Demonstrate practical ML engineering skills

2. Project Architecture Overview

High-Level System Diagram

┌────────────────────────────────────────────────────────────┐
│                  USER INTERFACE (Gradio)                   │
│                                                            │
│   Text Input │ Voice Select │ Style Preset │ Advanced      │
│              │  (28 voices) │  (7 styles)  │ Speed/Pitch/  │
│              │              │              │ Pause         │
└──────────────────────────────┬─────────────────────────────┘
                               │ User clicks "Generate"
                               ▼
┌────────────────────────────────────────────────────────────┐
│                     TEXT PREPROCESSING                     │
│   1. Clean whitespace and normalize                        │
│   2. Expand abbreviations (Dr. → Doctor, etc.)             │
│   3. Enforce character limits (max 5000 chars)             │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                KOKORO TTS ENGINE (KPipeline)               │
│   STAGE 1: Grapheme-to-Phoneme (G2P) via Misaki            │
│     "Hello world" → "həlˈO wˈɜɹld"                         │
│   STAGE 2: Voice Pack Loading                              │
│     Load speaker embedding (e.g., af_heart.pt → 523 KB)    │
│   STAGE 3: Neural Audio Synthesis                          │
│     StyleTTS2 decoder + ISTFTNet vocoder → waveform        │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                   AUDIO POST-PROCESSING                    │
│   1. Combine audio segments                                │
│   2. Insert pauses between sentences                       │
│   3. Apply pitch shift (if requested)                      │
│   4. Normalize volume to -3 dB peak                        │
└──────────────────────────────┬─────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────┐
│                        AUDIO OUTPUT                        │
│   Format: 32-bit float WAV @ 24,000 Hz sample rate         │
│   Playback in browser + download capability                │
└────────────────────────────────────────────────────────────┘

Data Flow Summary

Text (string)
    ↓
Phonemes (IPA symbols)
    ↓
Token IDs (integers)
    ↓
Neural Network Processing
    ↓
Audio Waveform (numpy array)
    ↓
Post-processed Audio (normalized, with pauses)
    ↓
Playable Audio File

3. Understanding Text-to-Speech (TTS)

What is TTS?

Text-to-Speech is the technology that converts written text into spoken audio. Modern TTS systems use deep learning to produce remarkably natural-sounding speech.

The Evolution of TTS

| Generation | Era | Technology | Example |
|---|---|---|---|
| 1st | 1960s-1980s | Rule-based synthesis | DECtalk |
| 2nd | 1990s-2000s | Concatenative (splice recordings) | AT&T Natural Voices |
| 3rd | 2010s | Statistical parametric (HMM) | Festival |
| 4th | 2016+ | Neural networks (deep learning) | Tacotron, WaveNet |
| 5th | 2023+ | Transformer-based | Kokoro, XTTS, Bark |

Key Concepts in Modern TTS

3.1 Graphemes vs Phonemes

Graphemes are the written letters/characters:

"Hello" = H + e + l + l + o (5 graphemes)

Phonemes are the sound units:

"Hello" = /h/ + /Ι™/ + /l/ + /oʊ/ (4 phonemes)

Why phonemes matter: English spelling is inconsistent!

  • "though", "through", "thought", "tough" β€” all different sounds for "ough"
  • The model needs consistent sound representations, not arbitrary spellings
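
You can see this directly with Misaki's G2P (a quick sketch assuming misaki[en] is installed, as in this project; the exact phoneme strings may vary by version):

from misaki import en

# Same G2P configuration used later in Section 6.2
g2p = en.G2P(trf=False, british=False, fallback=None)
for word in ["though", "through", "thought", "tough"]:
    phonemes, _ = g2p(word)
    print(f"{word!r} -> {phonemes}")  # four different vowel/consonant endings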

3.2 The TTS Pipeline (Traditional)

┌──────────┐    ┌─────────────┐    ┌───────────┐    ┌─────────┐
│   Text   │───▶│ Text        │───▶│ Acoustic  │───▶│ Vocoder │───▶ Audio
│          │    │ Analysis    │    │ Model     │    │         │
└──────────┘    └─────────────┘    └───────────┘    └─────────┘
                       │                 │               │
                       ▼                 ▼               ▼
                 - G2P conversion  - Mel spectrograms  - Waveform:
                 - Tokenization    - Duration            from spectrogram
                 - Normalization   - Pitch/prosody       to audio

3.3 Kokoro's Innovation: Decoder-Only Architecture

Traditional TTS uses a two-stage approach:

  1. Encoder: Text β†’ Hidden representation
  2. Decoder: Hidden representation β†’ Audio

Kokoro simplifies this:

  1. Decoder Only: Phonemes β†’ Audio (directly!)

This eliminates computational overhead and reduces model size.


4. The Kokoro-82M Model Deep Dive

4.1 Model Specifications

| Attribute | Value |
|---|---|
| Full Name | Kokoro-82M v1.0 |
| Parameters | 82 million |
| Architecture | StyleTTS2 + ISTFTNet (decoder-only) |
| Input | Phoneme tokens (up to 510 tokens) |
| Output | 24 kHz audio waveform |
| Voice Packs | 54 voices across 8 languages |
| Training Data | <100 hours (v0.19), a few hundred hours (v1.0) |
| Training Cost | ~$1000 total (1000 A100 GPU hours) |
| License | Apache 2.0 |

4.2 Architecture Components

StyleTTS2 (The Brain)

StyleTTS2 is the foundation architecture, published in the paper:

"StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"
(Li et al., 2023) - arXiv:2306.07691

Key innovations:

  • Style as latent variables: Speech style (emotion, prosody) is modeled as random variables
  • Adversarial training: Uses discriminators trained on real speech to improve naturalness
  • No reference audio needed: Can generate appropriate styles from text alone

ISTFTNet (The Voice)

ISTFTNet is the vocoder component, from the paper:

"iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform"
(Kaneko et al., 2022) - arXiv:2203.02395

Key innovations:

  • Direct waveform generation: Uses inverse Short-Time Fourier Transform
  • Lightweight: Much smaller than GAN-based vocoders like HiFi-GAN
  • Fast inference: Optimized for real-time synthesis

How They Work Together

┌────────────────────────────────────────────────────────────────┐
│                    KOKORO-82M ARCHITECTURE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   Phoneme Tokens ──▶ ┌───────────────────────────────────┐     │
│   (from Misaki)      │      StyleTTS2 Transformer        │     │
│                      │      ─────────────────────        │     │
│                      │  • Self-attention layers          │     │
│   Voice Embedding ──▶│  • Style conditioning             │     │
│   (speaker identity) │  • Duration prediction            │     │
│                      │  • Prosody modeling               │     │
│                      └─────────────────┬─────────────────┘     │
│                                        │                       │
│                                        ▼                       │
│                      ┌───────────────────────────────────┐     │
│                      │        ISTFTNet Vocoder           │     │
│                      │        ────────────────           │     │
│                      │  • Mel-spectrogram generation     │     │
│                      │  • Inverse STFT                   │     │
│                      │  • Waveform synthesis             │     │
│                      └─────────────────┬─────────────────┘     │
│                                        │                       │
│                                        ▼                       │
│                         Audio Waveform                         │
│                     (24 kHz, 32-bit float)                     │
│                                                                │
└────────────────────────────────────────────────────────────────┘

4.3 Voice Packs Explained

Each voice is stored as a voice embedding (also called "speaker embedding"):

  • File format: .pt (PyTorch tensor)
  • Size: ~523KB per voice
  • Content: a bank of 256-dimensional style vectors (indexed by input length) that captures speaker identity

# How voice packs work internally (illustrative)
import torch
voice_embedding = torch.load("af_heart.pt")  # shape: (512, 1, 256)
# This embedding tells the model HOW to speak, not WHAT to speak

The naming convention:

a  f  _heart
│  │  └──── voice name
│  └─────── gender (f = female, m = male)
└────────── accent (a = American, b = British)
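
As a small illustration, a helper like the following (hypothetical; not part of app.py) could decode a voice ID into its parts:

def parse_voice_id(voice_id: str) -> dict:
    """Decode a Kokoro voice ID such as 'af_heart' into its parts."""
    ACCENTS = {"a": "American", "b": "British"}
    GENDERS = {"f": "Female", "m": "Male"}
    prefix, name = voice_id.split("_", 1)
    return {
        "accent": ACCENTS.get(prefix[0], "Unknown"),
        "gender": GENDERS.get(prefix[1], "Unknown"),
        "name": name,
    }

print(parse_voice_id("af_heart"))
# {'accent': 'American', 'gender': 'Female', 'name': 'heart'}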

4.4 Why 82M Parameters is Enough

Traditional wisdom: "bigger models = better quality"

Kokoro challenges this by:

  1. Efficient architecture: Decoder-only removes encoder overhead
  2. Phoneme input: G2P preprocessing reduces model's job (doesn't need to learn spelling)
  3. Quality training data: Small but high-quality dataset beats large noisy datasets
  4. Focused scope: Optimized for TTS only, not multi-task

Comparison:

| Model | Parameters | Quality Ranking |
|---|---|---|
| Kokoro-82M | 82M | #1 (TTS Arena) |
| XTTS | 467M | #2-3 |
| MetaVoice | 1.2B | #3-4 |
| Bark | 1B+ | #4-5 |

5. File-by-File Breakdown

Project Structure

kokoro-tts-app/
├── app.py              # Main application (754 lines)
├── requirements.txt    # Python dependencies
├── packages.txt        # System dependencies
├── README.md           # Documentation + HF Space config
├── examples.py         # Standalone usage examples
└── .gitignore          # Git ignore rules

5.1 app.py – The Main Application

This is the heart of the project. Let's break it down by sections:

Section 1: Imports and Configuration (Lines 1-170)

"""
Kokoro TTS - Academic Text-to-Speech Application
================================================
Created by: Yash Chowdhary
"""

import gradio as gr      # Web interface
import numpy as np       # Numerical operations
import soundfile as sf   # Audio file I/O
import re                # Regular expressions
from typing import Optional, Tuple  # Type hints
from dataclasses import dataclass   # Data structures
from kokoro import KPipeline        # The TTS engine

What each import does:

| Import | Purpose |
|---|---|
| gradio | Creates the web UI (buttons, sliders, audio player) |
| numpy | Handles audio as numerical arrays |
| soundfile | Reads/writes audio files |
| re | Pattern matching for text preprocessing |
| typing | Adds type hints for better code documentation |
| dataclasses | Creates clean data structures (like StylePreset) |
| kokoro.KPipeline | The actual TTS engine |

Section 2: Voice Catalog (Lines 38-74)

VOICE_CATALOG = {
    # voice_id -> (display_name, gender, accent, quality_grade, description)
    "af_heart": ("Heart ❀️", "Female", "American", "A", "Premium quality, warm and natural"),
    "af_bella": ("Bella πŸ”₯", "Female", "American", "A-", "Clear and expressive"),
    # ... 26 more voices
}

This dictionary maps voice IDs to their metadata. The quality grades (A, B, C, D) are from the official Kokoro documentation and reflect training data quality.
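
The dropdown choices shown in the UI can be derived from this catalog. The sketch below (a plausible helper, not necessarily app.py's exact code) builds the (label, value) pairs a Gradio dropdown expects:

def voice_choices(catalog: dict) -> list:
    """Build (display label, voice_id) pairs for a gr.Dropdown."""
    choices = []
    for voice_id, (display, gender, accent, grade, _desc) in catalog.items():
        label = f"{display} ({accent} {gender}, grade {grade})"
        choices.append((label, voice_id))
    return choices

voice_dropdown_choices = voice_choices(VOICE_CATALOG)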

Section 3: Style Presets (Lines 77-145)

@dataclass
class StylePreset:
    """Defines a style preset with associated audio parameters."""
    name: str
    description: str
    speed: float           # 0.5 to 2.0
    pitch_shift: float     # semitones (-5 to +5)
    pause_multiplier: float
    recommended_voices: list

STYLE_PRESETS = {
    "dramatic": StylePreset(
        name="Dramatic / Horror",
        description="Slower, deeper voice for suspenseful content",
        speed=0.85,
        pitch_shift=-2,       # Lower pitch = deeper voice
        pause_multiplier=1.5, # Longer pauses = more tension
        recommended_voices=["am_fenrir", "am_onyx", "bm_george"]
    ),
    # ... more presets
}

Why use @dataclass?

  • Automatically generates __init__, __repr__, and other methods
  • Cleaner than regular classes for data containers
  • Type hints document each field (note: they are not enforced at runtime)
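
Looking up and using a preset is then a one-liner (values taken from the "dramatic" preset defined above):

preset = STYLE_PRESETS["dramatic"]
print(preset.name, preset.speed, preset.pitch_shift, preset.pause_multiplier)
# Dramatic / Horror 0.85 -2 1.5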

Section 4: Audio Processing Functions (Lines 150-275)

These are utility functions for manipulating audio:

pitch_shift_audio() – Changes the pitch without changing speed

def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
    """
    Shift pitch using resampling technique.
    
    How it works:
    1. To raise pitch: Speed up audio, then slow it back down
    2. To lower pitch: Slow down audio, then speed it back up
    
    The math: factor = 2^(semitones/12)
    - 12 semitones = 1 octave = 2x frequency
    - 1 semitone ≈ 1.059x frequency
    """
    factor = 2 ** (semitones / 12)
    # ... resampling logic

insert_pauses() – Adds silence between segments

def insert_pauses(audio_segments: list, pause_duration_ms: int, sample_rate: int):
    """
    Insert silence between audio segments.
    
    pause_duration_ms=300 at 24000Hz = 7200 samples of zeros
    """
    pause_samples = int(sample_rate * pause_duration_ms / 1000)
    silence = np.zeros(pause_samples, dtype=np.float32)
    # ... concatenation logic
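
The elided concatenation logic could look like this sketch (a plausible reconstruction, not necessarily the exact code in app.py):

import numpy as np

def insert_pauses(audio_segments: list, pause_duration_ms: int, sample_rate: int) -> np.ndarray:
    """Join segments with a fixed-length silence between them."""
    pause_samples = int(sample_rate * pause_duration_ms / 1000)
    silence = np.zeros(pause_samples, dtype=np.float32)
    pieces = []
    for i, segment in enumerate(audio_segments):
        if i > 0:
            pieces.append(silence)  # silence before every segment after the first
        pieces.append(segment)
    return np.concatenate(pieces)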

normalize_audio() – Ensures consistent volume

def normalize_audio(audio: np.ndarray, target_db: float = -3.0):
    """
    Normalize to target dB level.
    
    Why -3dB? Leaves headroom to prevent clipping while
    maintaining good volume.
    
    Formula: gain = 10^(target_db/20) / peak_amplitude
    """
    ...  # full implementation shown in Section 9.4
Section 5: TTS Engine Class (Lines 280-410)

class KokoroTTSEngine:
    """
    Wrapper class for Kokoro TTS with additional processing capabilities.
    """
    
    def __init__(self):
        # Initialize pipelines for both accents
        self.pipelines = {
            'a': KPipeline(lang_code='a'),  # American English
            'b': KPipeline(lang_code='b'),  # British English
        }
        
        # Add custom pronunciation for "Kokoro"
        self.pipelines['a'].g2p.lexicon.golds['kokoro'] = 'kˈOkəɹO'

Why two pipelines?

  • American and British English have different phoneme sets
  • Different pronunciations: "schedule" = /ˈskedʒuːl/ (US) vs /ˈʃedjuːl/ (UK)
  • The voice ID's first letter determines which pipeline to use

Section 6: Gradio Interface (Lines 550-754)

with gr.Blocks(
    title="Kokoro TTS - Academic Text-to-Speech",
    theme=gr.themes.Soft(),
    css="..."  # Custom styling
) as demo:
    
    # Header
    gr.Markdown("# πŸŽ™οΈ Kokoro TTS...")
    
    # Input controls
    text_input = gr.Textbox(label="Text to Synthesize")
    voice_dropdown = gr.Dropdown(choices=..., label="Voice")
    
    # Output
    audio_output = gr.Audio(label="Generated Audio")
    
    # Event handler
    generate_btn.click(
        fn=generate_speech,
        inputs=[text_input, voice_dropdown, ...],
        outputs=[audio_output]
    )

5.2 requirements.txt – Python Dependencies

kokoro>=0.9.4          # The TTS model and pipeline
soundfile>=0.12.1      # Audio file reading/writing
numpy>=1.24.0          # Numerical array operations
gradio==5.50.0         # Web interface framework
torch>=2.0.0           # Deep learning framework
torchaudio>=2.0.0      # Audio processing for PyTorch
misaki[en]>=0.9.0      # Grapheme-to-Phoneme conversion

Why these specific versions?

  • gradio==5.50.0: Pinned to avoid breaking changes in Gradio 6.x
  • kokoro>=0.9.4: Requires Python 3.10-3.12 (not 3.13!)
  • misaki[en]: The [en] installs English language support

5.3 packages.txt – System Dependencies

espeak-ng      # Fallback phonemizer for out-of-vocabulary words
ffmpeg         # Audio encoding/decoding
libsndfile1    # C library for reading/writing audio files

Important: the actual file must contain ONLY the package names, one per line. The inline comments shown above are for explanation here and must NOT appear in packages.txt!

5.4 README.md – Documentation + Configuration

The README serves two purposes:

  1. YAML Frontmatter (lines 1-19): Configures Hugging Face Spaces
---
title: Kokoro TTS - Academic Text-to-Speech
emoji: 🎙️
sdk: gradio
sdk_version: 5.50.0
python_version: "3.10"    # Critical! Kokoro needs Python <3.13
suggested_hardware: cpu-basic
---
  2. Documentation (rest of file): User guide and technical details

6. Dependencies & Libraries Explained

6.1 Kokoro (kokoro package)

What it is: The main TTS library that wraps the Kokoro-82M model.

Key classes:

from kokoro import KPipeline, KModel

# KPipeline: High-level interface (recommended)
pipeline = KPipeline(lang_code='a')  # 'a'=American, 'b'=British

# Generate speech
text = "Hello, world!"
for graphemes, phonemes, audio in pipeline(text, voice='af_heart', speed=1.0):
    print(graphemes)   # original text chunk
    print(phonemes)    # IPA representation
    # audio: the audio samples (torch tensor or numpy array)

# KModel: Low-level model access
model = KModel().to('cpu').eval()
audio = model(phoneme_tokens, voice_embedding, speed)

Internal workflow:

KPipeline
    │
    ├─▶ Misaki G2P ──▶ Text to phonemes
    │
    ├─▶ Voice Loader ──▶ Load speaker embedding
    │
    └─▶ KModel ──▶ Generate audio

6.2 Misaki (misaki package)

What it is: Grapheme-to-Phoneme (G2P) library designed for Kokoro.

How it works:

from misaki import en

g2p = en.G2P(trf=False, british=False, fallback=None)
text = "Hello, world!"
phonemes, tokens = g2p(text)
# phonemes: "həlˈO, wˈɜɹld!"
# tokens: list of token IDs for the model

G2P Strategy (Hybrid approach):

Input Text
    │
    ▼
┌─────────────────────────────────────┐
│  1. Dictionary Lookup (Gold/Silver) │ ◄── Known words
└─────────────────────────────────────┘
    │ Unknown word?
    ▼
┌─────────────────────────────────────┐
│  2. Rule-based Fallback             │ ◄── espeak-ng
└─────────────────────────────────────┘
    │ Still unknown?
    ▼
┌─────────────────────────────────────┐
│  3. Neural Network Fallback         │ ◄── BART-based model
└─────────────────────────────────────┘
    │
    ▼
Phoneme Output

Custom pronunciations:

# Use Markdown-style syntax for custom pronunciation
text = "[Kokoro](/kˈOkΙ™ΙΉO/) is a TTS model."
# The /slashes/ contain IPA phonemes

Phoneme inventory:

Misaki uses 49 phonemes for English:

  • 41 shared between US and UK
  • 4 American-only (like the "r" in "car")
  • 4 British-only (like the "ɒ" in "lot")

6.3 PyTorch (torch package)

What it is: The deep learning framework that runs the neural network.

Role in this project:

  • Loads model weights (.pth files)
  • Runs forward passes through the network
  • Handles tensor operations on CPU/GPU
import torch

# Model inference
with torch.no_grad():  # Disable gradient computation (faster)
    audio_tensor = model(phonemes, voice_embedding, speed)
    audio_numpy = audio_tensor.numpy()  # Convert to numpy for playback

6.4 Gradio (gradio package)

What it is: A Python library for building web interfaces for ML models.

Key concepts:

import gradio as gr

# Components (UI elements)
text_input = gr.Textbox(label="Input")
slider = gr.Slider(minimum=0, maximum=1)
audio_output = gr.Audio()

# Blocks (layout container)
with gr.Blocks() as demo:
    with gr.Row():      # Horizontal layout
        with gr.Column():  # Vertical layout
            # Components here
    
    # Event handlers
    button.click(
        fn=my_function,     # Python function to call
        inputs=[text_input],  # Input components
        outputs=[audio_output]  # Output components
    )

# Launch
demo.launch()

Why Gradio?

  • No frontend code needed (HTML/CSS/JS)
  • Automatic API generation
  • Easy deployment to Hugging Face Spaces
  • Built-in audio player with download

6.5 NumPy (numpy package)

What it is: Fundamental library for numerical computing in Python.

Role in audio processing:

import numpy as np

# Audio is represented as a 1D array of floats
audio = np.array([0.0, 0.1, 0.2, -0.1, ...], dtype=np.float32)

# Sample rate: 24000 Hz means 24000 samples = 1 second
# 1 minute of audio = 24000 * 60 = 1,440,000 samples

# Creating silence
silence = np.zeros(24000, dtype=np.float32)  # 1 second of silence

# Concatenating audio
combined = np.concatenate([audio1, silence, audio2])

# Normalization
peak = np.max(np.abs(audio))
normalized = audio / peak * 0.9  # Scale to 90% of max

6.6 SoundFile (soundfile package)

What it is: Library for reading and writing audio files.

import soundfile as sf

# Write audio to file
sf.write('output.wav', audio_array, samplerate=24000)

# Read audio from file
audio, samplerate = sf.read('input.wav')

Supported formats: WAV, FLAC, OGG, and more.


7. The TTS Pipeline: Step-by-Step

Let's trace what happens when you click "Generate Speech":

Step 1: Text Input Received

text = "Hello, world! This is Kokoro speaking."

Step 2: Text Preprocessing

def preprocess_text(text):
    # Clean whitespace
    text = re.sub(r'\s+', ' ', text.strip())
    
    # Expand abbreviations
    text = text.replace("Dr.", "Doctor")
    text = text.replace("Mr.", "Mister")
    # etc.
    
    return text

# Result: "Hello, world! This is Kokoro speaking."

Step 3: Pipeline Selection

voice = "af_heart"  # American female
lang_code = voice[0]  # 'a' = American

pipeline = pipelines['a']  # American English pipeline

Step 4: Grapheme-to-Phoneme Conversion

# Inside the pipeline:
phonemes = pipeline.g2p("Hello, world!")
# Result: "həlˈO, wˈɜɹld!"

What happens inside G2P:

"Hello" 
    β†’ Lookup in dictionary
    β†’ Found: "hΙ™lˈO"

"world"
    β†’ Lookup in dictionary  
    β†’ Found: "wˈɜɹld"

Punctuation preserved: ", !"

Step 5: Tokenization

# Phonemes converted to token IDs
tokens = [50, 157, 43, 135, ...]  # Integer IDs

# Each phoneme has a unique ID (defined in config.json)
# Maximum context: 510 tokens

Step 6: Voice Embedding Loading

# Load the voice pack
voice_pack = pipeline.load_voice("af_heart")
# Result: Tensor of shape (512, 1, 256)

# Select embedding based on token count
ref_s = voice_pack[len(tokens) - 1]

Step 7: Neural Network Forward Pass

# The actual synthesis
audio = model(
    tokens,        # What to say (phoneme IDs)
    ref_s,         # How to say it (voice embedding)
    speed          # How fast (1.0 = normal)
)
# Result: Tensor of audio samples

Inside the model:

Tokens + Voice Embedding
         │
         ▼
┌─────────────────────┐
│ Transformer Layers  │  ← Self-attention, style modeling
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ Duration Predictor  │  ← How long each sound lasts
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ Mel-Spectrogram     │  ← Intermediate representation
└─────────────────────┘
         │
         ▼
┌─────────────────────┐
│ ISTFTNet Vocoder    │  ← Convert to waveform
└─────────────────────┘
         │
         ▼
    Audio Waveform

Step 8: Audio Post-Processing

# Combine segments
audio_segments = [seg1, seg2, seg3]
combined = insert_pauses(audio_segments, pause_ms=300, sample_rate=24000)

# Apply pitch shift if requested
if pitch_shift != 0:
    combined = pitch_shift_audio(combined, 24000, pitch_shift)

# Normalize volume
combined = normalize_audio(combined, target_db=-3.0)

Step 9: Output

return (24000, combined)  # (sample_rate, audio_array)
# Gradio displays this in the Audio component
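
Putting the nine steps together, a minimal standalone script (a sketch using only the public KPipeline interface, independent of app.py) looks like this:

import numpy as np
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')   # American English

segments = []
for _, phonemes, audio in pipeline("Hello, world! This is Kokoro speaking.",
                                   voice='af_heart', speed=1.0):
    if audio is not None:
        # Convert a torch tensor to numpy if necessary
        segments.append(audio.numpy() if hasattr(audio, 'numpy') else audio)

audio = np.concatenate(segments)
sf.write('hello.wav', audio, samplerate=24000)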

8. Code Walkthrough

8.1 The generate_speech() Function

This is the main callback function that Gradio calls:

def generate_speech(
    text: str,           # User's input text
    voice: str,          # Voice ID like "af_heart"
    style: str,          # Style preset like "dramatic"
    speed: float,        # Speed multiplier
    pitch: float,        # Pitch shift in semitones
    pause: int,          # Pause duration in ms
    use_style_defaults: bool,  # Use preset values?
) -> Tuple[int, np.ndarray]:
    """
    Main generation function for Gradio interface.
    
    Returns:
        Tuple of (sample_rate, audio_array) for Gradio Audio component
    """
    
    # Validation
    if not text.strip():
        gr.Warning("Please enter some text to synthesize.")
        return None
    
    try:
        if use_style_defaults:
            # Use style preset parameters
            sample_rate, audio = tts_engine.generate_with_style(
                text=text,
                voice=voice,
                style_preset=style,
            )
        else:
            # Use manual control parameters
            sample_rate, audio = tts_engine.generate(
                text=text,
                voice=voice,
                speed=speed,
                pitch_shift=pitch,
                pause_between_sentences_ms=pause,
            )
        
        return (sample_rate, audio)
    
    except Exception as e:
        # gr.Error must be raised for Gradio to display it in the UI
        raise gr.Error(f"Generation failed: {str(e)}")

8.2 The KokoroTTSEngine.generate() Method

def generate(
    self,
    text: str,
    voice: str = "af_heart",
    speed: float = 1.0,
    pitch_shift: float = 0.0,
    pause_between_sentences_ms: int = 300,
) -> Tuple[int, np.ndarray]:
    """
    Generate speech from text with full parameter control.
    """
    
    # 1. Preprocess and validate
    text = preprocess_text(text.strip()[:MAX_CHAR_LIMIT])
    if not text:
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    # 2. Clamp parameters to valid ranges
    speed = max(0.5, min(2.0, speed))
    pitch_shift = max(-5, min(5, pitch_shift))
    
    # 3. Select pipeline based on voice accent
    lang_code = voice[0] if voice[0] in self.pipelines else 'a'
    pipeline = self.pipelines[lang_code]
    
    # 4. Generate audio segments
    audio_segments = []
    try:
        # The pipeline yields (graphemes, phonemes, audio) tuples
        for _, phonemes, audio in pipeline(text, voice=voice, speed=speed):
            if audio is not None:
                # Convert PyTorch tensor to numpy if needed
                audio_np = audio.numpy() if hasattr(audio, 'numpy') else audio
                audio_segments.append(audio_np)
    except Exception as e:
        print(f"Generation error: {e}")
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    if not audio_segments:
        return SAMPLE_RATE, np.zeros(1, dtype=np.float32)
    
    # 5. Post-process: combine with pauses
    combined_audio = insert_pauses(
        audio_segments, 
        pause_between_sentences_ms, 
        SAMPLE_RATE
    )
    
    # 6. Post-process: apply pitch shift
    if pitch_shift != 0:
        combined_audio = pitch_shift_audio(
            combined_audio, 
            SAMPLE_RATE, 
            pitch_shift
        )
    
    # 7. Post-process: normalize volume
    combined_audio = normalize_audio(combined_audio)
    
    return SAMPLE_RATE, combined_audio
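
A typical call against this wrapper (a usage sketch; KokoroTTSEngine is the class defined above in app.py):

import soundfile as sf

engine = KokoroTTSEngine()
sr, audio = engine.generate(
    "Testing, one two three.",
    voice="af_heart",
    speed=1.1,
    pitch_shift=-1.0,
    pause_between_sentences_ms=250,
)
sf.write("test.wav", audio, samplerate=sr)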

8.3 Pitch Shifting Explained

def pitch_shift_audio(audio: np.ndarray, sample_rate: int, semitones: float) -> np.ndarray:
    """
    Shift the pitch of audio by a given number of semitones.
    
    THEORY:
    -------
    Pitch and time are linked. If you play audio faster:
    - Duration decreases
    - Pitch increases (chipmunk effect)
    
    To change pitch WITHOUT changing duration:
    1. Resample to change pitch (also changes duration)
    2. Resample again to restore original duration
    
    MATH:
    -----
    - 1 octave = 12 semitones = 2x frequency
    - factor = 2^(semitones/12)
    - +1 semitone = 2^(1/12) ≈ 1.059x frequency
    - -1 semitone = 2^(-1/12) ≈ 0.944x frequency
    """
    
    if semitones == 0:
        return audio  # No change needed
    
    # Calculate the pitch shift factor
    factor = 2 ** (semitones / 12)
    
    original_length = len(audio)
    
    # Step 1: Resample to shift pitch
    # If factor > 1 (raise pitch), we resample to FEWER samples
    # If factor < 1 (lower pitch), we resample to MORE samples
    new_length = int(original_length / factor)
    
    # Create indices for interpolation
    indices = np.linspace(0, original_length - 1, new_length)
    
    # Linear interpolation (simple but effective)
    shifted = np.interp(indices, np.arange(original_length), audio)
    
    # Step 2: Resample back to original length
    # This restores the original duration
    final_indices = np.linspace(0, len(shifted) - 1, original_length)
    result = np.interp(final_indices, np.arange(len(shifted)), shifted)
    
    return result.astype(np.float32)

Visual explanation:

Original audio (1 second, 24,000 samples):
[████████████████████████████████████████████████]

Raise pitch by 2 semitones (factor ≈ 1.122):
Step 1 - Resample down to ~21,381 samples (shorter, higher-pitched):
[██████████████████████████████████████████]

Step 2 - Interpolate back up to 24,000 samples:
[████████████████████████████████████████████████]
 └── Same duration, higher pitch!

Lower pitch by 2 semitones (factor ≈ 0.891):
Step 1 - Resample up to ~26,939 samples (longer, lower-pitched):
[██████████████████████████████████████████████████████]

Step 2 - Interpolate back down to 24,000 samples:
[████████████████████████████████████████████████]
 └── Same duration, lower pitch!
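
Note that double linear resampling is a deliberately simple, dependency-free approximation and loses some high-frequency detail. If an extra dependency is acceptable (librosa is not used by this project), a phase-vocoder pitch shifter gives cleaner results:

import librosa

# Raise pitch by 2 semitones while preserving duration (n_steps is in semitones)
shifted = librosa.effects.pitch_shift(audio, sr=24000, n_steps=2.0)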

9. Audio Processing Concepts

9.1 Digital Audio Basics

What is digital audio?

Sound is a continuous wave of air pressure. Digital audio represents this as discrete samples taken at regular intervals.

Analog Sound Wave:
    β•±β•²    β•±β•²    β•±β•²
   β•±  β•²  β•±  β•²  β•±  β•²
──╱────╲╱────╲╱────╲──
  
Digital Samples (at regular intervals):
  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚  β”‚
  ●  ●     ●     ●     ●     ●
     ●        ●        ●
        ●        ●        ●

Key parameters:

| Parameter | Definition | Kokoro Value |
|---|---|---|
| Sample Rate | Samples per second | 24,000 Hz |
| Bit Depth | Precision per sample | 32-bit float |
| Channels | Mono vs stereo | Mono (1 channel) |

Nyquist Theorem:

  • To capture frequency F, sample at ≥2F
  • Human hearing: up to ~20kHz
  • 24kHz sample rate captures up to 12kHz (adequate for speech)

9.2 Audio as NumPy Arrays

import numpy as np

# Audio is a 1D array of floats
audio = np.array([0.0, 0.1, 0.15, 0.1, 0.0, -0.1, -0.15, -0.1, 0.0, ...])

# Value range: -1.0 to +1.0 (normalized)
# 0.0 = silence
# +1.0 = maximum positive pressure
# -1.0 = maximum negative pressure

# Duration calculation
sample_rate = 24000  # Hz
num_samples = len(audio)
duration_seconds = num_samples / sample_rate

# Example: 48000 samples at 24kHz = 2 seconds

9.3 Sample Rate Explained

Sample Rate = 24,000 Hz means:
- 24,000 measurements per second
- Each sample is 1/24000 = 0.0000417 seconds apart

Timeline:
0s              1s              2s
|───────────────|───────────────|
24000 samples   24000 samples   ...

Higher sample rate:
+ Better frequency reproduction
- Larger file size
- More processing required

Lower sample rate:
+ Smaller files
+ Faster processing
- Possible quality loss

9.4 Audio Normalization

def normalize_audio(audio: np.ndarray, target_db: float = -3.0) -> np.ndarray:
    """
    Why normalize?
    1. Consistent volume across different generations
    2. Prevent clipping (distortion when values exceed ±1.0)
    3. Optimal playback volume
    
    Why -3dB (not 0dB)?
    - Leaves "headroom" for peaks
    - Prevents distortion on some playback systems
    - Industry standard practice
    """
    
    # Find the peak (maximum absolute value)
    peak = np.max(np.abs(audio))
    
    if peak == 0:
        return audio  # Silent audio, nothing to normalize
    
    # Convert dB to amplitude
    # dB = 20 * log10(amplitude)
    # amplitude = 10^(dB/20)
    target_amplitude = 10 ** (target_db / 20)
    # -3dB → 10^(-3/20) ≈ 0.708
    
    # Calculate required gain
    gain = target_amplitude / peak
    
    # Apply gain
    return (audio * gain).astype(np.float32)

Decibels (dB) explained:

dB Scale (relative to maximum):

 0 dB ─────── Maximum (1.0 amplitude)
-3 dB ─────── 0.708 amplitude (our target)
-6 dB ─────── 0.5 amplitude
-12 dB ────── 0.25 amplitude
-20 dB ────── 0.1 amplitude
-40 dB ────── 0.01 amplitude
-60 dB ────── 0.001 amplitude (nearly silent)
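
The two conversions used above are handy as helpers; a minimal sketch:

import math

def db_to_amplitude(db: float) -> float:
    """dB relative to full scale -> linear amplitude."""
    return 10 ** (db / 20)

def amplitude_to_db(amplitude: float) -> float:
    """Linear amplitude -> dB relative to full scale."""
    return 20 * math.log10(amplitude)

print(round(db_to_amplitude(-3.0), 3))  # 0.708
print(round(amplitude_to_db(0.5), 1))   # -6.0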

9.5 Silence and Pauses

def create_silence(duration_ms: int, sample_rate: int) -> np.ndarray:
    """
    Create a silent audio segment.
    
    Silence = array of zeros
    """
    num_samples = int(sample_rate * duration_ms / 1000)
    return np.zeros(num_samples, dtype=np.float32)

# 300ms pause at 24kHz
pause = create_silence(300, 24000)
# Result: array of 7200 zeros

10. Gradio Interface Explained

10.1 Component Types Used

# TEXT INPUT
text_input = gr.Textbox(
    label="πŸ“ Text to Synthesize",
    placeholder="Enter your text here...",
    lines=6,        # Height in lines
    max_lines=15,   # Max expandable height
    info="Maximum 5000 characters"  # Helper text
)

# DROPDOWN SELECTION
voice_dropdown = gr.Dropdown(
    choices=[("Display Name", "value"), ...],  # (label, value) pairs
    value="af_heart",   # Default selection
    label="🎭 Voice",
    info="Select a voice"
)

# SLIDER
speed_slider = gr.Slider(
    minimum=0.5,    # Min value
    maximum=2.0,    # Max value
    value=1.0,      # Default
    step=0.05,      # Increment
    label="πŸƒ Speed"
)

# CHECKBOX
use_defaults = gr.Checkbox(
    label="Use Style Preset Defaults",
    value=True,     # Default checked
    info="When checked, style preset values override manual controls"
)

# AUDIO OUTPUT
audio_output = gr.Audio(
    label="πŸ”Š Generated Audio",
    type="numpy",       # Expects (sample_rate, numpy_array)
    interactive=False,  # User can't upload
    autoplay=True       # Auto-play when generated
)

# MARKDOWN
gr.Markdown("# Title")  # Rendered as HTML

10.2 Layout System

with gr.Blocks() as demo:
    
    # Row: horizontal layout
    with gr.Row():
        left = gr.Textbox(label="Left")
        right = gr.Textbox(label="Right")
    
    # Column: vertical layout (scale controls relative width)
    with gr.Column(scale=1):
        top = gr.Textbox(label="Top")
        bottom = gr.Textbox(label="Bottom")
    
    # Accordion: collapsible section
    with gr.Accordion("Advanced Options", open=False):
        gr.Slider(label="Hidden until expanded")
    
    # Tabs: tabbed interface
    with gr.Tab("Tab 1"):
        gr.Markdown("Content of tab 1")
    with gr.Tab("Tab 2"):
        gr.Markdown("Content of tab 2")

10.3 Event Handling

# Button click event
generate_btn.click(
    fn=generate_speech,           # Function to call
    inputs=[text, voice, speed],  # Input components
    outputs=[audio_output]        # Output components
)

# Dropdown change event
style_dropdown.change(
    fn=update_style_info,
    inputs=[style_dropdown],
    outputs=[info_markdown]
)

# Chained events (update multiple things)
style_dropdown.change(
    fn=update_controls,
    inputs=[style_dropdown, use_defaults],
    outputs=[speed_slider, pitch_slider, pause_slider]
)
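
A handler wired to several outputs returns one value per output component. Here is a sketch of what update_controls might look like (hypothetical body; the preset fields and the 300 ms base pause match the defaults used elsewhere in this guide):

def update_controls(style_key: str, use_defaults: bool):
    preset = STYLE_PRESETS[style_key]
    if not use_defaults:
        # gr.update() with no arguments leaves a component unchanged
        return gr.update(), gr.update(), gr.update()
    return (
        gr.update(value=preset.speed),
        gr.update(value=preset.pitch_shift),
        gr.update(value=int(300 * preset.pause_multiplier)),
    )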

10.4 CSS Customization

with gr.Blocks(
    css="""
        /* Custom class styling */
        .main-title {
            text-align: center;
            margin-bottom: 1rem;
        }
        
        /* Use CSS variables for theme compatibility */
        .info-box {
            border: 1px solid var(--border-color-primary);
            color: var(--body-text-color);
        }
    """
) as demo:
    gr.Markdown("...", elem_classes=["main-title"])

11. Deployment on Hugging Face Spaces

11.1 What is Hugging Face Spaces?

Hugging Face Spaces is a free hosting platform for ML demos:

  • Free CPU instances (2 vCPU, 16GB RAM)
  • GPU instances available (paid)
  • Git-based deployment
  • Automatic dependency installation

11.2 Configuration via README.md

The YAML frontmatter in README.md configures your Space:

---
# Required
title: Kokoro TTS                    # Display name
sdk: gradio                          # Framework (gradio/streamlit/static)
app_file: app.py                     # Entry point

# Recommended
sdk_version: 5.50.0                  # Pin Gradio version
python_version: "3.10"               # Pin Python version (CRITICAL for Kokoro)

# Optional
emoji: 🎙️                            # Favicon
colorFrom: blue                      # Gradient start
colorTo: purple                      # Gradient end
pinned: false                        # Pin to profile?
license: apache-2.0                  # License
suggested_hardware: cpu-basic        # Default hardware
short_description: TTS with 28 voices
tags:
  - text-to-speech
  - tts
---

11.3 Build Process

When you push to the Space repository:

1. HF detects changes
   │
   ▼
2. Reads README.md frontmatter
   │
   ▼
3. Creates Docker container
   │
   ├─▶ Base image: python:3.10
   │
   ├─▶ Installs packages.txt (apt-get)
   │   └── espeak-ng, ffmpeg, libsndfile1
   │
   └─▶ Installs requirements.txt (pip)
       └── kokoro, gradio, torch, etc.
   │
   ▼
4. Runs: python app.py
   │
   ▼
5. Exposes port 7860
   │
   ▼
6. Space is live!

11.4 Common Deployment Issues

| Issue | Cause | Solution |
|---|---|---|
| Build fails on packages.txt | Comments in file | Remove all comments |
| Kokoro not found | Python 3.13 | Set python_version: "3.10" |
| Gradio API error | Gradio 6.x breaking changes | Pin gradio==5.50.0 |
| Out of memory | Model too large | Use CPU basic, optimize code |
| Timeout on load | Slow model download | Add loading indicator |

11.5 Monitoring Your Space

  • Logs: Click the "Logs" tab to see stdout/stderr
  • Factory Rebuild: Settings → Factory Reboot (clears cache)
  • Container Restart: Settings → Restart Space

12. Troubleshooting & Common Issues

12.1 "Kokoro not found" Error

ERROR: Could not find a version that satisfies the requirement kokoro>=0.9.4

Cause: Kokoro requires Python 3.10-3.12, but HF Spaces defaults to 3.13

Solution: Add to README.md frontmatter:

python_version: "3.10"

12.2 Gradio "unexpected keyword argument" Error

TypeError: Blocks.launch() got an unexpected keyword argument 'show_api'

Cause: Gradio 6.x removed/moved several parameters

Solution: Pin Gradio version:

# requirements.txt
gradio==5.50.0

12.3 "Unable to locate package" Error

E: Unable to locate package # System dependencies

Cause: Comments in packages.txt

Solution: Remove ALL comments so the file contains only package names:

espeak-ng
ffmpeg
libsndfile1

12.4 Audio Not Playing

Possible causes:

  1. Audio array is empty (check generation succeeded)
  2. Wrong return format (must be (sample_rate, array))
  3. Sample rate mismatch

Debug:

print(f"Audio shape: {audio.shape}")
print(f"Audio range: [{audio.min()}, {audio.max()}]")
print(f"Sample rate: {sample_rate}")

12.5 Model Loading Timeout

Cause: First run downloads ~350MB model

Solution: Add loading indicator or pre-cache:

print("Loading Kokoro TTS Engine...")  # Shows in logs
pipeline = KPipeline(lang_code='a')    # Downloads model
print("Ready!")

13. Further Learning Resources

Official Documentation

  • Kokoro-82M model card: https://huggingface.co/hexgrad/Kokoro-82M
  • kokoro Python library: https://github.com/hexgrad/kokoro
  • Misaki G2P library: https://github.com/hexgrad/misaki
  • Gradio documentation: https://www.gradio.app/docs
  • Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces

Research Papers

| Paper | Topic |
|---|---|
| StyleTTS 2 (arXiv:2306.07691) | Core architecture |
| iSTFTNet (arXiv:2203.02395) | Vocoder |
| G2P blog | Grapheme-to-Phoneme |

Video Tutorials

Search for:

  • "Kokoro TTS tutorial"
  • "Gradio machine learning app"
  • "Hugging Face Spaces deployment"

Related Projects

| Project | Description |
|---|---|
| Coqui TTS | Open-source TTS library |
| Bark | Transformer-based TTS |
| VITS | Fast end-to-end TTS |
| Piper | Lightweight local TTS |

14. Glossary of Terms

| Term | Definition |
|---|---|
| TTS | Text-to-Speech: converting written text to spoken audio |
| G2P | Grapheme-to-Phoneme: converting letters to sounds |
| Grapheme | Written unit (letter or character) |
| Phoneme | Sound unit in a language |
| IPA | International Phonetic Alphabet: standard phoneme notation |
| Vocoder | Voice encoder: converts features to an audio waveform |
| Mel-spectrogram | Visual representation of audio frequencies over time |
| STFT | Short-Time Fourier Transform: converts audio to a spectrogram |
| iSTFT | Inverse STFT: converts a spectrogram back to audio |
| Embedding | Dense vector representation (e.g., voice identity) |
| Sample Rate | Audio samples per second (Hz) |
| Bit Depth | Precision of each audio sample |
| Normalization | Adjusting audio volume to a target level |
| Decibels (dB) | Logarithmic unit for audio levels |
| Transformer | Neural network architecture using attention |
| Inference | Running a trained model to make predictions |
| Latent Space | Compressed representation learned by a model |
| Fine-tuning | Adapting a pre-trained model to new data |
| API | Application Programming Interface |
| SDK | Software Development Kit |

Conclusion

Congratulations on completing this comprehensive guide! You now understand:

✅ How modern TTS systems work (from text to audio)
✅ The Kokoro-82M architecture (StyleTTS2 + ISTFTNet)
✅ The complete pipeline (G2P → Synthesis → Post-processing)
✅ Every file in the project and its purpose
✅ Audio processing fundamentals (sample rate, normalization, pitch shifting)
✅ Gradio interface development
✅ Hugging Face Spaces deployment

Next Steps

  1. Experiment: Try different voices and parameters
  2. Extend: Add new features (voice mixing, SSML support)
  3. Optimize: Profile and improve performance
  4. Learn: Dive into the StyleTTS2 paper
  5. Build: Create your own TTS applications!

Document created by Yash Chowdhary
For the Kokoro TTS Academic Project