
# Kokoro TTS CoreML – Runtime Assets

This repository contains the runtime assets needed to run Kokoro TTS fully on-device with CoreML.

It includes the phoneme resources, G2P models, POS-tagging models, and CoreML TTS model bundles required for synthesis.

Reference implementation: https://github.com/philipdaquin/Kokoro-tts-coreml


πŸ“¦ Directory Structure

original/ # Original reference files and conversion artifacts
EspeakData/ # eSpeak phoneme + dictionary resources
g2p/ # Grapheme-to-Phoneme models
POSModels/ # Part-of-speech tagging models
TTSModels/ # CoreML TTS models (.mlpackage)


## 🧠 Overview

Kokoro TTS CoreML runs a multi-stage pipeline:

  1. Text normalization
  2. G2P (grapheme-to-phoneme) conversion
  3. POS tagging (context refinement)
  4. Duration prediction
  5. Waveform synthesis (HAR decoder + vocoder)

All components required for fully offline speech synthesis are included here.
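The staged flow above can be sketched as plain function composition. The function names and the per-character "G2P" below are toy stand-ins, not the project's actual API; the real pipeline routes each stage through CoreML model calls:

```python
# Sketch of the synthesis pipeline as function composition.
# All names here are hypothetical stand-ins for the CoreML-backed stages.

def normalize(text: str) -> str:
    # Stage 1: text normalization (whitespace/casing cleanup as a stand-in)
    return " ".join(text.lower().split())

def g2p(text: str) -> list[str]:
    # Stage 2: grapheme-to-phoneme conversion (toy per-character mapping;
    # the real model emits eSpeak-style phoneme sequences)
    return [c for c in text if c.isalpha()]

def synthesize_tokens(text: str) -> list[str]:
    # Chain the early stages; the real pipeline continues with POS tagging,
    # duration prediction, and the HAR decoder.
    return g2p(normalize(text))

print(synthesize_tokens("Hello,  world!"))
# → ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```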


πŸ“‚ Folder Details

EspeakData/

Contains phoneme definitions, language dictionaries, and pronunciation mappings used during G2P processing.

### g2p/

Grapheme-to-phoneme conversion models. These convert normalized text into phoneme sequences before duration prediction.

### POSModels/

Part-of-speech models used to refine pronunciation and contextual prosody.

### TTSModels/

Contains the CoreML models used for synthesis:

  • Duration model
  • HAR decoder buckets
  • Vocoder variants
  • Feature / F0 variants

These `.mlpackage` bundles are optimized for Apple Silicon and Apple Neural Engine (ANE) acceleration.


βš™οΈ Architecture

Kokoro CoreML uses a two-stage inference pipeline:

### Stage 1 – Duration Model (CPU/GPU)

  • Variable-length text input
  • Transformer + LSTM layers
  • Outputs phoneme durations + intermediate features

### Stage 2 – HAR Decoder (ANE Optimized)

  • Fixed-size synthesis buckets
  • iSTFTNet vocoder architecture
  • 24kHz waveform output
  • ~17Γ— faster than real-time on supported devices

πŸš€ Requirements

  • iOS 17+ / macOS Sonoma+
  • Apple Silicon recommended
  • ANE-capable hardware for optimal performance
  • ~200MB RAM per loaded model bucket

πŸ”§ Integration Notes

  • Load models on-demand
  • Select synthesis bucket dynamically based on predicted duration
  • First inference will be slower (warm-up effect)
  • Unload unused models to conserve memory
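The load-on-demand and unload advice amounts to keeping a small least-recently-used cache of model buckets. A sketch under the ~200 MB-per-bucket figure from the Requirements section; the cache budget, class name, and `load_model` placeholder are assumptions, with the placeholder standing in for an actual CoreML `.mlpackage` load:

```python
from collections import OrderedDict

MB_PER_MODEL = 200  # rough per-bucket RAM cost (see Requirements)

class ModelCache:
    """LRU cache of loaded model buckets under a fixed memory budget."""

    def __init__(self, budget_mb: int = 600):
        self.budget_mb = budget_mb
        self.cache: OrderedDict[str, object] = OrderedDict()

    def load_model(self, name: str) -> object:
        # Placeholder for compiling/loading a CoreML .mlpackage.
        return f"<model {name}>"

    def get(self, name: str) -> object:
        if name in self.cache:
            self.cache.move_to_end(name)  # mark as recently used
        else:
            # Evict least-recently-used buckets to stay under budget.
            while (len(self.cache) + 1) * MB_PER_MODEL > self.budget_mb:
                self.cache.popitem(last=False)
            self.cache[name] = self.load_model(name)
        return self.cache[name]

cache = ModelCache(budget_mb=600)  # room for three buckets
for bucket in ["b128", "b256", "b512", "b128", "b1024"]:
    cache.get(bucket)
print(list(cache.cache))  # ['b512', 'b128', 'b1024']
```

Selecting which bucket to request would follow the predicted duration from Stage 1, so only the buckets a given utterance actually needs stay resident.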

πŸ“₯ Usage

Clone the repository:

git clone https://huggingface.co/\<username>/<repo>

Or download via Hugging Face CLI:

huggingface-cli download / --local-dir .


πŸ“Œ License

Refer to the original Kokoro TTS license and any included third-party licenses inside their respective folders.

Ensure attribution is preserved if redistributing.


πŸ™ Credits

Based on the Kokoro TTS CoreML conversion project:
https://github.com/philipdaquin/Kokoro-tts-coreml
