# Kokoro CoreML (HAR-Optimized)

High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.

---

*Based on this open source project:* https://github.com/mattmireles/kokoro-coreml

## 📦 Included Models

### 🧠 Duration Model (Stage 1)

- `kokoro_duration.mlpackage`

Handles variable-length text and predicts phoneme durations and intermediate features.

---

### 🔊 HAR Decoder Buckets (Stage 2 – ANE Optimized)

Fixed-size audio synthesis models:

- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`

---

### 🔁 Decoder-Only Variants

- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`

---

### 🎛 F0 / Feature Variants

- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`

---

### 🔊 Vocoder Variants

- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`

---

### 🧪 Experimental / Alternative

- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`

---

# 📐 Architecture

This CoreML conversion uses a **two-stage pipeline** to support Kokoro's dynamic operations while maximizing ANE performance.
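Because the Stage 2 decoders come in fixed sizes, a client has to decide which bucket to load for a given utterance. The following is a minimal Python sketch of that choice, not the repository's actual logic. It assumes the `<N>s` suffix in the `KokoroDecoder_HAR_<N>s.mlpackage` file names is the bucket length in seconds; the names `HAR_BUCKETS_S`, `pick_bucket`, and `bucket_filename` are hypothetical.

```python
# Bucket lengths in seconds, read off the KokoroDecoder_HAR_<N>s.mlpackage
# file names listed above (assumption: the suffix is the audio length).
HAR_BUCKETS_S = [1, 2, 3, 5, 8, 10, 15, 20]


def pick_bucket(duration_s, buckets=HAR_BUCKETS_S):
    """Return the smallest bucket (in seconds) that fits duration_s.

    Returns None when the utterance is longer than every bucket, so the
    caller can decide how to fall back (for example by splitting the text).
    """
    for bucket in sorted(buckets):
        if duration_s <= bucket:
            return bucket
    return None


def bucket_filename(bucket_s):
    """Map a bucket length to the matching model file name."""
    return f"KokoroDecoder_HAR_{bucket_s}s.mlpackage"


if __name__ == "__main__":
    predicted = 2.4  # seconds of audio predicted by the duration model
    bucket = pick_bucket(predicted)
    print(bucket_filename(bucket))
```

Picking the smallest bucket that fits keeps the amount of padded, wasted synthesis low; what to do when no bucket fits is left to the caller here.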
---

## Stage 1 – Duration Model (CPU/GPU)

**Input:** Variable-length text (`ct.RangeDim`)

**Process:** Transformer + LSTM duration prediction

**Output:** Phoneme durations + intermediate features

**Compute:** CPU / GPU

Why CPU?

- LSTM layers are not ANE-compatible
- Dynamic shape text processing

---

## Stage 2 – HAR Decoder (ANE Optimized)

**Input:**

- Features from duration model
- Alignment matrix (built client-side)

**Process:** Vocoder synthesis using iSTFTNet architecture

**Output:** 24 kHz waveform audio

**Compute:** Apple Neural Engine

---

# 🚀 Key Innovations

- **HAR Processing** – Harmonic/phase separation for ANE efficiency
- **Fixed-size Buckets** – Avoid CoreML dynamic shape issues
- **Client-side Alignment** – Swift/Python builds alignment matrix
- **On-demand Model Loading** – Memory optimized
- **MIL Graph Patching** – CoreML compatibility fixes

---

# ⚡ Performance

### Runs on ANE (HAR Models)

- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops

**Result:** ~17× faster than real-time synthesis

---

### Runs on CPU/GPU (Duration Model)

- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing

---

# 🧠 Production Optimizations

- Bucket auto-selection
- ~200 MB per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle

---

# 📥 Downloading

Because `.mlpackage` is a folder, download using Hugging Face CLI:

```bash
huggingface-cli download / --local-dir .
```

---
license: mit
---
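
The client-side alignment step listed under Key Innovations can be illustrated with a short Python sketch. It assumes the alignment matrix is a phonemes-by-frames 0/1 matrix in which row `i` marks the frames assigned to phoneme `i`, padded with zero columns up to the bucket size. The exact layout and dtype the models expect are not stated here, and `build_alignment` is a hypothetical helper name.

```python
def build_alignment(durations, total_frames=None):
    """Expand per-phoneme frame counts into a 0/1 alignment matrix.

    durations:    list of non-negative ints, frames per phoneme
                  (as predicted by the duration model).
    total_frames: number of columns; defaults to sum(durations).
                  A larger value pads with all-zero columns, which is
                  one way to fill a fixed-size bucket (assumption).
    """
    used = sum(durations)
    if total_frames is None:
        total_frames = used
    if used > total_frames:
        raise ValueError("durations do not fit into total_frames")

    matrix = [[0.0] * total_frames for _ in durations]
    frame = 0
    for row, count in enumerate(durations):
        # Mark the contiguous run of frames that belongs to this phoneme.
        for t in range(frame, frame + count):
            matrix[row][t] = 1.0
        frame += count
    return matrix


if __name__ == "__main__":
    for row in build_alignment([2, 1, 3], total_frames=8):
        print(row)
```

Every used frame column contains exactly one `1.0`, so multiplying phoneme-level features by this matrix repeats each phoneme's features for its predicted number of frames.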