# Kokoro CoreML (HAR-Optimized)

High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.

---

*Based on the open-source project:*
https://github.com/mattmireles/kokoro-coreml

## Included Models

### Duration Model (Stage 1)

- `kokoro_duration.mlpackage`

Handles variable-length text and predicts phoneme durations plus intermediate features.

---

### HAR Decoder Buckets (Stage 2, ANE-Optimized)

Fixed-size audio synthesis models:

- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`

---

### Decoder-Only Variants

- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`

---

### F0 / Feature Variants

- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`

---

### Vocoder Variants

- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`

---

### Experimental / Alternative

- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`

# Architecture

This CoreML conversion uses a **two-stage pipeline** to support Kokoro's dynamic operations while maximizing ANE performance.

---

## Stage 1: Duration Model (CPU/GPU)

**Input:** Variable-length text (`ct.RangeDim`)
**Process:** Transformer + LSTM duration prediction
**Output:** Phoneme durations + intermediate features
**Compute:** CPU / GPU

Why CPU/GPU instead of the ANE?
- LSTM layers are not ANE-compatible
- Text processing requires dynamic shapes

---

## Stage 2: HAR Decoder (ANE-Optimized)

**Input:**
- Features from the duration model
- Alignment matrix (built client-side)

**Process:**
Vocoder synthesis using the iSTFTNet architecture

**Output:**
24 kHz waveform audio

**Compute:**
Apple Neural Engine
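
Putting the two stages together, the client-side flow can be sketched in Python. This is an illustrative sketch, not the repository's actual code: the CoreML model calls are omitted, and `FRAMES_PER_SECOND` is a hypothetical feature frame rate chosen for illustration. The key point is that Stage 1 output is variable-length, so its features must be zero-padded to the decoder's fixed bucket length, and the decoder's fixed-size output trimmed back to the true duration.

```python
SAMPLE_RATE = 24_000      # the HAR decoders emit 24 kHz audio
FRAMES_PER_SECOND = 100   # hypothetical feature frame rate, for illustration

def pad_to_bucket(frames, bucket_seconds):
    """Zero-pad variable-length Stage 1 features to a fixed bucket length.

    The HAR decoders take fixed-size inputs, so the variable-length
    features from the duration model must be padded before Stage 2.
    """
    target = bucket_seconds * FRAMES_PER_SECOND
    if len(frames) > target:
        raise ValueError("features exceed bucket; choose a larger bucket")
    return frames + [0.0] * (target - len(frames))

def trim_audio(samples, true_seconds):
    """Drop the zero-padded tail of the fixed-size decoder output."""
    return samples[: int(true_seconds * SAMPLE_RATE)]
```

For example, 2.5 s of features padded into the 3 s bucket yields 3 s of decoder output, which is then trimmed back to 2.5 s (60,000 samples at 24 kHz).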

---

# Key Innovations

- **HAR Processing**: harmonic/phase separation for ANE efficiency
- **Fixed-size Buckets**: avoids CoreML dynamic-shape issues
- **Client-side Alignment**: the Swift/Python client builds the alignment matrix
- **On-demand Model Loading**: memory-optimized
- **MIL Graph Patching**: CoreML compatibility fixes
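
The client-side alignment step named above can be sketched as follows. This is an illustrative Python version, not the repository's actual Swift/Python code: given integer per-phoneme frame counts from the duration model, it expands them into a one-hot alignment matrix mapping each output frame back to its phoneme.

```python
def build_alignment(durations):
    """Build a (total_frames x num_phonemes) one-hot alignment matrix.

    durations: list of integer frame counts, one per phoneme,
    as predicted by the Stage 1 duration model.
    """
    num_phonemes = len(durations)
    total_frames = sum(durations)
    matrix = [[0.0] * num_phonemes for _ in range(total_frames)]
    frame = 0
    for phoneme_idx, frames in enumerate(durations):
        for _ in range(frames):
            matrix[frame][phoneme_idx] = 1.0
            frame += 1
    return matrix

# Example: three phonemes lasting 2, 1, and 3 frames give a 6 x 3 matrix;
# frames 0-1 attend to phoneme 0, frame 2 to phoneme 1, frames 3-5 to phoneme 2.
align = build_alignment([2, 1, 3])
```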

---

# Performance

### Runs on the ANE (HAR Models)

- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops

**Result:** ~17× faster than real-time synthesis

---

### Runs on CPU/GPU (Duration Model)

- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing

---

# Production Optimizations

- Bucket auto-selection
- ~200 MB per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle
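
Bucket auto-selection can be sketched as picking the smallest HAR bucket that fits the predicted audio length. The bucket sizes below match the `KokoroDecoder_HAR_*s.mlpackage` files listed earlier; the selection logic itself is an assumption, not the repository's actual implementation.

```python
# Fixed HAR decoder bucket lengths in seconds, matching the
# KokoroDecoder_HAR_*s.mlpackage files shipped in this repo.
BUCKETS = [1, 2, 3, 5, 8, 10, 15, 20]

def select_bucket(predicted_seconds):
    """Return the smallest bucket that can hold the predicted audio.

    Falls back to the largest bucket if the prediction exceeds 20 s;
    longer text would then need to be split into chunks upstream.
    """
    for bucket in BUCKETS:
        if predicted_seconds <= bucket:
            return bucket
    return BUCKETS[-1]
```

For example, a 4.2 s prediction selects the 5 s bucket; predictions beyond 20 s fall back to the largest bucket and rely on upstream text chunking.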

---

# Downloading

Because an `.mlpackage` is a directory bundle, download it with the Hugging Face CLI rather than fetching files individually:

```bash
huggingface-cli download <username>/<repo> --local-dir .
```

---

## License

MIT