# Kokoro CoreML (HAR-Optimized)

High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.

---

*Based on this open source project:*
https://github.com/mattmireles/kokoro-coreml


## 📦 Included Models

### 🧠 Duration Model (Stage 1)

- `kokoro_duration.mlpackage`

Handles variable-length text and predicts phoneme durations + intermediate features.

---

### 🔊 HAR Decoder Buckets (Stage 2 – ANE Optimized)

Fixed-size audio synthesis models:

- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`

---

### 🔁 Decoder-Only Variants

- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`

---

### 🎛 F0 / Feature Variants

- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`

---

### 🔊 Vocoder Variants

- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`

---

### 🧪 Experimental / Alternative

- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`

---

# 📐 Architecture

This CoreML conversion uses a **two-stage pipeline** to support Kokoro’s dynamic operations while maximizing ANE performance.

---

## Stage 1 — Duration Model (CPU/GPU)

**Input:** Variable-length text (`ct.RangeDim`)  
**Process:** Transformer + LSTM duration prediction  
**Output:** Phoneme durations + intermediate features  
**Compute:** CPU / GPU  

Why CPU/GPU instead of ANE?
- LSTM layers are not ANE-compatible
- Variable-length text requires dynamic input shapes

---

## Stage 2 — HAR Decoder (ANE Optimized)

**Input:**
- Features from duration model  
- Alignment matrix (built client-side)

**Process:**  
Vocoder synthesis using iSTFTNet architecture

**Output:**  
24 kHz waveform audio

**Compute:**  
Apple Neural Engine

---
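
The client-side alignment step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_alignment`, the phonemes-by-frames orientation, and the zero padding up to the bucket length are all assumptions. Check the real input shape in the model's spec before relying on it.

```python
from typing import List


def build_alignment(durations: List[int], total_frames: int) -> List[List[int]]:
    """Return a (num_phonemes x total_frames) matrix of 0s and 1s.

    Row i has ones over the frames assigned to phoneme i. Frames past
    sum(durations) stay zero, which pads the matrix to the fixed bucket size.
    NOTE: orientation and padding scheme are assumptions for illustration.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    if sum(durations) > total_frames:
        raise ValueError("durations exceed the bucket's frame budget")

    matrix = [[0] * total_frames for _ in durations]
    start = 0
    for row, duration in zip(matrix, durations):
        for frame in range(start, start + duration):
            row[frame] = 1
        start += duration
    return matrix


# Three phonemes lasting 2, 1, and 3 frames, padded to an 8-frame bucket.
alignment = build_alignment([2, 1, 3], total_frames=8)
for row in alignment:
    print(row)
```

Each column holds at most one `1`, so multiplying per-phoneme features by this matrix repeats every phoneme's features for as many frames as its predicted duration.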

# 🚀 Key Innovations

- **HAR Processing** – Harmonic/phase separation for ANE efficiency  
- **Fixed-size Buckets** – Avoid CoreML dynamic shape issues  
- **Client-side Alignment** – Swift or Python code builds the alignment matrix  
- **On-demand Model Loading** – Memory optimized  
- **MIL Graph Patching** – CoreML compatibility fixes  

---
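
One way to picture on-demand model loading is a small least-recently-used cache that keeps only a few buckets resident. The sketch below is hypothetical: `ModelCache`, `max_loaded`, and the stand-in `fake_loader` are illustrative names, and a real loader would open the `.mlpackage` (for example through coremltools' `MLModel` in Python) instead of returning a string.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Load models lazily and evict the least recently used one."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2):
        self._loader = loader          # called only on a cache miss
        self._max_loaded = max_loaded  # hypothetical memory budget, in models
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = self._loader(name)
        self._models[name] = model
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # drop the oldest entry
        return model

    def clear(self) -> None:
        """Release every cached model, e.g. after an idle period."""
        self._models.clear()

    @property
    def loaded(self) -> List[str]:
        return list(self._models)


def fake_loader(name: str) -> str:
    # Stand-in for real model loading (assumption for this example).
    return f"<loaded {name}>"


cache = ModelCache(fake_loader, max_loaded=2)
cache.get("KokoroDecoder_HAR_3s")
cache.get("KokoroDecoder_HAR_5s")
cache.get("KokoroDecoder_HAR_3s")   # cache hit, refreshes the 3s bucket
cache.get("KokoroDecoder_HAR_10s")  # evicts the 5s bucket
print(cache.loaded)
```

Bounding the number of resident models trades an occasional reload for a predictable memory ceiling.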

# ⚡ Performance

### Runs on ANE (HAR Models)

- Conv1D  
- ConvTranspose1D  
- LeakyReLU  
- Element-wise ops  

**Result:** synthesis runs ~17× faster than real time

---
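
To make the speed figure concrete, the arithmetic can be written out as below. The helper names are hypothetical, and the ~17× value is simply the number quoted above; actual timings depend on the device and the bucket used.

```python
def synthesis_time(audio_seconds: float, speedup: float = 17.0) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` of audio."""
    return audio_seconds / speedup


def real_time_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Measured speedup: seconds of audio produced per second of compute."""
    return audio_seconds / elapsed_seconds


# At ~17x real time, a 10 s clip takes roughly 10 / 17 ≈ 0.59 s to generate.
print(f"{synthesis_time(10.0):.2f} s")
```

Running `real_time_factor` on your own measurements is a quick way to compare a device against the quoted figure.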

### Runs on CPU/GPU (Duration Model)

- LSTM layers  
- Transformer attention  
- AdaLayerNorm  
- Dynamic shape processing  

---

# 🧠 Production Optimizations

- Bucket auto-selection  
- ~200 MB per loaded model  
- Warm-up optimization  
- Graceful bucket fallback  
- Memory cleanup during idle  

---
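
Bucket auto-selection with fallback might look like the sketch below. It assumes the selection rule is "smallest bucket that fits" and that fallback means moving up to the next larger bucket when the preferred one is unavailable. The repository does not spell out either rule, so treat `select_bucket` and its `available` parameter as illustrative. The bucket sizes are taken from the `KokoroDecoder_HAR_*s.mlpackage` filenames.

```python
from typing import Iterable, Optional

# Bucket lengths in seconds, read off the HAR decoder filenames.
BUCKET_SECONDS = [1, 2, 3, 5, 8, 10, 15, 20]


def bucket_filename(seconds: int) -> str:
    return f"KokoroDecoder_HAR_{seconds}s.mlpackage"


def select_bucket(audio_seconds: float,
                  available: Optional[Iterable[int]] = None) -> int:
    """Return the smallest usable bucket that can hold `audio_seconds`.

    `available` optionally restricts the choice (e.g. to buckets that
    loaded successfully); a missing bucket falls back to the next larger one.
    """
    candidates = sorted(BUCKET_SECONDS if available is None else available)
    for seconds in candidates:
        if seconds >= audio_seconds:
            return seconds
    # Assumption: longer inputs must be split by the caller.
    raise ValueError(f"no bucket can hold {audio_seconds:.2f} s of audio")


print(bucket_filename(select_bucket(2.4)))                   # 3 s bucket
print(bucket_filename(select_bucket(2.4, [1, 2, 5, 8])))     # falls back to 5 s
```

Choosing the smallest fitting bucket keeps padding, and therefore wasted compute, to a minimum.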

# 📥 Downloading

Because each `.mlpackage` is a directory rather than a single file, download the repository with the Hugging Face CLI:

```bash
huggingface-cli download <username>/<repo> --local-dir .
```

---

# 📄 License

MIT