# Kokoro CoreML (HAR-Optimized)
High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.
This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.
---
*Based on this open-source project:*
https://github.com/mattmireles/kokoro-coreml
## 📦 Included Models
### 🧠 Duration Model (Stage 1)
- `kokoro_duration.mlpackage`
Handles variable-length text and predicts phoneme durations + intermediate features.
---
### 🔊 HAR Decoder Buckets (Stage 2 – ANE Optimized)
Fixed-size audio synthesis models:
- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`
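
Choosing among these buckets can be sketched as follows. This is a minimal illustration, not the repository's actual selection logic: the bucket lengths are read off the file names above, while `pick_bucket` and its "return `None` when nothing fits" behavior are assumptions made for the example.

```python
from typing import Optional

# Bucket lengths in seconds, taken from the file names listed above.
BUCKET_SECONDS = [1, 2, 3, 5, 8, 10, 15, 20]


def pick_bucket(duration_s: float) -> Optional[str]:
    """Return the smallest bucket that fits `duration_s`, or None if none does.

    Hypothetical helper: a caller that gets None would need to split the
    utterance into shorter chunks before synthesis.
    """
    for seconds in BUCKET_SECONDS:
        if duration_s <= seconds:
            return f"KokoroDecoder_HAR_{seconds}s.mlpackage"
    return None


print(pick_bucket(2.4))
print(pick_bucket(25.0))
```

Picking the smallest bucket that fits keeps padding, and therefore wasted compute, to a minimum.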
---
### Decoder-Only Variants
- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`
---
### 🎛 F0 / Feature Variants
- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`
---
### 🔊 Vocoder Variants
- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`
---
### 🧪 Experimental / Alternative
- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`
---
# Architecture
This CoreML conversion uses a **two-stage pipeline** to support Kokoro's dynamic operations while maximizing ANE performance.
---
## Stage 1 – Duration Model (CPU/GPU)
**Input:** Variable-length text (`ct.RangeDim`)
**Process:** Transformer + LSTM duration prediction
**Output:** Phoneme durations + intermediate features
**Compute:** CPU / GPU
Why not the ANE?
- LSTM layers are not ANE-compatible
- Variable-length text input requires dynamic shapes
---
## Stage 2 – HAR Decoder (ANE Optimized)
**Input:**
- Features from duration model
- Alignment matrix (built client-side)
**Process:**
Vocoder synthesis using iSTFTNet architecture
**Output:**
24 kHz waveform audio
**Compute:**
Apple Neural Engine
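
The client-side alignment step can be pictured as expanding per-phoneme durations into a 0/1 matrix that maps each phoneme to the frames it occupies. The sketch below assumes a particular layout (rows are phonemes, columns are frames, zero-padded out to a fixed frame count); the shapes and padding the models in this repository actually expect are not stated here, and `build_alignment` is a hypothetical name.

```python
from typing import List


def build_alignment(durations: List[int], total_frames: int) -> List[List[float]]:
    """Expand per-phoneme frame counts into a (phonemes x frames) 0/1 matrix.

    Frames beyond sum(durations) stay zero, which pads the matrix out to a
    fixed bucket width. Raises if the durations do not fit.
    """
    if sum(durations) > total_frames:
        raise ValueError("durations exceed the bucket's frame count")
    matrix = [[0.0] * total_frames for _ in durations]
    start = 0
    for row, frames in enumerate(durations):
        for col in range(start, start + frames):
            matrix[row][col] = 1.0
        start += frames
    return matrix


for row in build_alignment([2, 1, 3], total_frames=8):
    print(row)
```

Each column contains at most one 1, so multiplying phoneme-level features by this matrix repeats every phoneme's features for as many frames as its predicted duration.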
---
# 🚀 Key Innovations
- **HAR Processing** – Harmonic/phase separation for ANE efficiency
- **Fixed-size Buckets** – Avoid CoreML dynamic shape issues
- **Client-side Alignment** – Swift/Python builds alignment matrix
- **On-demand Model Loading** – Memory optimized
- **MIL Graph Patching** – CoreML compatibility fixes
---
# ⚑ Performance
### Runs on ANE (HAR Models)
- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops
**Result:** ~17× faster than real-time synthesis
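
To make the speed figure concrete: a speed-up of N× over real time means that t seconds of audio take roughly t / N seconds to synthesize. The helper below is illustrative arithmetic on the ~17× figure only; it is not a benchmark, `synthesis_time_s` is a hypothetical name, and real latency will vary with device, bucket, and warm-up state.

```python
def synthesis_time_s(audio_s: float, speedup: float = 17.0) -> float:
    """Estimated wall-clock time to synthesize `audio_s` seconds of audio."""
    return audio_s / speedup


for seconds in (1, 5, 20):
    estimate_ms = synthesis_time_s(seconds) * 1000
    print(f"{seconds:>2} s of audio -> ~{estimate_ms:.0f} ms")
```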
---
### Runs on CPU/GPU (Duration Model)
- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing
---
# 🧠 Production Optimizations
- Bucket auto-selection
- ~200 MB of memory per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle
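
On-demand loading under a memory ceiling can be sketched as a small least-recently-used cache. Everything here is illustrative: `ModelCache`, the `loader` callable, and the default of two resident models (roughly 400 MB at ~200 MB each) are assumptions, and a real loader would wrap the platform's Core ML model-loading call.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keep at most `max_loaded` models resident, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2) -> None:
        self._loader = loader  # hypothetical: wraps the actual model-loading call
        self._max_loaded = max_loaded
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)    # mark as most recently used
            return self._models[name]
        model = self._loader(name)            # load only when first requested
        self._models[name] = model
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)  # drop the least recently used
        return model

    def clear(self) -> None:
        """Release everything, e.g. after a period of inactivity."""
        self._models.clear()

    def loaded(self) -> list:
        return list(self._models)


# A stand-in loader keeps the example runnable without Core ML.
cache = ModelCache(loader=lambda name: f"<model {name}>", max_loaded=2)
cache.get("KokoroDecoder_HAR_3s.mlpackage")
cache.get("KokoroDecoder_HAR_5s.mlpackage")
cache.get("KokoroDecoder_HAR_3s.mlpackage")   # refreshes the 3s bucket
cache.get("KokoroDecoder_HAR_10s.mlpackage")  # evicts the 5s bucket
print(cache.loaded())
```

Calling `clear()` from an idle timer is one way to approximate the idle-time cleanup listed above.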
---
# 📥 Downloading
Because each `.mlpackage` is a directory rather than a single file, download the repository with the Hugging Face CLI:
```bash
huggingface-cli download <username>/<repo> --local-dir .
```
---
license: mit
---