# Kokoro CoreML (HAR-Optimized)

High-performance **Kokoro TTS CoreML conversion** with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled `.mlpackage` models for fast on-device speech synthesis on Apple platforms.

---

*Based on the open-source project:*
https://github.com/mattmireles/kokoro-coreml

## Included Models

### Duration Model (Stage 1)

- `kokoro_duration.mlpackage`

Handles variable-length text and predicts phoneme durations plus intermediate features.

---

### HAR Decoder Buckets (Stage 2, ANE-Optimized)

Fixed-size audio synthesis models:

- `KokoroDecoder_HAR_1s.mlpackage`
- `KokoroDecoder_HAR_2s.mlpackage`
- `KokoroDecoder_HAR_3s.mlpackage`
- `KokoroDecoder_HAR_5s.mlpackage`
- `KokoroDecoder_HAR_8s.mlpackage`
- `KokoroDecoder_HAR_10s.mlpackage`
- `KokoroDecoder_HAR_15s.mlpackage`
- `KokoroDecoder_HAR_20s.mlpackage`
- `KokoroDecoder_HAR.mlpackage`

---

### Decoder-Only Variants

- `kokoro_decoder_only_3s.mlpackage`
- `kokoro_decoder_only_5s.mlpackage`
- `kokoro_decoder_only_10s.mlpackage`

---

### F0 / Feature Variants

- `kokoro_f0n_3s.mlpackage`
- `kokoro_f0n_5s.mlpackage`
- `kokoro_f0n_10s.mlpackage`

---

### Vocoder Variants

- `KokoroVocoder.mlpackage`
- `KokoroVocoder_asr64_f0128.mlpackage`
- `KokoroVocoder_asr80_f0160.mlpackage`
- `KokoroVocoder_asr96_f0192.mlpackage`
- `KokoroVocoder_asr128_f0256.mlpackage`
- `KokoroVocoder_asr160_f0320.mlpackage`
- `KokoroVocoder_asr200_f0400.mlpackage`

---

### Experimental / Alternative

- `kokoro_synthesizer_3s.mlpackage`
- `kokoro_synthesizer_3s_nolstm.mlpackage`
- `StyleTTS2_iSTFTNet_Decoder.mlpackage`

# Architecture

This CoreML conversion uses a **two-stage pipeline** to support Kokoro's dynamic operations while maximizing ANE performance.

---

## Stage 1: Duration Model (CPU/GPU)

**Input:** Variable-length text (`ct.RangeDim`)
**Process:** Transformer + LSTM duration prediction
**Output:** Phoneme durations + intermediate features
**Compute:** CPU / GPU

Why CPU/GPU instead of the ANE?
- LSTM layers are not ANE-compatible
- Text processing requires dynamic shapes

---

## Stage 2: HAR Decoder (ANE-Optimized)

**Input:**
- Features from the duration model
- Alignment matrix (built client-side)

**Process:**
Vocoder synthesis using the iSTFTNet architecture

**Output:**
24 kHz waveform audio

**Compute:**
Apple Neural Engine
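
Putting the two stages together, the client-side flow can be sketched in Python. This is an illustrative sketch, not the repository's actual code: the CoreML model calls are omitted, and `FRAMES_PER_SECOND` is a hypothetical feature frame rate chosen for illustration. The key point is that Stage 1 output is variable-length, so its features must be zero-padded to the decoder's fixed bucket length, and the decoder's fixed-size output trimmed back to the true duration.

```python
SAMPLE_RATE = 24_000      # the HAR decoders emit 24 kHz audio
FRAMES_PER_SECOND = 100   # hypothetical feature frame rate, for illustration

def pad_to_bucket(frames, bucket_seconds):
    """Zero-pad variable-length Stage 1 features to a fixed bucket length.

    The HAR decoders take fixed-size inputs, so the variable-length
    features from the duration model must be padded before Stage 2.
    """
    target = bucket_seconds * FRAMES_PER_SECOND
    if len(frames) > target:
        raise ValueError("features exceed bucket; choose a larger bucket")
    return frames + [0.0] * (target - len(frames))

def trim_audio(samples, true_seconds):
    """Drop the zero-padded tail of the fixed-size decoder output."""
    return samples[: int(true_seconds * SAMPLE_RATE)]
```

For example, 2.5 s of features padded into the 3 s bucket yields 3 s of decoder output, which is then trimmed back to 2.5 s (60,000 samples at 24 kHz).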

---

# Key Innovations

- **HAR Processing**: harmonic/phase separation for ANE efficiency
- **Fixed-size Buckets**: avoids CoreML dynamic-shape issues
- **Client-side Alignment**: the Swift/Python client builds the alignment matrix
- **On-demand Model Loading**: memory-optimized
- **MIL Graph Patching**: CoreML compatibility fixes
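
The client-side alignment step named above can be sketched as follows. This is an illustrative Python version, not the repository's actual Swift/Python code: given integer per-phoneme frame counts from the duration model, it expands them into a one-hot alignment matrix mapping each output frame back to its phoneme.

```python
def build_alignment(durations):
    """Build a (total_frames x num_phonemes) one-hot alignment matrix.

    durations: list of integer frame counts, one per phoneme,
    as predicted by the Stage 1 duration model.
    """
    num_phonemes = len(durations)
    total_frames = sum(durations)
    matrix = [[0.0] * num_phonemes for _ in range(total_frames)]
    frame = 0
    for phoneme_idx, frames in enumerate(durations):
        for _ in range(frames):
            matrix[frame][phoneme_idx] = 1.0
            frame += 1
    return matrix

# Example: three phonemes lasting 2, 1, and 3 frames give a 6 x 3 matrix;
# frames 0-1 attend to phoneme 0, frame 2 to phoneme 1, frames 3-5 to phoneme 2.
align = build_alignment([2, 1, 3])
```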

---

# Performance

### Runs on the ANE (HAR Models)

- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops

**Result:** ~17× faster than real-time synthesis

---

### Runs on CPU/GPU (Duration Model)

- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing

---

# Production Optimizations

- Bucket auto-selection
- ~200 MB per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle
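
Bucket auto-selection can be sketched as picking the smallest HAR bucket that fits the predicted audio length. The bucket sizes below match the `KokoroDecoder_HAR_*s.mlpackage` files listed earlier; the selection logic itself is an assumption, not the repository's actual implementation.

```python
# Fixed HAR decoder bucket lengths in seconds, matching the
# KokoroDecoder_HAR_*s.mlpackage files shipped in this repo.
BUCKETS = [1, 2, 3, 5, 8, 10, 15, 20]

def select_bucket(predicted_seconds):
    """Return the smallest bucket that can hold the predicted audio.

    Falls back to the largest bucket if the prediction exceeds 20 s;
    longer text would then need to be split into chunks upstream.
    """
    for bucket in BUCKETS:
        if predicted_seconds <= bucket:
            return bucket
    return BUCKETS[-1]
```

For example, a 4.2 s prediction selects the 5 s bucket; predictions beyond 20 s fall back to the largest bucket and rely on upstream text chunking.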

---

# Downloading

Because an `.mlpackage` is a directory bundle, download it with the Hugging Face CLI rather than fetching files individually:

```bash
huggingface-cli download <username>/<repo> --local-dir .
```

---

## License

MIT