license: mit
language:
- en
library_name: coreml
pipeline_tag: text-to-speech
tags:
- coreml
- tts
- kokoro
- apple-silicon
- ane
- on-device
Kokoro 82M: laishere CoreML port (7-stage, ANE-optimized)
CoreML conversion of hexgrad/Kokoro-82M split into a 7-stage chain for Apple Neural Engine residency, originally produced by @laishere (MIT). Repackaged here for use with FluidAudio.
What's in this repo
Both .mlpackage (source) and .mlmodelc (compiled, runtime-ready) formats ship in this repo. If your toolchain compiles models itself (e.g. with xcrun coremlcompiler or MLModel.compileModel(at:)), use the .mlpackage; FluidAudio loads the .mlmodelc directly to skip Core ML's first-run compile step.
| Stage | .mlpackage | .mlmodelc | Format | Compute target |
|---|---|---|---|---|
| KokoroAlbert | 5.6 MB | 5.6 MB | fp16 + int8 palettization | CPU + ANE |
| KokoroPostAlbert | 13 MB | 13 MB | fp16 + int8 palettization | CPU + ANE |
| KokoroAlignment | 20 KB | 32 KB | fp16 + int8 palettization | CPU + ANE |
| KokoroProsody | 8.1 MB | 8.2 MB | fp32 | CPU + GPU |
| KokoroNoise | 4.4 MB | 4.5 MB | fp32 | CPU + GPU |
| KokoroVocoder | 47 MB | 47 MB | fp16 + int8 palettization | CPU + ANE |
| KokoroTail | 92 KB | 100 KB | fp32 (iSTFT) | CPU + GPU |
Plus auxiliary files:
| File | Description | Size |
|---|---|---|
| vocab.json | 114 IPA → token IDs | 1.4 KB |
| af_heart.bin | flat fp32 [510, 256] voice pack | 512 KB |
Total: ~157 MB with both formats (~78 MB if you keep only .mlmodelc, vs the original ~330 MB PyTorch weights).
Pipeline
text → G2P (out-of-tree, e.g. FluidAudio's BART G2P)
→ IPA tokens [BOS, ..., EOS] (max 512)
→ Albert → hidden states
→ PostAlbert → text features
→ Alignment → T_a frames (dynamic)
→ Prosody → pitch + duration
→ Noise → noise embeddings (fp16/fp32 boundary)
→ Vocoder → x_pre features (discard `anchor` output)
→ Tail (iSTFT) → 24 kHz waveform
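The token step at the top of the pipeline can be sketched in Python. This is only an illustration under stated assumptions: it assumes vocab.json is a flat JSON object mapping each IPA symbol to an integer ID, that the 512-token limit includes BOS and EOS, and that unknown symbols are skipped. The `BOS_ID` and `EOS_ID` values are placeholders, not values confirmed by this repo, and `load_vocab` and `encode_ipa` are hypothetical helper names.

```python
import json

MAX_TOKENS = 512  # from the pipeline above; assumed here to include BOS and EOS
BOS_ID = 0        # assumption: placeholder value, not confirmed by this repo
EOS_ID = 0        # assumption: placeholder value, not confirmed by this repo


def load_vocab(path):
    """Load an IPA-symbol -> token-ID map (assumes a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return {str(k): int(v) for k, v in json.load(f).items()}


def encode_ipa(ipa, vocab):
    """Map an IPA string to [BOS, ..., EOS], skipping unknown symbols."""
    body = [vocab[ch] for ch in ipa if ch in vocab]
    body = body[: MAX_TOKENS - 2]  # leave room for BOS and EOS
    return [BOS_ID, *body, EOS_ID]
```

Truncating to fit the limit is the simplest policy; a real front end would more likely split long text into several utterances instead.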
Voice pack is indexed by row = clamp(T_enc - 1, 0, 509); columns [0:128] = timbre, [128:256] = style_s.
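A minimal Python sketch of that lookup, using only the standard library. It assumes af_heart.bin is stored row-major in the machine's native byte order (little-endian on Apple silicon) and reads T_enc as the encoder token count, which this README does not define explicitly. `load_voice_pack` and `select_style` are hypothetical helper names.

```python
from array import array

ROWS, COLS = 510, 256  # voice pack shape from the file table


def load_voice_pack(path):
    """Read a flat fp32 file into an array of ROWS * COLS floats."""
    pack = array("f")  # native byte order, 4 bytes per item
    with open(path, "rb") as f:
        pack.frombytes(f.read())
    if len(pack) != ROWS * COLS:
        raise ValueError(f"expected {ROWS * COLS} floats, got {len(pack)}")
    return pack


def select_style(pack, t_enc):
    """Return (timbre, style_s) for an encoder length of t_enc tokens."""
    row = min(max(t_enc - 1, 0), ROWS - 1)  # clamp(T_enc - 1, 0, 509)
    start = row * COLS
    vec = pack[start : start + COLS]
    return vec[:128], vec[128:]  # columns [0:128] and [128:256]
```

The clamp means any input longer than 510 tokens reuses the last row rather than reading past the end of the pack.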
Performance (Apple M2, 8-core)
| Stage | Steady-state latency |
|---|---|
| Albert | 7-10 ms |
| PostAlbert | 4-5 ms |
| Alignment | 1-2 ms |
| Prosody | 30-200 ms |
| Noise | 70-150 ms |
| Vocoder | 75-125 ms |
| Tail | 6-22 ms |
Cold model load (first run, anecompilerservice compilation): ~20 s. Warm load: ~300 ms. Steady-state RTFx: 3-11× depending on phrase length.
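To relate per-stage latencies to RTFx, the sketch below assumes the usual definition (seconds of audio produced per second of compute) and that the stages run sequentially with no overlap. The timings in `example_ms` are illustrative values picked from inside the ranges in the table, not measurements, and `rtfx` is a hypothetical helper name.

```python
SAMPLE_RATE = 24_000  # Tail stage output rate, from the pipeline section


def rtfx(num_samples, stage_ms):
    """Seconds of audio produced per second of compute (higher is faster)."""
    audio_s = num_samples / SAMPLE_RATE
    compute_s = sum(stage_ms.values()) / 1000.0
    return audio_s / compute_s


# Illustrative per-stage timings in milliseconds (not measured values).
example_ms = {
    "Albert": 8,
    "PostAlbert": 4,
    "Alignment": 1,
    "Prosody": 60,
    "Noise": 90,
    "Vocoder": 100,
    "Tail": 10,
}

# Two seconds of audio at 24 kHz against the example timings.
print(f"{rtfx(2 * SAMPLE_RATE, example_ms):.1f}x real time")
```

Because several stages have a cost that does not scale linearly with output length, short phrases tend to land at the low end of the RTFx range and long phrases at the high end.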
Usage with FluidAudio
swift run fluidaudiocli tts "Hello world" \
  --backend kokoro-lai \
  --output hello.wav \
  --metrics metrics.json

import FluidAudio

let manager = KokoroLaiManager()
try await manager.initialize()
let wav = try await manager.synthesize(text: "Hello world")
FluidAudio downloads this repo automatically into ~/.cache/fluidaudio/Models/kokoro-laishere/ on first use.
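If a download is interrupted, a quick check of the cache directory can help. The sketch below assumes a flat layout in which the seven compiled stages plus vocab.json and af_heart.bin sit directly under the cache path; the actual layout FluidAudio uses may differ, and `missing_assets` is a hypothetical helper name.

```python
from pathlib import Path

STAGES = ["Albert", "PostAlbert", "Alignment", "Prosody", "Noise", "Vocoder", "Tail"]
AUX_FILES = ["vocab.json", "af_heart.bin"]
DEFAULT_CACHE = Path.home() / ".cache/fluidaudio/Models/kokoro-laishere"


def missing_assets(root=DEFAULT_CACHE):
    """List expected entries that are absent under root (flat layout assumed)."""
    root = Path(root)
    expected = [f"Kokoro{stage}.mlmodelc" for stage in STAGES] + AUX_FILES
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    gaps = missing_assets()
    print("cache complete" if not gaps else f"missing: {', '.join(gaps)}")
```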
Conversion
Built with mobius/models/tts/kokoro/laishere-coreml (PyTorch 2.11 + coremltools 9.0). Reproduce:
cd mobius/models/tts/kokoro/laishere-coreml
uv sync
uv pip install --reinstall coremltools==9.0 # workaround sdist fallback
uv run python convert-coreml.py --output-dir build/laishere-kokoro
uv run python dump-benchmark-data.py --output-dir build/laishere-kokoro
for mlp in build/laishere-kokoro/Kokoro*.mlpackage; do
  xcrun coremlcompiler compile "$mlp" build/laishere-kokoro-compiled/
done
Parity vs PyTorch reference: waveform corr ≥ 0.80, mel-spectrogram corr ≥ 0.99 (verified by compare-models.py).
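For readers who want a rough idea of what a correlation threshold like this means, here is a minimal Pearson-correlation sketch in plain Python. It is not the logic of compare-models.py, whose details are not shown in this README; it assumes "corr" means Pearson correlation over time-aligned samples, and `pearson` and `passes_parity` are hypothetical helper names.

```python
import math


def pearson(a, b):
    """Pearson correlation over the overlapping prefix of two sequences."""
    n = min(len(a), len(b))
    if n < 2:
        raise ValueError("need at least two samples")
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_a * var_b)


def passes_parity(reference, candidate, threshold=0.80):
    """True when the correlation meets the given threshold."""
    return pearson(reference, candidate) >= threshold
```

Sample-level waveform correlation is sensitive to small phase and timing shifts, which is one plausible reason the waveform threshold is much looser than the mel-spectrogram one.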
Voices
This release ships only af_heart (American Female, "Heart"). Additional voices from hexgrad/Kokoro-82M can be re-exported by editing dump-benchmark-data.py's VOICE constant and copying the resulting <voice>.bin here.
License
MIT, inherited from upstream:
- Model weights: hexgrad/Kokoro-82M (Apache 2.0)
- CoreML conversion code + 7-stage architecture: laishere/kokoro-coreml (MIT, Lai Yongkang 2025)
- Repackaging: FluidInference (MIT)
See LICENSE for the upstream MIT text.
Citation
@misc{kokoro-laishere-coreml,
title = {Kokoro 82M: 7-stage CoreML conversion for Apple Neural Engine},
author = {Lai, Yongkang and FluidInference},
year = {2025},
url = {https://huggingface.co/FluidInference/kokoro-laishere-coreml}
}