---
license: mit
language:
- en
library_name: coreml
pipeline_tag: text-to-speech
tags:
- coreml
- tts
- kokoro
- apple-silicon
- ane
- on-device
---
# Kokoro 82M: laishere CoreML port (7-stage, ANE-optimized)
CoreML conversion of [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) split into a **7-stage chain** for Apple Neural Engine residency, originally produced by [@laishere](https://github.com/laishere/kokoro-coreml) (MIT). Repackaged here for use with [FluidAudio](https://github.com/FluidInference/FluidAudio).
## What's in this repo
Both `.mlpackage` (source) and `.mlmodelc` (compiled, runtime-ready) formats ship in this repo. Workflows that compile the model themselves (e.g. with `xcrun coremlcompiler` or `MLModel.compileModel(at:)`) can start from the `.mlpackage`; FluidAudio loads the `.mlmodelc` directly to skip Apple's first-run compile step.
| Stage | `.mlpackage` | `.mlmodelc` | Format | Compute target |
|---|---|---|---|---|
| `KokoroAlbert` | 5.6 MB | 5.6 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroPostAlbert` | 13 MB | 13 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroAlignment` | 20 KB | 32 KB | fp16 + int8 palettization | CPU + ANE |
| `KokoroProsody` | 8.1 MB | 8.2 MB | fp32 | CPU + GPU |
| `KokoroNoise` | 4.4 MB | 4.5 MB | fp32 | CPU + GPU |
| `KokoroVocoder` | 47 MB | 47 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroTail` | 92 KB | 100 KB | fp32 (iSTFT) | CPU + GPU |
Plus auxiliary files:
| File | Description | Size |
|---|---|---|
| `vocab.json` | 114 IPA symbols → token IDs | 1.4 KB |
| `af_heart.bin` | flat fp32 `[510, 256]` voice pack | 512 KB |
Total: **~157 MB** with both formats (~78 MB if you keep only `.mlmodelc`, vs the original ~330 MB PyTorch weights).
## Pipeline
```
text → G2P (out-of-tree, e.g. FluidAudio's BART G2P)
     → IPA tokens [BOS, ..., EOS] (max 512)
     → Albert → hidden states
     → PostAlbert → text features
     → Alignment → T_a frames (dynamic)
     → Prosody → pitch + duration
     → Noise → noise embeddings (fp16→fp32 boundary)
     → Vocoder → x_pre features (discard `anchor` output)
     → Tail (iSTFT) → 24 kHz waveform
```
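The token-framing step (`[BOS, ..., EOS]`, max 512) can be sketched as below. This is a minimal illustration, not FluidAudio's tokenizer. Several details are assumptions that this README does not state: that `vocab.json` maps single IPA symbols to integer IDs, that BOS and EOS both use ID 0, that unknown symbols are dropped, and that the 512 cap includes the two boundary tokens (which would be consistent with the 510-row voice pack). The function name `frame_tokens` is hypothetical.

```python
def frame_tokens(phonemes, vocab, bos_id=0, eos_id=0, max_len=512):
    """Map IPA symbols to token IDs and wrap them in BOS/EOS.

    Assumptions (not confirmed by this README): bos_id == eos_id == 0,
    unknown symbols are skipped, and max_len counts BOS and EOS.
    """
    # Keep only symbols that the vocabulary knows about.
    ids = [vocab[symbol] for symbol in phonemes if symbol in vocab]
    # Leave room for the two boundary tokens.
    ids = ids[: max_len - 2]
    return [bos_id, *ids, eos_id]


# Toy vocabulary for illustration only; the real one lives in vocab.json.
toy_vocab = {"h": 1, "ə": 2, "l": 3, "o": 4}
tokens = frame_tokens("həlo!", toy_vocab)  # "!" is not in toy_vocab and is dropped
```

With the real vocabulary, the mapping would be loaded via `json.load` from `vocab.json` instead of the toy dictionary.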
The voice pack is indexed by `row = clamp(T_enc - 1, 0, 509)`; within a row, columns `[0:128]` hold the timbre vector and columns `[128:256]` hold `style_s`.
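The row lookup can be sketched with the standard library alone. The shape `[510, 256]`, the clamp, and the column split come from this README; the little-endian, row-major byte layout is an assumption, and `voice_row` is a hypothetical helper name rather than part of FluidAudio.

```python
import struct

ROWS, COLS = 510, 256   # shape of af_heart.bin as documented above
BYTES_PER_FLOAT = 4     # fp32


def voice_row(pack: bytes, t_enc: int):
    """Return (timbre, style_s) for an encoded length t_enc."""
    expected = ROWS * COLS * BYTES_PER_FLOAT
    if len(pack) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pack)}")
    # row = clamp(T_enc - 1, 0, 509)
    row = max(0, min(t_enc - 1, ROWS - 1))
    offset = row * COLS * BYTES_PER_FLOAT
    # Assumption: values are stored little-endian in row-major order.
    values = struct.unpack_from(f"<{COLS}f", pack, offset)
    return list(values[:128]), list(values[128:])
```

In practice `pack` would come from reading `af_heart.bin` in binary mode, for example `open("af_heart.bin", "rb").read()`.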
## Performance (Apple M2, 8-core)
| Stage | Steady-state |
|---|---|
| Albert | 7-10 ms |
| PostAlbert | 4-5 ms |
| Alignment | 1-2 ms |
| Prosody | 30-200 ms |
| Noise | 70-150 ms |
| Vocoder | 75-125 ms |
| Tail | 6-22 ms |
Cold model load (first run, `anecompilerservice` compilation): **~20 s**. Warm load: **~300 ms**. Steady-state RTFx (audio duration divided by synthesis time): **3-11×** depending on phrase length.
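To relate the per-stage numbers to RTFx, the sketch below sums the ranges from the table and computes RTFx from a sample count at the 24 kHz output rate. Summing the minima and maxima assumes the stages run strictly one after another and that their extremes can coincide, which this README does not claim, so the totals are rough bounds rather than measurements. The function name `rtfx` is hypothetical.

```python
SAMPLE_RATE = 24_000  # Hz, output rate of the Tail stage

# (min_ms, max_ms) per stage, copied from the table above.
STAGE_MS = {
    "Albert": (7, 10),
    "PostAlbert": (4, 5),
    "Alignment": (1, 2),
    "Prosody": (30, 200),
    "Noise": (70, 150),
    "Vocoder": (75, 125),
    "Tail": (6, 22),
}

# Rough bounds, assuming purely sequential execution of all seven stages.
total_min_ms = sum(lo for lo, _ in STAGE_MS.values())
total_max_ms = sum(hi for _, hi in STAGE_MS.values())


def rtfx(num_samples: int, elapsed_s: float, sample_rate: int = SAMPLE_RATE) -> float:
    """Audio duration divided by wall-clock synthesis time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (num_samples / sample_rate) / elapsed_s
```

For example, 48,000 samples (2 s of audio) synthesized in 0.4 s gives an RTFx of about 5, which sits inside the 3-11× range quoted above.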
## Usage with FluidAudio
```bash
swift run fluidaudiocli tts "Hello world" \
--backend kokoro-lai \
--output hello.wav \
--metrics metrics.json
```
```swift
import FluidAudio
let manager = KokoroLaiManager()
try await manager.initialize()
let wav = try await manager.synthesize(text: "Hello world")
```
FluidAudio downloads this repo automatically into `~/.cache/fluidaudio/Models/kokoro-laishere/` on first use.
## Conversion
Built with [mobius/models/tts/kokoro/laishere-coreml](https://github.com/FluidInference/mobius/tree/main/models/tts/kokoro/laishere-coreml) (PyTorch 2.11 + coremltools 9.0). Reproduce:
```bash
cd mobius/models/tts/kokoro/laishere-coreml
uv sync
uv pip install --reinstall coremltools==9.0 # workaround sdist fallback
uv run python convert-coreml.py --output-dir build/laishere-kokoro
uv run python dump-benchmark-data.py --output-dir build/laishere-kokoro
mkdir -p build/laishere-kokoro-compiled   # make sure the output directory exists
for mlp in build/laishere-kokoro/Kokoro*.mlpackage; do
  xcrun coremlcompiler compile "$mlp" build/laishere-kokoro-compiled/
done
```
Parity vs PyTorch reference: waveform correlation ≥ 0.80, mel-spectrogram correlation ≥ 0.99 (verified by `compare-models.py`).
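For readers who want to run a similar check on their own outputs, a Pearson correlation between two waveforms can be sketched as follows. This is not the code in `compare-models.py`: how that script aligns, trims, or normalizes the signals is not described here, so comparing only the overlapping prefix is an assumption, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(a, b):
    """Pearson correlation of two sample sequences over their common length."""
    n = min(len(a), len(b))  # assumption: compare only the overlapping prefix
    if n < 2:
        raise ValueError("need at least two samples")
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_a * var_b)
```

Passing the CoreML waveform and the PyTorch reference waveform as lists of floats would yield a value to compare against the 0.80 threshold; the same function applies to flattened mel-spectrogram values.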
## Voices
This release ships only `af_heart` (American Female, "Heart"). Additional voices from [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) can be re-exported by editing `dump-benchmark-data.py`'s `VOICE` constant and copying the resulting `<voice>.bin` here.
## License
MIT, inherited from upstream:
- Model weights: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) (Apache 2.0)
- CoreML conversion code + 7-stage architecture: [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) (MIT, Lai Yongkang 2025)
- Repackaging: FluidInference (MIT)
See `LICENSE` for the upstream MIT text.
## Citation
```bibtex
@misc{kokoro-laishere-coreml,
title = {Kokoro 82M: 7-stage CoreML conversion for Apple Neural Engine},
author = {Lai, Yongkang and FluidInference},
year = {2025},
url = {https://huggingface.co/FluidInference/kokoro-laishere-coreml}
}
```