---
license: mit
language:
  - en
library_name: coreml
pipeline_tag: text-to-speech
tags:
  - coreml
  - tts
  - kokoro
  - apple-silicon
  - ane
  - on-device
---

# Kokoro 82M — laishere CoreML port (7-stage, ANE-optimized)

CoreML conversion of [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) split into a **7-stage chain** for Apple Neural Engine residency, originally produced by [@laishere](https://github.com/laishere/kokoro-coreml) (MIT). Repackaged here for use with [FluidAudio](https://github.com/FluidInference/FluidAudio).

## What's in this repo

Both `.mlpackage` (source) and `.mlmodelc` (compiled, runtime-ready) formats ship in this repo. An `.mlpackage` must be compiled before it can run, either ahead of time with `xcrun coremlcompiler compile` or on device with `MLModel.compileModel(at:)`; FluidAudio loads the `.mlmodelc` directly to skip that step.

| Stage | `.mlpackage` size | `.mlmodelc` size | Precision | Compute target |
|---|---|---|---|---|
| `KokoroAlbert` | 5.6 MB | 5.6 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroPostAlbert` | 13 MB | 13 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroAlignment` | 20 KB | 32 KB | fp16 + int8 palettization | CPU + ANE |
| `KokoroProsody` | 8.1 MB | 8.2 MB | fp32 | CPU + GPU |
| `KokoroNoise` | 4.4 MB | 4.5 MB | fp32 | CPU + GPU |
| `KokoroVocoder` | 47 MB | 47 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroTail` | 92 KB | 100 KB | fp32 (iSTFT) | CPU + GPU |

Plus auxiliary files:

| File | Description | Size |
|---|---|---|
| `vocab.json` | 114 IPA symbol → token ID mappings | 1.4 KB |
| `af_heart.bin` | flat fp32 `[510, 256]` voice pack | 512 KB |

Total: **~157 MB** with both formats, or ~78 MB if you keep only `.mlmodelc`. For comparison, the original PyTorch weights are ~330 MB.

## Pipeline

```
text → G2P (out-of-tree, e.g. FluidAudio's BART G2P)
     → IPA tokens [BOS, ..., EOS]   (max 512)
     → Albert        →  hidden states
     → PostAlbert    →  text features
     → Alignment     →  T_a frames (dynamic)
     → Prosody       →  pitch + duration
     → Noise         →  noise embeddings  (fp16→fp32 boundary)
     → Vocoder       →  x_pre features    (discard `anchor` output)
     → Tail (iSTFT)  →  24 kHz waveform
```
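
The token-framing step in the diagram can be sketched in Python as follows. This is an illustration only, not FluidAudio's implementation: it assumes `vocab.json` is a flat JSON object mapping each symbol to an integer ID, that one symbol is one character, that BOS and EOS share a hypothetical ID of `0`, that unknown symbols are dropped, and that the 512 cap includes BOS and EOS. None of these details are confirmed by this repo.

```python
import json

MAX_TOKENS = 512  # cap from the pipeline diagram
BOS_ID = 0        # assumption: hypothetical boundary-token ID
EOS_ID = 0        # assumption: same hypothetical ID reused for EOS


def load_vocab(path):
    """Read a symbol -> token-ID mapping (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def encode_ipa(ipa, vocab, max_tokens=MAX_TOKENS):
    """Map an IPA string to [BOS, ..., EOS], skipping symbols not in vocab."""
    body = [vocab[ch] for ch in ipa if ch in vocab]
    body = body[: max_tokens - 2]  # leave room for BOS and EOS
    return [BOS_ID] + body + [EOS_ID]
```

Truncating before adding the boundary tokens keeps the framed sequence within the 512-token limit regardless of input length.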

The voice pack is indexed by row: `row = clamp(T_enc - 1, 0, 509)`. Within a row, columns `[0:128]` hold the timbre vector and columns `[128:256]` hold `style_s`.
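
A minimal Python sketch of that lookup is shown here. It assumes the flat fp32 file is row-major and little-endian, which this README does not state, and `voice_row` is a hypothetical helper name rather than part of any shipped API.

```python
import struct

ROWS, COLS = 510, 256  # shape listed in the auxiliary-files table


def voice_row(pack: bytes, t_enc: int):
    """Return (timbre, style_s) for a given T_enc from the flat fp32 pack."""
    if len(pack) != ROWS * COLS * 4:
        raise ValueError("unexpected voice pack size")
    row = max(0, min(t_enc - 1, ROWS - 1))  # clamp(T_enc - 1, 0, 509)
    offset = row * COLS * 4                 # assumption: row-major layout
    # assumption: little-endian 32-bit floats
    values = struct.unpack_from(f"<{COLS}f", pack, offset)
    return list(values[:128]), list(values[128:])
```

Reading `af_heart.bin` in binary mode and passing its bytes together with the phrase's `T_enc` yields the two 128-value vectors.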

## Performance (Apple M2, 8-core)

| Stage | Steady-state latency |
|---|---|
| Albert | 7-10 ms |
| PostAlbert | 4-5 ms |
| Alignment | 1-2 ms |
| Prosody | 30-200 ms |
| Noise | 70-150 ms |
| Vocoder | 75-125 ms |
| Tail | 6-22 ms |

Cold model load (first run, `anecompilerservice` compilation): **~20 s**. Warm load: **~300 ms**. Steady-state RTFx (audio duration divided by synthesis time; higher is faster): **3-11×** depending on phrase length.
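
RTFx for a single run can be computed from the output sample count and the measured wall-clock time. The sketch below is illustrative: the helper names are hypothetical, and summing the per-stage ranges assumes each stage runs once per phrase, which this README does not state.

```python
SAMPLE_RATE = 24_000  # output sample rate from the pipeline diagram

# (low, high) steady-state latency in ms, copied from the table above
STAGE_MS = {
    "Albert": (7, 10),
    "PostAlbert": (4, 5),
    "Alignment": (1, 2),
    "Prosody": (30, 200),
    "Noise": (70, 150),
    "Vocoder": (75, 125),
    "Tail": (6, 22),
}


def rtfx(num_samples, synth_seconds, sample_rate=SAMPLE_RATE):
    """Seconds of audio produced per wall-clock second of synthesis."""
    if synth_seconds <= 0:
        raise ValueError("synth_seconds must be positive")
    return (num_samples / sample_rate) / synth_seconds


def stage_total_ms(stages=STAGE_MS):
    """Sum the low and high ends of every stage's latency range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high
```

For example, 48,000 samples (2 s of audio) synthesized in 0.4 s gives an RTFx of 5. Under the once-per-phrase assumption, the table's ranges sum to 193 to 514 ms across all seven stages.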

## Usage with FluidAudio

```bash
swift run fluidaudiocli tts "Hello world" \
    --backend kokoro-lai \
    --output hello.wav \
    --metrics metrics.json
```

```swift
import FluidAudio

let manager = KokoroLaiManager()
try await manager.initialize()
let wav = try await manager.synthesize(text: "Hello world")
```

FluidAudio downloads this repo automatically into `~/.cache/fluidaudio/Models/kokoro-laishere/` on first use.
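
To check whether a download completed, a small script can look for the expected assets in that directory. This sketch assumes the compiled stages sit directly in the cache directory as `<Stage>.mlmodelc` next to the two auxiliary files; the actual on-disk layout may differ, and `missing_assets` is a hypothetical helper.

```python
from pathlib import Path

STAGES = [
    "KokoroAlbert", "KokoroPostAlbert", "KokoroAlignment", "KokoroProsody",
    "KokoroNoise", "KokoroVocoder", "KokoroTail",
]
AUX_FILES = ["vocab.json", "af_heart.bin"]


def missing_assets(cache_dir="~/.cache/fluidaudio/Models/kokoro-laishere/"):
    """List expected assets that are absent (assumes a flat directory layout)."""
    root = Path(cache_dir).expanduser()
    expected = [f"{stage}.mlmodelc" for stage in STAGES] + AUX_FILES
    return [name for name in expected if not (root / name).exists()]
```

An empty list means all seven compiled stages and both auxiliary files were found under the assumed layout.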

## Conversion

Built with [mobius/models/tts/kokoro/laishere-coreml](https://github.com/FluidInference/mobius/tree/main/models/tts/kokoro/laishere-coreml) (PyTorch 2.11 + coremltools 9.0). Reproduce:

```bash
cd mobius/models/tts/kokoro/laishere-coreml
uv sync
uv pip install --reinstall coremltools==9.0    # workaround sdist fallback
uv run python convert-coreml.py --output-dir build/laishere-kokoro
uv run python dump-benchmark-data.py --output-dir build/laishere-kokoro
mkdir -p build/laishere-kokoro-compiled          # make sure the output directory exists
for mlp in build/laishere-kokoro/Kokoro*.mlpackage; do
    xcrun coremlcompiler compile "$mlp" build/laishere-kokoro-compiled/
done
```

Parity against the PyTorch reference: waveform correlation ≥ 0.80, mel-spectrogram correlation ≥ 0.99 (verified by `compare-models.py`).
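
This README does not say which correlation measure `compare-models.py` uses. Assuming it is the Pearson coefficient over time-aligned, equal-length sequences, a threshold check could look like the sketch below; `pearson` is a hypothetical helper, not code taken from that script.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return sum(x * y for x, y in zip(da, db)) / denom
```

For the waveform check, `a` and `b` would be the CoreML and PyTorch sample arrays; for the mel-spectrogram check, each spectrogram would first be flattened to a single sequence.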

## Voices

This release ships only `af_heart` (American Female, "Heart"). Additional voices from [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) can be re-exported by editing `dump-benchmark-data.py`'s `VOICE` constant and copying the resulting `<voice>.bin` here.

## License

MIT, inherited from the upstream conversion project. Components and their upstream licenses:
- Model weights: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) (Apache 2.0)
- CoreML conversion code + 7-stage architecture: [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) (MIT, Lai Yongkang 2025)
- Repackaging: FluidInference (MIT)

See `LICENSE` for the upstream MIT text.

## Citation

```bibtex
@misc{kokoro-laishere-coreml,
    title  = {Kokoro 82M — 7-stage CoreML conversion for Apple Neural Engine},
    author = {Lai, Yongkang and FluidInference},
    year   = {2025},
    url    = {https://huggingface.co/FluidInference/kokoro-laishere-coreml}
}
```