---
license: mit
language:
- en
library_name: coreml
pipeline_tag: text-to-speech
tags:
- coreml
- tts
- kokoro
- apple-silicon
- ane
- on-device
---

# Kokoro 82M: laishere CoreML port (7-stage, ANE-optimized)

CoreML conversion of [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), split into a **7-stage chain** for Apple Neural Engine (ANE) residency. The conversion was originally produced by [@laishere](https://github.com/laishere/kokoro-coreml) (MIT) and is repackaged here for use with [FluidAudio](https://github.com/FluidInference/FluidAudio).

## What's in this repo

Every stage ships in both `.mlpackage` (source) and `.mlmodelc` (compiled, runtime-ready) formats. Use the `.mlpackage` if you compile the models yourself (e.g. with `xcrun coremlcompiler` or `MLModel.compileModel(at:)`); FluidAudio loads the `.mlmodelc` directly to skip the first-run `.mlpackage` compile step.

| Stage | `.mlpackage` size | `.mlmodelc` size | Weight format | Compute target |
|---|---|---|---|---|
| `KokoroAlbert` | 5.6 MB | 5.6 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroPostAlbert` | 13 MB | 13 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroAlignment` | 20 KB | 32 KB | fp16 + int8 palettization | CPU + ANE |
| `KokoroProsody` | 8.1 MB | 8.2 MB | fp32 | CPU + GPU |
| `KokoroNoise` | 4.4 MB | 4.5 MB | fp32 | CPU + GPU |
| `KokoroVocoder` | 47 MB | 47 MB | fp16 + int8 palettization | CPU + ANE |
| `KokoroTail` | 92 KB | 100 KB | fp32 (iSTFT) | CPU + GPU |

Plus auxiliary files:

| File | Description | Size |
|---|---|---|
| `vocab.json` | 114 IPA symbols → token IDs | 1.4 KB |
| `af_heart.bin` | flat fp32 `[510, 256]` voice pack | 512 KB |

Total: **~157 MB** with both formats (~78 MB if you keep only `.mlmodelc`), versus ~330 MB for the original PyTorch weights.

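
If you load the compiled stages yourself rather than through FluidAudio, the compute targets in the stage table can be mapped onto `MLModelConfiguration.computeUnits`. The following is a minimal sketch, not FluidAudio's actual loader: the `KokoroStage` enum and `loadStage` helper are hypothetical names, and the `<Stage>.mlmodelc` file naming is assumed from the stage names in the table. `.cpuAndNeuralEngine` requires macOS 13 / iOS 16 or later.

```swift
import CoreML
import Foundation

/// Stage names from the table above (hypothetical enum, for illustration only).
enum KokoroStage: String, CaseIterable {
    case albert = "KokoroAlbert"
    case postAlbert = "KokoroPostAlbert"
    case alignment = "KokoroAlignment"
    case prosody = "KokoroProsody"
    case noise = "KokoroNoise"
    case vocoder = "KokoroVocoder"
    case tail = "KokoroTail"

    /// fp32 stages target CPU + GPU; the palettized fp16 stages target CPU + ANE.
    var computeUnits: MLComputeUnits {
        switch self {
        case .prosody, .noise, .tail:
            return .cpuAndGPU
        case .albert, .postAlbert, .alignment, .vocoder:
            return .cpuAndNeuralEngine
        }
    }
}

/// Loads one compiled stage from `directory`.
/// Assumption: the directory contains `<Stage>.mlmodelc` bundles.
func loadStage(_ stage: KokoroStage, from directory: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = stage.computeUnits
    let url = directory.appendingPathComponent("\(stage.rawValue).mlmodelc")
    return try MLModel(contentsOf: url, configuration: config)
}
```

Core ML treats `computeUnits` as a constraint on what it may use, not a guarantee, so the actual placement of each layer is still decided by the runtime.
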
## Pipeline

```
text → G2P (out-of-tree, e.g. FluidAudio's BART G2P)
→ IPA tokens [BOS, ..., EOS] (max 512)
→ Albert → hidden states
→ PostAlbert → text features
→ Alignment → T_a frames (dynamic)
→ Prosody → pitch + duration
→ Noise → noise embeddings (fp16→fp32 boundary)
→ Vocoder → x_pre features (discard `anchor` output)
→ Tail (iSTFT) → 24 kHz waveform
```

The voice pack is indexed by `row = clamp(T_enc - 1, 0, 509)`; columns `[0:128]` hold the timbre vector and columns `[128:256]` hold `style_s`.

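
The voice-pack indexing rule can be sketched in plain Swift as follows. This is an illustration under stated assumptions, not FluidAudio's implementation: `loadVoicePack` and `voiceVectors` are hypothetical helpers, the file is assumed to be row-major little-endian fp32 with no header, and `encodedLength` stands in for `T_enc`.

```swift
import Foundation

/// Voice-pack shape as described above: 510 rows x 256 columns of fp32.
let voiceRows = 510
let voiceCols = 256

/// Reads a flat fp32 file (e.g. `af_heart.bin`) into a `[Float]`.
/// Assumption: little-endian data with no header, which matches Apple silicon's native layout.
func loadVoicePack(at url: URL) throws -> [Float] {
    let data = try Data(contentsOf: url)
    var floats = [Float](repeating: 0, count: data.count / MemoryLayout<Float>.size)
    _ = floats.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return floats
}

/// Picks the row for a given `T_enc` and splits it into timbre and style halves.
/// Returns nil if `pack` does not have the expected 510 x 256 element count.
func voiceVectors(pack: [Float], encodedLength: Int) -> (timbre: [Float], style: [Float])? {
    guard pack.count == voiceRows * voiceCols else { return nil }
    // row = clamp(T_enc - 1, 0, 509)
    let row = min(max(encodedLength - 1, 0), voiceRows - 1)
    let start = row * voiceCols
    let timbre = Array(pack[start ..< start + 128])           // columns [0:128]
    let style = Array(pack[start + 128 ..< start + voiceCols]) // columns [128:256]
    return (timbre, style)
}
```

Clamping means any input longer than 510 reuses the last row instead of reading out of bounds.
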
## Performance (Apple M2, 8-core)

| Stage | Steady-state latency |
|---|---|
| Albert | 7-10 ms |
| PostAlbert | 4-5 ms |
| Alignment | 1-2 ms |
| Prosody | 30-200 ms |
| Noise | 70-150 ms |
| Vocoder | 75-125 ms |
| Tail | 6-22 ms |

Cold model load (first run, including `anecompilerservice` compilation): **~20 s**. Warm load: **~300 ms**. Steady-state RTFx: **3-11×**, depending on phrase length.

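
For reference, RTFx is conventionally the ratio of audio duration to synthesis time, so values above 1 mean faster than real time. A minimal sketch of that arithmetic, assuming the 24 kHz output rate from the pipeline above (the `rtfx` helper is a hypothetical name, and the exact timing boundaries behind the reported figures are not specified here):

```swift
import Foundation

/// Output sample rate of the Tail stage, per the pipeline description.
let sampleRate = 24_000.0

/// RTFx = seconds of audio produced / seconds of wall-clock time spent producing it.
func rtfx(sampleCount: Int, elapsedSeconds: Double) -> Double {
    precondition(elapsedSeconds > 0, "elapsed time must be positive")
    let audioSeconds = Double(sampleCount) / sampleRate
    return audioSeconds / elapsedSeconds
}

// Example: 48 000 samples (2 s of audio) synthesized in 0.5 s gives an RTFx of 4.0.
let exampleRTFx = rtfx(sampleCount: 48_000, elapsedSeconds: 0.5)
print(exampleRTFx)
```
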
## Usage with FluidAudio

```bash
swift run fluidaudiocli tts "Hello world" \
  --backend kokoro-lai \
  --output hello.wav \
  --metrics metrics.json
```

```swift
import FluidAudio

let manager = KokoroLaiManager()
try await manager.initialize()
let wav = try await manager.synthesize(text: "Hello world")
```

FluidAudio downloads this repo automatically into `~/.cache/fluidaudio/Models/kokoro-laishere/` on first use.

## Conversion

Built with [mobius/models/tts/kokoro/laishere-coreml](https://github.com/FluidInference/mobius/tree/main/models/tts/kokoro/laishere-coreml) (PyTorch 2.11 + coremltools 9.0). To reproduce:

```bash
cd mobius/models/tts/kokoro/laishere-coreml
uv sync
uv pip install --reinstall coremltools==9.0 # workaround for the sdist fallback
uv run python convert-coreml.py --output-dir build/laishere-kokoro
uv run python dump-benchmark-data.py --output-dir build/laishere-kokoro
# Compile every stage's .mlpackage into a runtime-ready .mlmodelc
for mlp in build/laishere-kokoro/Kokoro*.mlpackage; do
  xcrun coremlcompiler compile "$mlp" build/laishere-kokoro-compiled/
done
```

Parity with the PyTorch reference: waveform correlation ≥ 0.80, mel-spectrogram correlation ≥ 0.99 (verified by `compare-models.py`).

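
How `compare-models.py` computes its correlation is not described here. As a rough illustration only, the sketch below assumes a plain Pearson correlation over two time-aligned, equal-length signals; the `pearson` helper is a hypothetical name and is not taken from the conversion scripts.

```swift
import Foundation

/// Pearson correlation of two equal-length signals.
/// Returns nil when the inputs differ in length, have fewer than two samples,
/// or either signal is constant (zero variance).
func pearson(_ a: [Float], _ b: [Float]) -> Double? {
    guard a.count == b.count, a.count > 1 else { return nil }
    let n = Double(a.count)
    let meanA = a.reduce(0.0) { $0 + Double($1) } / n
    let meanB = b.reduce(0.0) { $0 + Double($1) } / n

    var covariance = 0.0, varianceA = 0.0, varianceB = 0.0
    for (x, y) in zip(a, b) {
        let dx = Double(x) - meanA
        let dy = Double(y) - meanB
        covariance += dx * dy
        varianceA += dx * dx
        varianceB += dy * dy
    }
    guard varianceA > 0, varianceB > 0 else { return nil }
    return covariance / (varianceA * varianceB).squareRoot()
}
```

Sample-level correlation drops quickly under small phase or timing shifts, which is one plausible reason for a waveform threshold (0.80) well below the mel-spectrogram threshold (0.99).
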
## Voices

This release ships only `af_heart` (American Female, "Heart"). Additional voices from [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) can be re-exported by editing the `VOICE` constant in `dump-benchmark-data.py` and copying the resulting `<voice>.bin` here.

## License

MIT for the conversion code and repackaging. Upstream components and their licenses:

- Model weights: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) (Apache 2.0)
- CoreML conversion code + 7-stage architecture: [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) (MIT, Lai Yongkang 2025)
- Repackaging: FluidInference (MIT)

See `LICENSE` for the upstream MIT text.

## Citation

```bibtex
@misc{kokoro-laishere-coreml,
  title = {Kokoro 82M: 7-stage CoreML conversion for Apple Neural Engine},
  author = {Lai, Yongkang and FluidInference},
  year = {2025},
  url = {https://huggingface.co/FluidInference/kokoro-laishere-coreml}
}
```