Kokoro 82M TTS -- Surgically Optimized for Apple Silicon
30 seconds of speech in 379 ms on a Mac Studio. 2x faster than MLX on the same hardware. Running on the Apple Neural Engine.
Kokoro-82M compiled to Core ML and cut into five models, each running on the processor that's best at its job. On-device, offline, no API keys, no cents-per-character. This repo is the pre-converted .mlpackage files; you load them with a Swift MLModel(contentsOf:) call.
Source, exporters, Swift runtime: github.com/mattmireles/kokoro-coreml
The numbers
Median warm wall time for one full synthesize call -- tokenize in, 24 kHz PCM out. Measured June 2026, counterbalanced harness in the GitHub repo.
| Audio | M2 Studio (64 GB) | M2 Air (24 GB) | M1 Mini (16 GB) |
|---|---|---|---|
| 3s | 51 ms | 148 ms | 234 ms |
| 10s | 126 ms | 466 ms | 686 ms |
| 30s | 379 ms | 1,405 ms | 1,959 ms |
That's 12-79x realtime across the lineup. The M2 Studio synthesizes 30 seconds of audio in 379 ms, but the M1 Mini is the number that matters -- the cheapest Apple Silicon Mac you can buy turns text into speech 14x faster than you can listen to it.
vs MLX
Same machines, same utterances, same voice (af_heart), same timing boundary, median of warm calls. Comparator: Blaizzy/mlx-audio 0.4.3 at commit 862dfbe, running mlx-community/Kokoro-82M-bf16.
| Audio | M2 Studio | M2 Air | M1 Mini |
|---|---|---|---|
| 3s | 51 ms vs error | 148 ms vs error | 234 ms vs error |
| 7s | 96 vs 224 ms -- 2.3x | 331 vs 686 ms -- 2.1x | 493 vs 824 ms -- 1.7x |
| 10s | 126 vs 289 ms -- 2.3x | 466 vs 836 ms -- 1.8x | 686 vs 1,124 ms -- 1.6x |
| 30s | 379 vs 763 ms -- 2.0x | 1,405 vs 2,600 ms -- 1.9x | 1,959 vs 3,078 ms -- 1.6x |
Faster on every bucket, on every machine. The gap is widest on the newest silicon -- the Neural Engine keeps scaling while a GPU-bound port doesn't. (The pinned MLX version fails 3-second clips with a broadcast-shape error; no time to report.)
This is not a knock on MLX -- it's a fine framework. It's the surgery. A monolithic port runs wherever the scheduler drops it. A dissected pipeline runs each stage where it belongs.
On iPhone
Same .mlpackage files, deployed to phones (June 2026, iOS 26.5, median of 5 warm calls). Comparator: mlalma/kokoro-ios 1.0.8, the MLX Swift port of Kokoro -- the Python mlx-audio above doesn't run on iOS. Its timing includes its Misaki G2P pass; ours starts from token IDs.
| Audio | iPhone 15 Pro Max | iPhone 12 Pro |
|---|---|---|
| 3s | 702 vs 919 ms -- 1.3x | 1,383 vs 1,624 ms -- 1.2x |
| 7s | 1,492 vs 1,875 ms -- 1.3x | 2,966 vs 2,405 ms -- 0.8x |
| 15s | 3,272 vs 3,805 ms -- 1.2x | 6,250 vs 5,022 ms -- 0.8x |
| 30s | 6,374 vs 7,792 ms -- 1.2x | 12,301 ms vs OOM |
Faster on every bucket on the A17 Pro (4-4.5x realtime). On the 4 GB iPhone 12 Pro it's split: MLX takes the middle buckets, but the memory watchdog kills it on 30-second clips -- this pipeline synthesizes them in 12.3 s. One disclosure: the iPhone ANE compiler (A14 and A17 Pro) rejects the full-ANE plan that every M-series Mac runs (ANECCompile() FAILED), so iPhone rows use the staged policy -- decoder-pre on the ANE, the other stages on CPU+GPU.
Cold start takes a few seconds (Core ML compiles on first load); everything after is steady-state. Benchmarks drift with macOS and hardware -- rerun them on your target machine with the harness in the GitHub repo before you ship a claim of your own.
Why surgery?
Apple Silicon isn't one processor. It's three -- CPU, GPU, and the Neural Engine (ANE) -- each built for different work. The ANE devours fixed-shape convolutions at a fraction of the GPU's power draw. But it has rules: no dynamic shapes, no data-dependent control flow. Shove a whole TTS model through Core ML and the scheduler quietly dumps you on the CPU.
So we cut the pipeline at the joints:
ββββββββββββββββββββββββββββββββββ
"Hello world" βββΆ β DURATION (kokoro_duration_t*) β βββ CPU/GPU
β BERT + LSTMs β branching, variable lengths
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
ββββββββββββββββββββββββββββββββββ
β ALIGNMENT (Swift) β βββ CPU
β Matrix from durations, ~50 LoC β small, data-dependent logic
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
ββββββββββββββββββββββββββββββββββ
β F0 / NOISE (kokoro_f0ntrain) β βββ ANE
β Pitch + aperiodicity contours β fixed-shape dense math
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
ββββββββββββββββββββββββββββββββββ
β DECODER PRE (kokoro_decoder_pre)β βββ ANE
β Text features β decoder state β fixed-shape convolutions
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
ββββββββββββββββββββββββββββββββββ
β HARMONIC SOURCE (Swift/vDSP) β βββ CPU
β hn-NSF sine + noise excitation β cheap DSP, exact phase
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
ββββββββββββββββββββββββββββββββββ
β GENERATOR (kokoro_decoder_ β βββ ANE
β har_post) convs + iSTFT β dense parallel tensor math
ββββββββββββββββ¬ββββββββββββββββββ
βΌ
24 kHz Audio
Four models on the ANE, one DSP stage in Swift with double-precision phase accumulation. The generator has zero nn.Linear ops -- all 48 replaced with Conv1d(kernel_size=1) so the MIL graph stays on the ANE path.
Redesign the inference pipeline, not the model. That's where the 2x over MLX comes from -- not by fighting the GPU, but by routing around it.
What's in the download
Five fixed-duration buckets: 3s, 7s, 10s, 15s, 30s. Pick the smallest bucket that fits your predicted utterance. That's the whole strategy.
| File | What it does | Runs on |
|---|---|---|
kokoro_duration_t{32,64,128,256,320,384,512}.mlpackage |
Phoneme durations + text/style encodings, one per padded token length | CPU/GPU |
kokoro_duration.mlpackage |
Legacy single duration model (fallback) | CPU/GPU |
kokoro_f0ntrain_t{120,280,400,600,1200}.mlpackage |
Pitch + noise prediction, one per bucket's frame count | ANE |
kokoro_decoder_pre_{3,7,10,15,30}s.mlpackage |
Text features β decoder hidden state | ANE |
kokoro_decoder_har_post_{3,7,10,15,30}s.mlpackage |
Generator: harmonic-excited convolutions + iSTFT β waveform | ANE |
The alignment matrix and the hn-NSF harmonic source are not models -- they're a few hundred lines of Swift/vDSP in the GitHub repo's KokoroPipeline.
Usage (Swift)
import KokoroPipeline // from the GitHub repo
let pipeline = try KokoroPipeline(modelsDirectory: modelsURL)
let result = try pipeline.synthesize(text: "Hello world", voice: "af_heart")
// result.waveform: 24 kHz mono PCM, trimmed to actual speech length
Or load the packages raw and orchestrate yourself -- the full glue (tokenizer, alignment builder, harmonic source, PCM playback) is in the GitHub repo. It's small. You can read it in an afternoon.
Tensor shapes (3s bucket)
kokoro_duration_t128:
in input_ids [1, 128] int32 phoneme token IDs (padded)
in attention_mask [1, 128] float16
in ref_s [1, 256] float16 voice embedding
in speed [1] float16
out pred_dur [1, 128] per-token frame counts
out t_en, d, s, ref_s_out encodings for downstream stages
kokoro_f0ntrain_t120:
in en [1, 640, 120] out F0_pred [1, 240], N_pred [1, 240]
kokoro_decoder_pre_3s:
in asr [1, 512, 120] f0 [1, 1, 240] n_input [1, 1, 240] ref_s [1, 256]
out x_pre [1, 512, 240]
kokoro_decoder_har_post_3s:
in x_pre [1, 512, 240] ref_s [1, 256] har [1, 22, 28801]
out waveform [1, 1, 72000] -- 3s @ 24 kHz
Everything is static and float16. No dynamic ops. No RangeDim. No non_zero kernels.
Requirements
- iOS 16+ / macOS 13+ (MLProgram + modern Core ML runtime)
- Apple Silicon (M1+) or A15+ for Neural Engine acceleration
- Runs on older chips too, just slower
License
Apache 2.0, inherited from Kokoro-82M. Ship it. Sell it. Fork it.
Credits
- @hexgrad -- Kokoro-82M weights, training, and the Apache release
- @yl4579 -- StyleTTS 2 architecture
- Apple's coremltools team -- for maintaining the PyTorch-to-Core ML path
Kokoro (εΏ) -- Japanese for "heart."
- Downloads last month
- 1,069