Kokoro 82M TTS -- Surgically Optimized for Apple Silicon

30 seconds of speech in 379 ms on a Mac Studio. 2x faster than MLX on the same hardware. Running on the Apple Neural Engine.

Kokoro-82M compiled to Core ML and cut into five models, each running on the processor that's best at its job. On-device, offline, no API keys, no cents-per-character. This repo is the pre-converted .mlpackage files; you load them with a Swift MLModel(contentsOf:) call.

Source, exporters, Swift runtime: github.com/mattmireles/kokoro-coreml

The numbers

Median warm wall time for one full synthesize call -- tokenize in, 24 kHz PCM out. Measured June 2026, counterbalanced harness in the GitHub repo.

Audio M2 Studio (64 GB) M2 Air (24 GB) M1 Mini (16 GB)
3s 51 ms 148 ms 234 ms
10s 126 ms 466 ms 686 ms
30s 379 ms 1,405 ms 1,959 ms

That's 12-79x realtime across the lineup. The M2 Studio synthesizes 30 seconds of audio in 379 ms, but the M1 Mini is the number that matters -- the cheapest Apple Silicon Mac you can buy turns text into speech 14x faster than you can listen to it.

vs MLX

Same machines, same utterances, same voice (af_heart), same timing boundary, median of warm calls. Comparator: Blaizzy/mlx-audio 0.4.3 at commit 862dfbe, running mlx-community/Kokoro-82M-bf16.

Audio M2 Studio M2 Air M1 Mini
3s 51 ms vs error 148 ms vs error 234 ms vs error
7s 96 vs 224 ms -- 2.3x 331 vs 686 ms -- 2.1x 493 vs 824 ms -- 1.7x
10s 126 vs 289 ms -- 2.3x 466 vs 836 ms -- 1.8x 686 vs 1,124 ms -- 1.6x
30s 379 vs 763 ms -- 2.0x 1,405 vs 2,600 ms -- 1.9x 1,959 vs 3,078 ms -- 1.6x

Faster on every bucket, on every machine. The gap is widest on the newest silicon -- the Neural Engine keeps scaling while a GPU-bound port doesn't. (The pinned MLX version fails 3-second clips with a broadcast-shape error; no time to report.)

This is not a knock on MLX -- it's a fine framework. It's the surgery. A monolithic port runs wherever the scheduler drops it. A dissected pipeline runs each stage where it belongs.

On iPhone

Same .mlpackage files, deployed to phones (June 2026, iOS 26.5, median of 5 warm calls). Comparator: mlalma/kokoro-ios 1.0.8, the MLX Swift port of Kokoro -- the Python mlx-audio above doesn't run on iOS. Its timing includes its Misaki G2P pass; ours starts from token IDs.

Audio iPhone 15 Pro Max iPhone 12 Pro
3s 702 vs 919 ms -- 1.3x 1,383 vs 1,624 ms -- 1.2x
7s 1,492 vs 1,875 ms -- 1.3x 2,966 vs 2,405 ms -- 0.8x
15s 3,272 vs 3,805 ms -- 1.2x 6,250 vs 5,022 ms -- 0.8x
30s 6,374 vs 7,792 ms -- 1.2x 12,301 ms vs OOM

Faster on every bucket on the A17 Pro (4-4.5x realtime). On the 4 GB iPhone 12 Pro it's split: MLX takes the middle buckets, but the memory watchdog kills it on 30-second clips -- this pipeline synthesizes them in 12.3 s. One disclosure: the iPhone ANE compiler (A14 and A17 Pro) rejects the full-ANE plan that every M-series Mac runs (ANECCompile() FAILED), so iPhone rows use the staged policy -- decoder-pre on the ANE, the other stages on CPU+GPU.

Cold start takes a few seconds (Core ML compiles on first load); everything after is steady-state. Benchmarks drift with macOS and hardware -- rerun them on your target machine with the harness in the GitHub repo before you ship a claim of your own.

Why surgery?

Apple Silicon isn't one processor. It's three -- CPU, GPU, and the Neural Engine (ANE) -- each built for different work. The ANE devours fixed-shape convolutions at a fraction of the GPU's power draw. But it has rules: no dynamic shapes, no data-dependent control flow. Shove a whole TTS model through Core ML and the scheduler quietly dumps you on the CPU.

So we cut the pipeline at the joints:

                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
"Hello world" ──▢ β”‚  DURATION  (kokoro_duration_t*) β”‚ ◀── CPU/GPU
                  β”‚  BERT + LSTMs                   β”‚     branching, variable lengths
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  ALIGNMENT  (Swift)             β”‚ ◀── CPU
                  β”‚  Matrix from durations, ~50 LoC β”‚     small, data-dependent logic
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  F0 / NOISE  (kokoro_f0ntrain)  β”‚ ◀── ANE
                  β”‚  Pitch + aperiodicity contours  β”‚     fixed-shape dense math
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  DECODER PRE (kokoro_decoder_pre)β”‚ ◀── ANE
                  β”‚  Text features β†’ decoder state  β”‚     fixed-shape convolutions
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  HARMONIC SOURCE  (Swift/vDSP)  β”‚ ◀── CPU
                  β”‚  hn-NSF sine + noise excitation β”‚     cheap DSP, exact phase
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                  β”‚  GENERATOR (kokoro_decoder_     β”‚ ◀── ANE
                  β”‚  har_post) convs + iSTFT        β”‚     dense parallel tensor math
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β–Ό
                            24 kHz Audio

Four models on the ANE, one DSP stage in Swift with double-precision phase accumulation. The generator has zero nn.Linear ops -- all 48 replaced with Conv1d(kernel_size=1) so the MIL graph stays on the ANE path.

Redesign the inference pipeline, not the model. That's where the 2x over MLX comes from -- not by fighting the GPU, but by routing around it.

What's in the download

Five fixed-duration buckets: 3s, 7s, 10s, 15s, 30s. Pick the smallest bucket that fits your predicted utterance. That's the whole strategy.

File What it does Runs on
kokoro_duration_t{32,64,128,256,320,384,512}.mlpackage Phoneme durations + text/style encodings, one per padded token length CPU/GPU
kokoro_duration.mlpackage Legacy single duration model (fallback) CPU/GPU
kokoro_f0ntrain_t{120,280,400,600,1200}.mlpackage Pitch + noise prediction, one per bucket's frame count ANE
kokoro_decoder_pre_{3,7,10,15,30}s.mlpackage Text features β†’ decoder hidden state ANE
kokoro_decoder_har_post_{3,7,10,15,30}s.mlpackage Generator: harmonic-excited convolutions + iSTFT β†’ waveform ANE

The alignment matrix and the hn-NSF harmonic source are not models -- they're a few hundred lines of Swift/vDSP in the GitHub repo's KokoroPipeline.

Usage (Swift)

import KokoroPipeline  // from the GitHub repo

let pipeline = try KokoroPipeline(modelsDirectory: modelsURL)
let result = try pipeline.synthesize(text: "Hello world", voice: "af_heart")
// result.waveform: 24 kHz mono PCM, trimmed to actual speech length

Or load the packages raw and orchestrate yourself -- the full glue (tokenizer, alignment builder, harmonic source, PCM playback) is in the GitHub repo. It's small. You can read it in an afternoon.

Tensor shapes (3s bucket)

kokoro_duration_t128:
  in   input_ids       [1, 128]        int32     phoneme token IDs (padded)
  in   attention_mask  [1, 128]        float16
  in   ref_s           [1, 256]        float16   voice embedding
  in   speed           [1]             float16
  out  pred_dur        [1, 128]                  per-token frame counts
  out  t_en, d, s, ref_s_out                     encodings for downstream stages

kokoro_f0ntrain_t120:
  in   en   [1, 640, 120]   out  F0_pred [1, 240], N_pred [1, 240]

kokoro_decoder_pre_3s:
  in   asr [1, 512, 120]  f0 [1, 1, 240]  n_input [1, 1, 240]  ref_s [1, 256]
  out  x_pre [1, 512, 240]

kokoro_decoder_har_post_3s:
  in   x_pre [1, 512, 240]  ref_s [1, 256]  har [1, 22, 28801]
  out  waveform [1, 1, 72000]   -- 3s @ 24 kHz

Everything is static and float16. No dynamic ops. No RangeDim. No non_zero kernels.

Requirements

  • iOS 16+ / macOS 13+ (MLProgram + modern Core ML runtime)
  • Apple Silicon (M1+) or A15+ for Neural Engine acceleration
  • Runs on older chips too, just slower

License

Apache 2.0, inherited from Kokoro-82M. Ship it. Sell it. Fork it.

Credits

  • @hexgrad -- Kokoro-82M weights, training, and the Apache release
  • @yl4579 -- StyleTTS 2 architecture
  • Apple's coremltools team -- for maintaining the PyTorch-to-Core ML path

Kokoro (εΏƒ) -- Japanese for "heart."

Downloads last month
1,069
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mattmireles/kokoro-coreml

Quantized
(47)
this model