license: apache-2.0
language:
- en
library_name: coreai
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- core-ai
- coreml
- on-device
- styletts2
- kokoro
base_model: hexgrad/Kokoro-82M
Kokoro-82M — Core AI
hexgrad/Kokoro-82M (Apache-2.0), a tiny
high-quality StyleTTS2 + iSTFTNet text-to-speech model (82M params, 24 kHz),
converted to Apple Core AI (.aimodel, iOS 27 / macOS 27) — the
CoreAI-Model-Zoo's first TTS.
Non-autoregressive: phonemes + a voice/style vector → a waveform in one pass. Runs fully on-device, English-first, with grapheme→phoneme on the host.
Use it
▶️ Run it (source) — the Speak runner (GUI + CLI, one app for every text-to-speech model in the catalog):
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/Speak/Speak.xcodeproj
# → Run, then pick "Kokoro 82M" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/Speak
swift run speak-cli --model kokoro-82m --text "Hello from Core AI." --output hello.wav
💻 Build with it: a complete example. The glue is kit API, so it runs as pasted:
import CoreAIKit

let text = "Hello from Core AI."  // any input string
let speaker = try await KitSpeaker(catalog: "kokoro-82m")
let audio = try await speaker.synthesize(text)
// audio.samples: 24 kHz mono PCM in [-1, 1]; play it or write a WAV
The take-home is Examples/Speak/Sources/QuickStart.swift
— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
the GUI drives the same KitSpeaker(catalog:) and plays the samples.
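For readers who dump the samples and post-process them off-device, here is a minimal sketch of the "write a WAV" step in Python, the language of the conversion tooling. `write_wav` is a hypothetical helper, not part of CoreAIKit. It assumes the samples arrive as a plain sequence of floats in [-1, 1] at 24 kHz, as the card describes.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # output rate stated in the card


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip anything outside [-1, 1]
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 2 bytes = 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(bytes(frames))
```

Scaling by 32767 rather than 32768 keeps +1.0 and -1.0 symmetric and avoids overflow at full scale.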
English-first: G2P is a dictionary lookup over the bundled misaki lexicons (~180k words);
out-of-dictionary words are spelled letter by letter (there is no neural fallback). The
download includes 28 voices; af_heart is the default, and the underlying KokoroTTS takes a
voice: label to pick another. For streaming, synthesizeStreaming(_:onChunk:) hands you one chunk per sentence.
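The dictionary-plus-letter-spelling behaviour and the per-sentence chunking can be sketched in a few lines of Python. Everything here is illustrative: the three-word lexicon, the letter table, and the phoneme strings are hypothetical stand-ins rather than entries copied from misaki, and the regex sentence splitter is an assumption, since the card does not describe how the kit splits text.

```python
import re

# Hypothetical toy lexicon; the bundled misaki lexicons hold ~180k words.
LEXICON = {"hello": "həlˈoʊ", "from": "fɹʌm", "core": "kˈɔɹ"}
# Hypothetical letter names used when a word is not in the lexicon.
LETTERS = {"a": "ˈeɪ", "i": "ˈaɪ"}


def g2p(text, lexicon=LEXICON, letters=LETTERS):
    """Look each word up; spell unknown words letter by letter."""
    out = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in lexicon:
            out.append(lexicon[word])
        else:
            # No neural fallback: fall back to one entry per letter.
            out.append(" ".join(letters.get(ch, ch) for ch in word))
    return out


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (assumed splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(g2p("Hello from Core AI."))
print(split_sentences("Hi there. How are you? Fine!"))
```

With this toy data, "AI" is not in the lexicon, so it comes out as the two letter names.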
Integration checklist
- SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
- Info.plist: none needed
- Entitlements: none needed
- First run downloads the model (0.3 GB on Mac); after that it loads from the local cache (Application Support; progress via the downloadProgress callback)
- Measure in Release: Debug is ~3× slower on per-token host work
Bundles
The acoustic graph has one data-dependent length (the duration→alignment expansion),
so it is cut into three voice-independent .aimodel bundles with two cheap host
steps between them:
| file | in → out |
|---|---|
| kokoro_predictor.aimodel | input_ids[1,128] i32, ref_s[1,256], attn_mask[1,128] → duration, d, t_en |
| kokoro_prosody.aimodel | d, t_en, aln[1,128,512], ref_s, frame_mask[1,512] → asr, F0, N |
| kokoro_vocoder.aimodel | asr, F0, N, har, ref_s, frame_mask → audio[1, L·600] |
voices/*.pt — the 28 English voice packs (Apache-2.0). The voice is the ref_s
input: ref_s = pack[len(ids)−1]. Quality leaders: af_heart, af_bella,
af_nicole, bf_emma.
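The voice lookup `ref_s = pack[len(ids)−1]` can be sketched as below. This assumes a pack behaves like a sequence of 256-value style rows indexed by token count; the real packs are `.pt` tensors and their exact shape is not stated here, so the toy `pack` and the helper name `select_ref_s` are hypothetical.

```python
def select_ref_s(pack, ids):
    """Pick the style row for an utterance: pack[len(ids) - 1]."""
    if not ids:
        raise ValueError("empty token sequence")
    index = len(ids) - 1
    if index >= len(pack):
        raise ValueError("utterance is longer than the voice pack covers")
    return pack[index]


# Toy pack: 4 rows of 256 values, row n filled with float(n).
pack = [[float(n)] * 256 for n in range(4)]
ref_s = select_ref_s(pack, ids=[5, 6, 7])  # 3 tokens, so row index 2
```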
Token length T and frame length L are fixed buckets (128 / 512); the host left-pads to the bucket and trims the output. Longer text is split into sentences host-side. Run on the Core AI CPU compute unit. ~0.75 s / utterance on M4 Max, ~335 MB total (fp32).
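The bucket sizes imply a hard cap on utterance length, which is why longer text has to be split. A small sketch of the arithmetic follows, assuming the predicted durations are integer frame counts per token (the card does not spell this out) and using hypothetical helper names.

```python
SAMPLE_RATE = 24_000       # Hz, from the card
SAMPLES_PER_FRAME = 600    # from the vocoder output shape audio[1, L·600]
TOKEN_BUCKET = 128
FRAME_BUCKET = 512


def bucket_seconds(frame_bucket=FRAME_BUCKET):
    """Longest audio one bucket can hold, in seconds."""
    return frame_bucket * SAMPLES_PER_FRAME / SAMPLE_RATE


def fits(durations, token_bucket=TOKEN_BUCKET, frame_bucket=FRAME_BUCKET):
    """True if the utterance fits both the token and the frame bucket."""
    return len(durations) <= token_bucket and sum(durations) <= frame_bucket


def valid_samples(durations):
    """Number of real (non-padding) samples to keep after trimming."""
    return sum(durations) * SAMPLES_PER_FRAME
```

At 600 samples per frame and 24 kHz, one frame covers 25 ms, so the default 512-frame bucket holds at most 12.8 s of audio per pass.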
Host steps
text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody
──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio
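The `[build alignment]` step in the diagram above expands per-token durations into a token × frame matrix. A minimal sketch, assuming integer frame durations and tokens occupying rows from index 0 (where padding actually sits in the bucket is not covered here); the function name is hypothetical.

```python
def build_alignment(durations, n_tokens=128, n_frames=512):
    """Expand durations into a one-hot [n_tokens][n_frames] alignment.

    Token i owns the frames from sum(durations[:i]) up to, but not
    including, sum(durations[:i + 1]). Returns the matrix and the
    number of frames actually used.
    """
    if len(durations) > n_tokens:
        raise ValueError("more tokens than the token bucket holds")
    if sum(durations) > n_frames:
        raise ValueError("more frames than the frame bucket holds")
    aln = [[0.0] * n_frames for _ in range(n_tokens)]
    frame = 0
    for token, dur in enumerate(durations):
        for f in range(frame, frame + dur):
            aln[token][f] = 1.0
        frame += dur
    return aln, frame


aln, used = build_alignment([2, 1, 3], n_tokens=4, n_frames=8)
```

Every used frame belongs to exactly one token, so each of the first `used` columns sums to 1 and the remaining columns are all zero. This is the data-dependent piece that keeps the three bundles at fixed shapes.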
G2P is misaki (misaki[en], no espeak for
English); on-device MisakiSwift gives the same
English phonemes. har (the hn-nsf source's STFT) is a windowed FFT computed on the
host — the one piece that must stay off the engine (its atan2 phase flips 2π at the
F0→0 pad boundary under fp32).
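As a general illustration of why an atan2 phase is fragile near ±π (not a reproduction of the engine behaviour described above), the sketch below evaluates the phase of two complex values that differ only by a tiny imaginary part of opposite sign. The 1e-7 magnitude is an arbitrary choice on the order of fp32 rounding error.

```python
import math


def phase(re, im):
    """Phase angle of re + j*im, in (-pi, pi]."""
    return math.atan2(im, re)


# Two spectra bins that agree to ~1e-7 but sit on opposite sides
# of the negative real axis, where atan2 has its branch cut.
above = phase(-1.0, +1e-7)   # just under +pi
below = phase(-1.0, -1e-7)   # just over -pi
jump = above - below         # close to 2*pi
```

A rounding-level change in the input therefore moves the phase by almost a full turn, which is the kind of discontinuity that is safer to compute in one controlled place.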
Quality
The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the gate is
spectral: magnitude-spectrogram correlation 0.999 vs the PyTorch reference
(af_heart, multiple sentences). Raw waveform correlation ~0.98 — the bounded,
inaudible effect of the bucket pad boundary.
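To see why a spectral gate tolerates an arbitrary source phase while a waveform comparison does not, here is a toy sketch: two sinusoids at the same frequency, a quarter cycle apart. It uses a single naive DFT frame and Pearson correlation as stand-ins; the actual gate compares magnitude spectrograms, and its exact STFT settings are not given here.

```python
import cmath
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def dft_magnitude(x):
    """Magnitude of a naive O(n^2) DFT; fine for a toy frame."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


N, K = 64, 4  # frame length and frequency bin of the test tone
ref = [math.sin(2 * math.pi * K * t / N) for t in range(N)]
shifted = [math.sin(2 * math.pi * K * t / N + math.pi / 2) for t in range(N)]

wave_corr = pearson(ref, shifted)                                 # near 0
spec_corr = pearson(dft_magnitude(ref), dft_magnitude(shifted))   # near 1
```

The two signals sound identical, yet their sample-by-sample correlation collapses while their magnitude spectra match.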
Convert / re-bucket
The conversion script is conversion/export_kokoro.py. Run
python export_kokoro.py --out-dir out; --verify runs the engine-vs-torch spectral
gate, and --token-bucket / --frame-bucket re-size the buckets. Card + the full port write-up:
zoo/kokoro-82m.md.
License
Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives
from Apple's BSD-3-Clause coreai_models.