Kokoro-82M — Core AI

hexgrad/Kokoro-82M (Apache-2.0), a tiny high-quality StyleTTS2 + iSTFTNet text-to-speech model (82M params, 24 kHz), converted to Apple Core AI (.aimodel, iOS 27 / macOS 27) — the CoreAI-Model-Zoo's first TTS.

Non-autoregressive: phonemes + a voice/style vector → a waveform in one pass. Runs fully on-device, English-first, with grapheme→phoneme on the host.

Bundles

The acoustic graph has one data-dependent length (the duration→alignment expansion), so it is cut into three voice-independent .aimodel bundles with two cheap host steps between them:

file	in → out
`kokoro_predictor.aimodel`	`input_ids[1,128]` i32, `ref_s[1,256]`, `attn_mask[1,128]` → `duration`, `d`, `t_en`
`kokoro_prosody.aimodel`	`d`, `t_en`, `aln[1,128,512]`, `ref_s`, `frame_mask[1,512]` → `asr`, `F0`, `N`
`kokoro_vocoder.aimodel`	`asr`, `F0`, `N`, `har`, `ref_s`, `frame_mask` → `audio[1, L·600]`

voices/*.pt — the 28 English voice packs (Apache-2.0). The voice is the ref_s input: ref_s = pack[len(ids)−1]. Quality leaders: af_heart, af_bella, af_nicole, bf_emma.
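
The lookup above can be sketched as follows. This is a minimal illustration, not the repository's loader: it assumes a voice pack is an array indexed by token count with shape (N, 1, 256), uses a synthetic array in place of a loaded `voices/*.pt` tensor, and the helper name `select_ref_s` and the value N = 510 are hypothetical.

```python
import numpy as np

def select_ref_s(pack: np.ndarray, num_ids: int) -> np.ndarray:
    """Return the [1, 256] style vector for an utterance of num_ids tokens."""
    if not 1 <= num_ids <= pack.shape[0]:
        raise ValueError(f"num_ids must be in 1..{pack.shape[0]}, got {num_ids}")
    # ref_s = pack[len(ids) - 1], reshaped to the predictor's ref_s[1,256] input.
    return pack[num_ids - 1].reshape(1, 256).astype(np.float32)

# Synthetic stand-in for a voice pack (assumed shape: N x 1 x 256).
pack = np.random.default_rng(0).standard_normal((510, 1, 256)).astype(np.float32)
ref_s = select_ref_s(pack, num_ids=42)
print(ref_s.shape)
```

Because the bundles are voice-independent, switching voices only means passing a different `ref_s`; the same three bundles are reused.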

Token length T and frame length L are fixed buckets (128 and 512); the host left-pads to the bucket and trims the output. Text longer than a bucket is split into sentences host-side. Run the bundles on the Core AI CPU compute unit. Expect ~0.75 s per utterance on an M4 Max and ~335 MB total (fp32).
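
A rough sketch of the bucket padding and output trim, under stated assumptions: the pad id 0, the mask polarity (1 = real token), the int32 mask dtype, and the helper names `pad_ids` and `trim_audio` are all illustrative and not taken from the export code. The trim also assumes the real frames occupy the front of the frame bucket.

```python
import numpy as np

TOKEN_BUCKET = 128        # T, from the text
FRAME_BUCKET = 512        # L, from the text
SAMPLES_PER_FRAME = 600   # from the vocoder output shape audio[1, L*600]
SAMPLE_RATE = 24_000

def pad_ids(ids, bucket=TOKEN_BUCKET, pad_id=0, side="left"):
    """Pad token ids to the bucket; return (input_ids, attn_mask), each [1, bucket] int32."""
    n = len(ids)
    if n > bucket:
        raise ValueError(f"{n} tokens exceed the bucket of {bucket}; split the text first")
    input_ids = np.full((1, bucket), pad_id, dtype=np.int32)
    attn_mask = np.zeros((1, bucket), dtype=np.int32)
    # Real tokens go at the end for left-padding, at the start for right-padding.
    span = slice(bucket - n, bucket) if side == "left" else slice(0, n)
    input_ids[0, span] = ids
    attn_mask[0, span] = 1
    return input_ids, attn_mask

def trim_audio(audio, n_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Keep the samples of the first n_frames frames (assumes real frames come first)."""
    return audio[:, : n_frames * samples_per_frame]

ids = [5, 17, 9]  # hypothetical phoneme ids
input_ids, attn_mask = pad_ids(ids)
print(input_ids.shape, int(attn_mask.sum()))
print(FRAME_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE)  # 12.8 seconds of audio per full frame bucket
```

The last line is the arithmetic behind the sentence split: a full 512-frame bucket yields 512 · 600 = 307,200 samples, or 12.8 s at 24 kHz, so anything longer has to be chunked on the host.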

Host steps

text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody
     ──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio

G2P is misaki (misaki[en], no espeak for English); on-device MisakiSwift gives the same English phonemes. har (the hn-nsf source's STFT) is a windowed FFT computed on the host — the one piece that must stay off the engine (its atan2 phase flips 2π at the F0→0 pad boundary under fp32).
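
The [build alignment] step in the diagram can be sketched as a duration-to-one-hot expansion. This is an illustration under assumptions, not the exported pipeline's code: it assumes `duration` is a per-token frame count, that it is rounded and clamped to at least one frame, that padded tokens are zeroed with the attention mask, and that real frames are packed at the front of the frame bucket. The function name `build_alignment` is hypothetical.

```python
import numpy as np

def build_alignment(duration, attn_mask, frame_bucket=512):
    """Expand per-token durations into a one-hot token-to-frame matrix.

    duration:  T predicted frame counts, one per token
    attn_mask: T flags, 1 for real tokens and 0 for padding
    returns:   aln [1, T, frame_bucket], frame_mask [1, frame_bucket], n_frames
    """
    duration = np.asarray(duration, dtype=np.float64).reshape(-1)
    mask = np.asarray(attn_mask).reshape(-1).astype(np.int64)
    # Round, give every real token at least one frame, drop padded tokens.
    dur = np.clip(np.round(duration), 1, None).astype(np.int64) * mask
    n_frames = int(dur.sum())
    if n_frames > frame_bucket:
        raise ValueError(f"{n_frames} frames exceed the bucket of {frame_bucket}")
    # token_of_frame[j] is the index of the token that owns frame j.
    token_of_frame = np.repeat(np.arange(len(dur)), dur)
    aln = np.zeros((1, len(dur), frame_bucket), dtype=np.float32)
    aln[0, token_of_frame, np.arange(n_frames)] = 1.0
    frame_mask = np.zeros((1, frame_bucket), dtype=np.float32)
    frame_mask[0, :n_frames] = 1.0
    return aln, frame_mask, n_frames

T = 128
duration = np.zeros(T)
attn_mask = np.zeros(T)
duration[-3:] = [2.2, 0.3, 3.0]   # three real tokens at the end of a left-padded bucket
attn_mask[-3:] = 1
aln, frame_mask, n_frames = build_alignment(duration, attn_mask)
print(aln.shape, frame_mask.shape, n_frames)
```

Keeping this expansion on the host is what lets the three bundles have fixed shapes: the only data-dependent length, `n_frames`, is absorbed into `aln` and `frame_mask`.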

Quality

The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the gate is spectral: magnitude-spectrogram correlation 0.999 vs the PyTorch reference (af_heart, multiple sentences). Raw waveform correlation ~0.98 — the bounded, inaudible effect of the bucket pad boundary.
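
To illustrate why the gate compares magnitude spectrograms rather than raw samples, here is a small sketch. It is not the `--verify` implementation: the STFT parameters (`n_fft=1024`, `hop=256`), the Hann window, and the function names are assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Hann-windowed STFT magnitudes, shape [frames, n_fft // 2 + 1]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def pearson(a, b):
    """Pearson correlation of two equally sized arrays, flattened."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spectral_correlation(ref, test, n_fft=1024, hop=256):
    """Correlation between the magnitude spectrograms of two waveforms."""
    n = min(len(ref), len(test))
    return pearson(magnitude_spectrogram(ref[:n], n_fft, hop),
                   magnitude_spectrogram(test[:n], n_fft, hop))

# Two 220 Hz tones that differ only in phase, one second at 24 kHz.
sr = 24_000
t = np.arange(sr) / sr
ref = np.sin(2 * np.pi * 220 * t)
test = np.sin(2 * np.pi * 220 * t + 1.0)
print("spectral:", spectral_correlation(ref, test))  # close to 1
print("waveform:", pearson(ref, test))               # far lower, near cos(1.0)
```

A pure phase offset leaves the magnitude spectrogram almost unchanged while the sample-level correlation drops sharply, which is why a spectral gate suits a model whose source phase is arbitrary.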

Convert / re-bucket

Conversion script: conversion/export_kokoro.py. Run python export_kokoro.py --out-dir out to export; add --verify to run the engine-vs-torch spectral gate, and pass --token-bucket / --frame-bucket to resize the buckets. The card and the full port write-up are in zoo/kokoro-82m.md.

License

Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives from Apple's BSD-3-Clause coreai_models.


Model tree for mlboydaisuke/Kokoro-82M-CoreAI

Base model: yl4579/StyleTTS2-LJSpeech
Finetuned: hexgrad/Kokoro-82M
Finetuned: this model