| --- |
| license: apache-2.0 |
| language: |
| - en |
| library_name: coreai |
| pipeline_tag: text-to-speech |
| tags: |
| - text-to-speech |
| - tts |
| - core-ai |
| - coreml |
| - on-device |
| - styletts2 |
| - kokoro |
| base_model: hexgrad/Kokoro-82M |
| --- |
| |
| # Kokoro-82M — Core AI |
|
|
| [`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) (Apache-2.0), a tiny |
| high-quality **StyleTTS2 + iSTFTNet** text-to-speech model (82M params, 24 kHz), |
| converted to Apple **Core AI** (`.aimodel`, iOS 27 / macOS 27) — the |
| [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo)'s first TTS. |
|
|
| Non-autoregressive: phonemes + a voice/style vector → a waveform in one pass. |
| Runs fully on-device, English-first, with grapheme→phoneme on the host. |
|
|
| <!-- gen-cards:use-it begin id=kokoro-82m (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) --> |
| ## Use it |
|
|
| ▶️ **Run it (source)** — the [Speak runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/Speak) |
| (GUI + CLI, one app for every text-to-speech model in the catalog): |
|
|
| ```bash |
| git clone https://github.com/john-rocky/coreai-kit |
| open coreai-kit/Examples/Speak/Speak.xcodeproj |
| # → Run, then pick "Kokoro 82M" in the model picker |
| |
| # agents / headless (macOS): |
| cd coreai-kit/Examples/Speak |
| swift run speak-cli --model kokoro-82m --text "Hello from Core AI." --output hello.wav |
| ``` |
|
|
| 💻 **Build with it**: a complete snippet that uses only the kit API and runs as pasted:
|
|
| ```swift |
| import CoreAIKit |
| |
| let text = "Hello from Core AI."
| let speaker = try await KitSpeaker(catalog: "kokoro-82m")
| let audio = try await speaker.synthesize(text)
| // audio.samples: 24 kHz mono PCM in [-1, 1] — play it or write a WAV |
| ``` |
|
|
| The take-home is [`Examples/Speak/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/Speak/Sources/QuickStart.swift):
| this exact code as one typed function with no UI. The CLI is an argument shell over it, and
| the GUI drives the same `KitSpeaker(catalog:)` and plays the samples.
| English-first: G2P is a dictionary lookup over the bundled misaki lexicons (~180k words);
| out-of-dictionary words are spelled out letter by letter (no neural fallback). All 28 voices
| ship with the download; `af_heart` is the default, and the underlying `KokoroTTS` takes a
| `voice:` label. For streaming, `synthesizeStreaming(_:onChunk:)` delivers one chunk per sentence.
|
|
| **Integration checklist** |
|
|
| - SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit** |
| - Info.plist: none needed |
| - Entitlements: none needed |
| - First run downloads the model — 0.3 GB (Mac) — then it loads from the |
| local cache (Application Support; progress via the `downloadProgress` callback) |
| - Measure in Release — Debug is ~3× slower on per-token host work |
| <!-- gen-cards:use-it end --> |
|
|
| ## Bundles |
|
|
| The acoustic graph has one data-dependent length (the duration→alignment expansion), |
| so it is cut into **three voice-independent `.aimodel` bundles** with two cheap host |
| steps between them: |
|
|
| | file | in → out | |
| |---|---| |
| | `kokoro_predictor.aimodel` | `input_ids[1,128]` i32, `ref_s[1,256]`, `attn_mask[1,128]` → `duration`, `d`, `t_en` | |
| | `kokoro_prosody.aimodel` | `d`, `t_en`, `aln[1,128,512]`, `ref_s`, `frame_mask[1,512]` → `asr`, `F0`, `N` | |
| | `kokoro_vocoder.aimodel` | `asr`, `F0`, `N`, `har`, `ref_s`, `frame_mask` → `audio[1, L·600]` | |
|
|
| `voices/*.pt` — the **28 English voice packs** (Apache-2.0). The voice is the `ref_s` |
| input: `ref_s = pack[len(ids)−1]`. Quality leaders: `af_heart`, `af_bella`, |
| `af_nicole`, `bf_emma`. |
|
|
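The indexing rule `ref_s = pack[len(ids)−1]` can be sketched in plain Python. This is a minimal illustration, not the kit's code: `select_ref_s` and `toy_pack` are hypothetical names, the pack is a toy list rather than a tensor loaded from `voices/*.pt`, and it assumes `len(ids)` counts the unpadded tokens.

```python
def select_ref_s(pack, num_tokens):
    """Pick the style vector for an utterance of `num_tokens` unpadded tokens.

    `pack` is any sequence of style vectors indexed by (token count - 1),
    mirroring the card's rule ref_s = pack[len(ids) - 1].
    """
    if num_tokens < 1:
        raise ValueError("need at least one token")
    if num_tokens > len(pack):
        raise ValueError(f"{num_tokens} tokens exceeds the pack's {len(pack)} entries")
    return pack[num_tokens - 1]


# Toy stand-in: 4 entries of 256 floats each (a real pack comes from voices/*.pt).
toy_pack = [[float(i)] * 256 for i in range(4)]
ids = [12, 47, 3]                      # hypothetical phoneme ids, unpadded
ref_s = select_ref_s(toy_pack, len(ids))
print(len(ref_s), ref_s[0])            # 256 2.0
```

The bounds check matters because the pack has a finite number of entries, so very long token sequences have no matching style vector and must be split first.
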
| Token length **T** and frame length **L** are fixed **buckets** (128 and 512, respectively); the
| host left-pads inputs to the bucket size and trims the output. Longer text is split into sentences
| host-side. Run on the Core AI **CPU** compute unit. Synthesis takes ~0.75 s per utterance on an
| M4 Max, and the total size is ~335 MB (fp32).
|
|
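The duration→alignment expansion between the predictor and prosody bundles can be sketched as follows. This is an illustration under assumptions, not the kit's implementation: `build_alignment` is a hypothetical helper, durations are taken as already-rounded non-negative integers, and only the unpadded region is built (placing it inside the `[128, 512]` bucket depends on the padding convention). The 600 samples per frame comes from the vocoder output shape `audio[1, L·600]`.

```python
SAMPLES_PER_FRAME = 600   # from the vocoder output shape audio[1, L*600]


def build_alignment(durations, token_bucket=128, frame_bucket=512):
    """Expand integer per-token durations into a 0/1 token-by-frame matrix.

    Returns (aln, num_frames) where aln[i][j] == 1.0 iff frame j belongs to
    token i. Only the unpadded tokens and frames are covered; padding the
    result to the bucket shape is left to the caller.
    """
    num_frames = sum(durations)
    if len(durations) > token_bucket or num_frames > frame_bucket:
        raise ValueError("utterance exceeds the buckets; split the text further")
    aln = [[0.0] * num_frames for _ in durations]
    frame = 0
    for token, dur in enumerate(durations):
        for j in range(frame, frame + dur):
            aln[token][j] = 1.0       # frames [frame, frame + dur) map to this token
        frame += dur
    return aln, num_frames


aln, frames = build_alignment([2, 1, 3])
for row in aln:
    print(row)
print(frames * SAMPLES_PER_FRAME)     # 3600 audio samples to keep after trimming
```

Each frame column contains exactly one `1.0`, so multiplying token features by this matrix repeats each token's features for as many frames as its duration.
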
| ## Host steps |
|
|
| ``` |
| text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody |
| ──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio |
| ``` |
|
|
| G2P is [misaki](https://github.com/hexgrad/misaki) (`misaki[en]`, no espeak for |
| English); on-device [MisakiSwift](https://github.com/mlalma/MisakiSwift) gives the same |
| English phonemes. `har` (the hn-nsf source's STFT) is a windowed FFT computed on the |
| host — the one piece that must stay off the engine (its `atan2` phase flips 2π at the |
| F0→0 pad boundary under fp32). |
|
|
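The 2π flip is a general property of `atan2` near its branch cut, which a few lines of Python can show. This only illustrates the numerical effect; the card does not spell out the exact mechanism inside the vocoder, so treat the link to the F0→0 boundary as an assumption.

```python
import math

# Two points just either side of the negative real axis: the imaginary part
# differs only by a rounding-sized amount, but the reported phase does not.
above = math.atan2(+1e-7, -1.0)   # close to +pi
below = math.atan2(-1e-7, -1.0)   # close to -pi
jump = above - below
print(round(jump / math.pi, 3))   # 2.0
```

A sign change at the level of fp32 rounding error is therefore enough to swing the phase by a full turn, which is consistent with keeping that computation on the host.
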
| ## Quality |
|
|
| The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the quality gate is
| spectral: **magnitude-spectrogram correlation of 0.999** against the PyTorch reference
| (`af_heart`, multiple sentences). Raw waveform correlation is ~0.98; the shortfall is the
| bounded, inaudible effect of the bucket pad boundary.
|
|
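Why a spectral gate is more robust to phase than a waveform gate can be seen on a toy signal. The sketch below is not the `--verify` implementation: the window, frame size, naive DFT, and the helper names `magnitude_spectrogram` and `pearson` are all assumptions chosen to keep the example dependency-free. It compares a sine with a phase-shifted copy of itself.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame=64, hop=32):
    """Flattened |STFT| using a periodic Hann window and a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    mags = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame)]
        for k in range(frame // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame)
                      for n, x in enumerate(chunk))
            mags.append(abs(acc))
    return mags


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Same tone (8 cycles per 64-sample frame), second copy shifted by 1 radian.
n_samples, phase = 512, 1.0
step = 2 * math.pi * 8 / 64
ref = [math.sin(step * n) for n in range(n_samples)]
test = [math.sin(step * n + phase) for n in range(n_samples)]

wave_corr = pearson(ref, test)
spec_corr = pearson(magnitude_spectrogram(ref), magnitude_spectrogram(test))
print(f"waveform {wave_corr:.3f}  spectral {spec_corr:.3f}")
# waveform 0.540  spectral 1.000
```

The two signals sound identical, yet the waveform correlation drops to cos(1) while the magnitude-spectrogram correlation stays at 1, which is the reason to gate on magnitudes when the source phase is arbitrary.
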
| ## Convert / re-bucket |
|
|
| [`conversion/export_kokoro.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_kokoro.py)
| performs the conversion: `python export_kokoro.py --out-dir out`. `--verify` runs the
| engine-vs-torch spectral gate, and `--token-bucket` / `--frame-bucket` re-size the buckets.
| The card and the full port write-up are in
| [`zoo/kokoro-82m.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/kokoro-82m.md).
|
|
| ## License |
|
|
| Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives |
| from Apple's BSD-3-Clause `coreai_models`. |
|
|