	---
	license: apache-2.0
	language:
	- en
	library_name: coreai
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- tts
	- core-ai
	- coreml
	- on-device
	- styletts2
	- kokoro
	base_model: hexgrad/Kokoro-82M
	---

	# Kokoro-82M — Core AI

	[`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) (Apache-2.0), a tiny
	high-quality StyleTTS2 + iSTFTNet text-to-speech model (82M params, 24 kHz),
	converted to Apple Core AI (`.aimodel`, iOS 27 / macOS 27) — the
	[CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo)'s first TTS.

	Non-autoregressive: phonemes + a voice/style vector → a waveform in one pass.
	Runs fully on-device, English-first, with grapheme→phoneme on the host.

	<!-- gen-cards:use-it begin id=kokoro-82m (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [Speak runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/Speak)
	(GUI + CLI, one app for every text-to-speech model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/Speak/Speak.xcodeproj
	# → Run, then pick "Kokoro 82M" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/Speak
	swift run speak-cli --model kokoro-82m --text "Hello from Core AI." --output hello.wav
	```

	💻 Build with it — complete; the glue is kit API, copy-paste runs:

	```swift
	import CoreAIKit

	let text = "Hello from Core AI."
	let speaker = try await KitSpeaker(catalog: "kokoro-82m")
	let audio = try await speaker.synthesize(text)
	// audio.samples: 24 kHz mono PCM in [-1, 1] — play it or write a WAV
	```

	The take-home is [`Examples/Speak/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/Speak/Sources/QuickStart.swift)
	— this exact code as one typed function, no UI; the CLI is an argument shell over it, and
	the GUI drives the same `KitSpeaker(catalog:)` and plays the samples.
	English-first: G2P is a dictionary over the bundled misaki lexicons (~180k words);
	out-of-dictionary words are letter-spelled (no neural fallback). 28 voices ride the
	download — `af_heart` is the default; the underlying `KokoroTTS` takes a `voice:`
	label. Streaming? `synthesizeStreaming(_:onChunk:)` hands you a chunk per sentence.

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKit
	- Info.plist: none needed
	- Entitlements: none needed
	- First run downloads the model — 0.3 GB (Mac) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release — Debug is ~3× slower on per-token host work
	<!-- gen-cards:use-it end -->

	## Bundles

	The acoustic graph has one data-dependent length (the duration→alignment expansion),
	so it is cut into three voice-independent `.aimodel` bundles with two cheap host
	steps between them:

	| file | in → out |
	|---|---|
	| `kokoro_predictor.aimodel` | `input_ids[1,128]` i32, `ref_s[1,256]`, `attn_mask[1,128]` → `duration`, `d`, `t_en` |
	| `kokoro_prosody.aimodel` | `d`, `t_en`, `aln[1,128,512]`, `ref_s`, `frame_mask[1,512]` → `asr`, `F0`, `N` |
	| `kokoro_vocoder.aimodel` | `asr`, `F0`, `N`, `har`, `ref_s`, `frame_mask` → `audio[1, L·600]` |

	`voices/*.pt` holds the 28 English voice packs (Apache-2.0). The voice enters the
	graph as the `ref_s` input, indexed by token count: `ref_s = pack[len(ids)−1]`.
	Quality leaders: `af_heart`, `af_bella`, `af_nicole`, `bf_emma`.
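
The `ref_s` lookup can be sketched in a few lines. This is a minimal illustration, assuming a voice pack behaves like a list of 256-float style rows; the helper name `select_ref_s`, the toy pack, and the error handling for out-of-range lengths are hypothetical, not the kit's code.

```python
def select_ref_s(pack, ids):
    """Return the style row for an utterance: pack[len(ids) - 1]."""
    if not ids:
        raise ValueError("need at least one token id")
    if len(ids) > len(pack):
        # Assumption: refuse rather than clamp; the shipped host code may differ.
        raise ValueError("utterance is longer than the voice pack covers")
    return pack[len(ids) - 1]


# Toy pack: 4 rows of 256 floats, row r filled with the value r.
# Real packs ship with the model download.
toy_pack = [[float(r)] * 256 for r in range(4)]
ref_s = select_ref_s(toy_pack, ids=[11, 42, 7])
print(len(ref_s), ref_s[0])
```

With three token ids the lookup returns row 2, which is then fed unchanged to all three bundles as `ref_s[1,256]`.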

	Token length T and frame length L are fixed buckets (128 / 512); the host
	left-pads to the bucket and trims the output. Longer text is split into sentences
	host-side. Run the bundles on the Core AI CPU compute unit: ~0.75 s per utterance on
	an M4 Max, ~335 MB total (fp32).
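
As a worked check of the bucket numbers: the vocoder output shape `audio[1, L·600]` means one frame is 600 samples, or 25 ms at 24 kHz, so a full 512-frame bucket holds at most 12.8 s of audio. The sketch below assumes the real (untrimmed) sample count scales the same way, 600 samples per predicted frame; the function names are hypothetical.

```python
SAMPLE_RATE = 24_000      # Hz, from the model description
SAMPLES_PER_FRAME = 600   # from the vocoder output shape audio[1, L*600]
TOKEN_BUCKET = 128        # T
FRAME_BUCKET = 512        # L


def fits_buckets(n_tokens, durations):
    """True if an utterance fits both fixed buckets (durations are frames per token)."""
    return n_tokens <= TOKEN_BUCKET and sum(durations) <= FRAME_BUCKET


def trimmed_samples(durations):
    """Number of real audio samples to keep once the padded output is trimmed."""
    return sum(durations) * SAMPLES_PER_FRAME


durations = [3, 5, 4, 8]                  # toy per-token frame counts, 20 frames total
keep = trimmed_samples(durations)         # 20 * 600 samples
max_seconds = FRAME_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE
print(fits_buckets(len(durations), durations), keep, keep / SAMPLE_RATE, max_seconds)
```

A sentence whose predicted durations sum past 512 frames does not fit and has to be split further before synthesis.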

	## Host steps

	```
	text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody
	──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio
	```
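
The `[build alignment]` step expands per-token durations into the `aln[1,128,512]` input: token `t` owns `duration[t]` consecutive frames. A minimal sketch, assuming integer durations, real frames placed at the start of the frame bucket, and a mask convention of 1.0 for real frames; the function name and those layout choices are assumptions, and the shipped host code may place padding differently.

```python
def build_alignment(durations, token_bucket=128, frame_bucket=512):
    """Expand per-token frame counts into a one-hot [token_bucket][frame_bucket] matrix.

    Returns (aln, frame_mask): aln[t][f] is 1.0 when frame f belongs to token t,
    frame_mask[f] is 1.0 for real frames and 0.0 for padding.
    """
    n_frames = sum(durations)
    if len(durations) > token_bucket or n_frames > frame_bucket:
        raise ValueError("utterance does not fit the buckets; split the text")
    aln = [[0.0] * frame_bucket for _ in range(token_bucket)]
    frame = 0
    for token, dur in enumerate(durations):
        for _ in range(dur):
            aln[token][frame] = 1.0   # this frame is voiced by this token
            frame += 1
    frame_mask = [1.0] * n_frames + [0.0] * (frame_bucket - n_frames)
    return aln, frame_mask


aln, frame_mask = build_alignment([2, 1, 3])
print([row[:6] for row in aln[:3]])
print(sum(frame_mask))
```

Each real frame column contains exactly one 1.0, so multiplying token features by `aln` repeats every token's features for as many frames as its duration.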

	G2P is [misaki](https://github.com/hexgrad/misaki) (`misaki[en]`, no espeak for
	English); on-device [MisakiSwift](https://github.com/mlalma/MisakiSwift) gives the same
	English phonemes. `har` (the hn-nsf source's STFT) is a windowed FFT computed on the
	host — the one piece that must stay off the engine (its `atan2` phase flips 2π at the
	F0→0 pad boundary under fp32).
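
The 2π flip is a property of `atan2` itself: near the negative real axis, a vanishing change in the imaginary part moves the result from +π to −π. The snippet below illustrates that branch cut in general terms; it is not a reproduction of the engine's behavior, and the input values are made up for the demonstration.

```python
import math

# Two complex bins that differ only in the sign of a vanishing imaginary part.
re = -1.0
phase_pos = math.atan2(+1e-12, re)   # just above the negative real axis, close to +pi
phase_neg = math.atan2(-1e-12, re)   # just below it, close to -pi
jump = phase_pos - phase_neg
print(phase_pos, phase_neg, jump)
```

A rounding-level difference in the imaginary part is enough to move the phase by almost exactly 2π.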

	## Quality

	The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the gate is
	spectral: magnitude-spectrogram correlation 0.999 vs the PyTorch reference
	(`af_heart`, multiple sentences). Raw waveform correlation ~0.98 — the bounded,
	inaudible effect of the bucket pad boundary.
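
Why a spectral gate rather than a waveform one: two signals with the same content but different phase can have a waveform correlation near zero while their magnitude spectra match. A toy sketch of that idea, using a single naive DFT frame in place of a full spectrogram; the helper names are hypothetical and this is not the check that `--verify` runs.

```python
import cmath
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def magnitude_spectrum(x):
    """Magnitudes of the first N/2+1 bins of a naive DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


n, k = 64, 4
ref = [math.sin(2 * math.pi * k * t / n) for t in range(n)]
# Same tone with a quarter-cycle phase shift, standing in for an arbitrary source phase.
out = [math.sin(2 * math.pi * k * t / n + math.pi / 2) for t in range(n)]

wave_corr = pearson(ref, out)
spec_corr = pearson(magnitude_spectrum(ref), magnitude_spectrum(out))
print(round(abs(wave_corr), 3), round(spec_corr, 3))
```

The waveform correlation comes out near 0 and the magnitude correlation near 1, which is why phase-insensitive comparison is the meaningful test when the source phase is arbitrary.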

	## Convert / re-bucket

	[`conversion/export_kokoro.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_kokoro.py)
	is the conversion script: run `python export_kokoro.py --out-dir out`, add `--verify` to run
	the engine-vs-torch spectral gate, and pass `--token-bucket` / `--frame-bucket` to re-size
	the buckets. The card and the full port write-up are in
	[`zoo/kokoro-82m.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/kokoro-82m.md).

	## License

	Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives
	from Apple's BSD-3-Clause `coreai_models`.