speech-recognition, asr, mlx, apple-silicon, dictation, text-to-speech
Optimized speech models for Apple Silicon, powering Sonic — a local-first voice AI system. All models run entirely on-device using MLX. No cloud, no API keys, no data leaves your Mac.
SOTA English speech recognition with encoder-only mixed-precision quantization.
| Model | Size | WER (LibriSpeech) | WER (TED-LIUM) | RTFx | Peak Memory |
|---|---|---|---|---|---|
| parakeet-tdt-0.6b-v3 | 1,254 MB | 0.82% | 15.1% | 73x | 3,002 MB |
| parakeet-tdt-0.6b-v3-int8 | 755 MB | 0.82% | 15.1% | 95x | 1,268 MB |
| parakeet-tdt-0.6b-v3-int4 | 489 MB | 0.82% | 15.5% | 98x | 1,003 MB |
| parakeet-tdt-0.6b-v2 | 1,222 MB | — | — | — | — |
| parakeet-tdt-0.6b-v2-int8 | 736 MB | — | — | — | — |
| parakeet-tdt-0.6b-v2-int4 | 470 MB | — | — | — | — |
v3 supports 25 languages; v2 is English-only. INT8 is recommended: no WER loss on either benchmark, about 40% smaller, and about 30% faster than the unquantized model.
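The "40% smaller, 30% faster" figures can be re-derived from the v3 rows of the table. The sketch below is only a sanity check on that arithmetic; the dictionary layout and the helper names `percent_change` and `summarize` are illustrative and not part of any Sonic API.

```python
# Values copied from the v3 rows of the benchmark table.
BASELINE = {"size_mb": 1254, "rtfx": 73, "peak_mb": 3002}
VARIANTS = {
    "int8": {"size_mb": 755, "rtfx": 95, "peak_mb": 1268},
    "int4": {"size_mb": 489, "rtfx": 98, "peak_mb": 1003},
}


def percent_change(new: float, old: float) -> float:
    """Signed percentage change going from `old` to `new`."""
    return (new - old) / old * 100.0


def summarize(variant: dict, baseline: dict = BASELINE) -> dict:
    """Rounded percentage change of every metric relative to the baseline."""
    return {key: round(percent_change(variant[key], baseline[key])) for key in baseline}


for name, stats in VARIANTS.items():
    print(name, summarize(stats))
# int8 {'size_mb': -40, 'rtfx': 30, 'peak_mb': -58}
# int4 {'size_mb': -61, 'rtfx': 34, 'peak_mb': -67}
```

A higher RTFx is better, so the positive `rtfx` change is a speedup, while the negative `size_mb` and `peak_mb` changes are reductions.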
Fast text-to-speech with 32+ voices (American, British, Japanese, Chinese).
| Model | Size | Short Text | Medium Text | TTFC (streaming) | RTFx |
|---|---|---|---|---|---|
| kokoro-82m-bf16 | ~170 MB | 47 ms | 224 ms | 126 ms | 41x |
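The tables do not define RTFx. Assuming it is the usual inverse real-time factor (seconds of audio handled per second of compute), a rough throughput estimate looks like the sketch below. The function name `processing_seconds` is hypothetical, and the estimate ignores fixed start-up costs, which is why very short inputs are better described by the measured latency columns.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated compute time for `audio_seconds` of audio at a given RTFx.

    Assumes RTFx = audio duration / processing time.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One minute of synthesized speech with kokoro-82m-bf16 (41x).
print(round(processing_seconds(60.0, 41), 2))    # 1.46

# One hour of audio transcribed with parakeet-tdt-0.6b-v3-int8 (95x).
print(round(processing_seconds(3600.0, 95), 1))  # 37.9
```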
In the quantized Parakeet variants, only the Conformer encoder (~85% of parameters) is quantized; the decoder stays in BF16 to preserve token precision.
| Variant | Size | Speed | Memory | WER Impact |
|---|---|---|---|---|
| INT8 | -40% | +30% | -58% | None |
| INT4 | -61% | +34% | -67% | +0.4pp on real speech |
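The size reductions in this table are roughly what encoder-only quantization predicts. The sketch below estimates the on-disk size as a fraction of the BF16 baseline when ~85% of the parameters are quantized and the rest stay at 16 bits. The group size of 64 and the 32 bits of per-group overhead (one 16-bit scale plus one 16-bit bias) are assumptions about the quantization scheme, not values stated here, and `quantized_size_fraction` is a hypothetical helper.

```python
def quantized_size_fraction(
    bits: int,
    quantized_share: float = 0.85,      # share of parameters in the encoder (from the text)
    group_size: int = 64,               # assumed weights per quantization group
    overhead_bits_per_group: int = 32,  # assumed 16-bit scale + 16-bit bias per group
    baseline_bits: int = 16,            # BF16 baseline
) -> float:
    """Estimated model size relative to the all-BF16 baseline."""
    # Each quantized weight also carries its share of the per-group scale and bias.
    effective_bits = bits + overhead_bits_per_group / group_size
    quantized_part = quantized_share * effective_bits / baseline_bits
    untouched_part = 1.0 - quantized_share
    return quantized_part + untouched_part


for bits in (8, 4):
    change = (quantized_size_fraction(bits) - 1.0) * 100.0
    print(f"INT{bits}: {change:.1f}% size change")
# INT8: -39.8% size change
# INT4: -61.1% size change
```

Under these assumptions the estimate lands within a point of the reported -40% and -61%, which is consistent with the decoder being left unquantized.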
# ASR
from parakeet import from_pretrained
model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8")
# TTS
from sonic_tts import SonicTTS
tts = SonicTTS(voice="af_heart")
All benchmarks: Apple M3 Max 64 GB, macOS Sequoia, MLX 0.30.4. Built by https://huggingface.co/flight505.