---
title: Sonic Speech
emoji: 🎀
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
---

# Sonic Speech

  Optimized speech models for Apple Silicon, powering [Sonic](https://github.com/flight505/sonic-workspace), a local-first voice AI
  system. All models run entirely on-device using [MLX](https://github.com/ml-explore/mlx): no cloud, no API keys, and no data leaves your
  Mac.

## ASR: Parakeet TDT (NVIDIA, ported to MLX)

  SOTA English speech recognition with encoder-only mixed-precision quantization.

  | Model | Size | WER (LibriSpeech) | WER (TED-LIUM) | RTFx | Peak Memory |
  |-------|------|-------------------|-----------------|------|-------------|
  | [parakeet-tdt-0.6b-v3](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3) | 1,254 MB | 0.82% | 15.1% | 73x | 3,002 MB |
  | [parakeet-tdt-0.6b-v3-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int8) | 755 MB | 0.82% | 15.1% | 95x | 1,268 MB |
  | [parakeet-tdt-0.6b-v3-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int4) | 489 MB | 0.82% | 15.5% | 98x | 1,003 MB |
  | [parakeet-tdt-0.6b-v2](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2) | 1,222 MB | - | - | - | - |
  | [parakeet-tdt-0.6b-v2-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int8) | 736 MB | - | - | - | - |
  | [parakeet-tdt-0.6b-v2-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int4) | 470 MB | - | - | - | - |

  **v3** supports 25 languages. **v2** is English-only. **INT8 is recommended**: zero WER loss, 40% smaller, 30% faster.
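
The INT8 figures quoted above follow from the v3 rows of the table. A minimal sketch of that arithmetic, with the table values copied in by hand; the names `V3`, `relative_change` and `summarize` are illustrative and not part of any Sonic API:

```python
# Values copied from the v3 rows of the ASR table above.
V3 = {
    "bf16": {"size_mb": 1254, "rtfx": 73, "peak_mb": 3002},
    "int8": {"size_mb": 755, "rtfx": 95, "peak_mb": 1268},
    "int4": {"size_mb": 489, "rtfx": 98, "peak_mb": 1003},
}


def relative_change(new: float, base: float) -> float:
    """Return the signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def summarize(variant: str) -> dict:
    """Compare one quantized variant against the unquantized v3 model."""
    base, row = V3["bf16"], V3[variant]
    return {key: round(relative_change(row[key], base[key])) for key in base}


for name in ("int8", "int4"):
    print(name, summarize(name))
```

For INT8 this gives -40% size, +30% RTFx and -58% peak memory; for INT4 it gives -61%, +34% and -67%.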

## TTS: Kokoro 82M (MLX)

  Fast text-to-speech with 32+ voices (American, British, Japanese, Chinese).

  | Model | Size | Short Text | Medium Text | TTFC (streaming) | RTFx |
  |-------|------|------------|-------------|------------------|------|
  | [kokoro-82m-bf16](https://huggingface.co/sonic-speech/kokoro-82m-bf16) | ~170 MB | 47 ms | 224 ms | 126 ms | 41x |
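
RTFx is not defined on this page. Assuming the common meaning (seconds of audio per second of compute), a rough compute-time estimate can be sketched as follows; `processing_seconds` is a hypothetical helper, not part of `sonic_tts`:

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated compute time, assuming RTFx = audio duration / processing time."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One minute of synthesized speech at the 41x figure from the table above.
print(f"{processing_seconds(60.0, 41.0):.2f} s")
```

Under that assumption, a minute of audio would take roughly a second and a half to generate on the benchmark machine.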

## Quantization Strategy

  Only the Conformer encoder (~85% of params) is quantized; the decoder stays in BF16 for token precision.
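
Encoder-only quantization can be expressed as a predicate over module paths. The sketch below is not the conversion script used for these checkpoints. It assumes encoder submodules sit under an `encoder.` path prefix and that quantizable layers expose a `to_quantized` method; `encoder_only`, `FakeLinear`, `FakeNorm` and the module paths are hypothetical.

```python
class FakeLinear:
    """Stand-in for a quantizable layer (exposes `to_quantized`)."""

    def to_quantized(self, group_size: int = 64, bits: int = 8):
        return self


class FakeNorm:
    """Stand-in for a layer that has no quantized form."""


def encoder_only(path: str, module: object) -> bool:
    """Select quantizable layers inside the encoder; skip everything else.

    Assumes encoder submodules are named "encoder.<...>" (hypothetical layout).
    """
    return path.startswith("encoder.") and hasattr(module, "to_quantized")


# Hypothetical module paths, for illustration only.
modules = {
    "encoder.layers.0.linear1": FakeLinear(),
    "encoder.layers.0.norm": FakeNorm(),
    "decoder.prediction.linear": FakeLinear(),
}
selected = [path for path, mod in modules.items() if encoder_only(path, mod)]
print(selected)
```

With MLX installed, a predicate of this shape can be passed as the `class_predicate` argument of `mlx.nn.quantize`, so that modules outside the encoder keep their original precision.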

  | Variant | Size | Speed | Memory | WER Impact |
  |---------|------|-------|--------|------------|
  | INT8 | -40% | +30% | -58% | None |
  | INT4 | -61% | +34% | -67% | +0.4pp on real speech |
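
The size column can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 16-bit weights for the unquantized parts, the ~85% encoder share stated above, and group-wise quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights. The group size and overhead are assumptions, not values stated on this page, and `relative_size` is a hypothetical helper.

```python
def relative_size(bits: int, quantized_share: float = 0.85,
                  group_size: int = 64, base_bits: int = 16) -> float:
    """Estimated model size as a fraction of the full-precision size."""
    # Each group stores a 16-bit scale and a 16-bit bias (assumed layout).
    overhead_bits = 2 * 16 / group_size
    effective_bits = bits + overhead_bits
    quantized = quantized_share * effective_bits / base_bits
    untouched = 1.0 - quantized_share
    return quantized + untouched


for bits in (8, 4):
    change = (relative_size(bits) - 1.0) * 100.0
    print(f"INT{bits}: {change:+.1f}% size")
```

Under these assumptions the estimate comes out at about -39.8% for INT8 and -61.1% for INT4, close to the table, though the match does not confirm the actual quantization settings.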

## Quick Start

  ```python
  # ASR
  from parakeet import from_pretrained
  model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8")

  # TTS
  from sonic_tts import SonicTTS
  tts = SonicTTS(voice="af_heart")
  ```

  All benchmarks: Apple M3 Max 64 GB, macOS Sequoia, MLX 0.30.4. Built by [flight505](https://huggingface.co/flight505).