---
title: Sonic Speech
emoji: 🎤
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
---

# Sonic Speech

  Optimized speech models for Apple Silicon, powering [Sonic](https://github.com/flight505/sonic-workspace), a local-first voice AI system. All models run entirely on-device using [MLX](https://github.com/ml-explore/mlx). No cloud, no API keys, no data leaves your Mac.

## ASR — Parakeet TDT (NVIDIA, ported to MLX)

  SOTA English speech recognition with encoder-only mixed-precision quantization.

  | Model | Size | WER (LibriSpeech) | WER (TED-LIUM) | RTFx | Peak Memory |
  |-------|------|-------------------|-----------------|------|-------------|
  | [parakeet-tdt-0.6b-v3](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3) | 1,254 MB | 0.82% | 15.1% | 73x | 3,002 MB |
  | [parakeet-tdt-0.6b-v3-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int8) | 755 MB | 0.82% | 15.1% | 95x | 1,268 MB |
  | [parakeet-tdt-0.6b-v3-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int4) | 489 MB | 0.82% | 15.5% | 98x | 1,003 MB |
  | [parakeet-tdt-0.6b-v2](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2) | 1,222 MB | — | — | — | — |
  | [parakeet-tdt-0.6b-v2-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int8) | 736 MB | — | — | — | — |
  | [parakeet-tdt-0.6b-v2-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int4) | 470 MB | — | — | — | — |

  **v3** supports 25 languages. **v2** is English-only. **INT8 is recommended**: no WER change on either benchmark, about 40% smaller and 30% faster than the unquantized model.
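
The relative figures quoted for the quantized variants can be rechecked from the v3 rows of the table. The sketch below does that arithmetic; the dictionary layout and the `percent_change` and `relative_figures` helpers are illustrative names, not part of any Sonic package.

```python
# Figures copied from the parakeet-tdt-0.6b-v3 rows of the table above.
BASELINE = {"size_mb": 1254, "rtfx": 73, "peak_mb": 3002}
VARIANTS = {
    "int8": {"size_mb": 755, "rtfx": 95, "peak_mb": 1268},
    "int4": {"size_mb": 489, "rtfx": 98, "peak_mb": 1003},
}


def percent_change(new: float, old: float) -> int:
    """Signed percentage change from old to new, rounded to a whole percent."""
    return round((new - old) / old * 100)


def relative_figures(variant: dict) -> dict:
    """Percentage change of each metric relative to the unquantized baseline."""
    return {key: percent_change(variant[key], BASELINE[key]) for key in BASELINE}


for name, row in VARIANTS.items():
    print(name, relative_figures(row))
```

A negative value means smaller (size, peak memory); a positive `rtfx` value means faster.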

## TTS — Kokoro 82M (MLX)

  Fast text-to-speech with 32+ voices (American, British, Japanese, Chinese).

  | Model | Size | Short Text | Medium Text | TTFC (streaming) | RTFx |
  |-------|------|------------|-------------|------------------|------|
  | [kokoro-82m-bf16](https://huggingface.co/sonic-speech/kokoro-82m-bf16) | ~170 MB | 47 ms | 224 ms | 126 ms | 41x |
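
RTFx is not defined in this README. Assuming it follows the common convention (seconds of audio divided by seconds of processing time, the inverse of the real-time factor), the figures in the tables can be turned into rough wall-clock estimates. The helper name `processing_seconds` is hypothetical.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated wall-clock time to transcribe or synthesize audio at a given RTFx.

    Assumes RTFx = audio duration / processing time.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# Kokoro at 41x: ten seconds of synthesized speech.
print(f"{processing_seconds(10, 41):.2f} s")

# Parakeet v3 INT8 at 95x: one hour of recorded audio.
print(f"{processing_seconds(3600, 95):.1f} s")
```

These are throughput estimates only; they ignore model load time and, for streaming TTS, the time to first chunk.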

## Quantization Strategy

  Only the Conformer encoder (~85% of params) is quantized — the decoder stays BF16 for token precision.

  | Variant | Size | Speed | Memory | WER Impact |
  |---------|------|-------|--------|------------|
  | INT8 | -40% | +30% | -58% | None |
  | INT4 | -61% | +34% | -67% | +0.4 pp on TED-LIUM |
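
As a rough sanity check on the checkpoint sizes, the sketch below estimates the size of a model in which only the encoder is quantized. It assumes group-wise affine quantization with a 16-bit scale and a 16-bit bias per group of 64 weights (the README does not state the group size actually used), and it treats every encoder parameter as quantizable, which is a simplification. The function and parameter names are illustrative.

```python
def estimate_size_mb(
    total_mb: float,
    encoder_fraction: float,
    bits: int,
    group_size: int = 64,      # assumption: weights per quantization group
    overhead_bits: int = 32,   # assumption: 16-bit scale + 16-bit bias per group
    source_bits: int = 16,     # original weights stored as BF16
) -> float:
    """Estimate checkpoint size when only the encoder weights are quantized."""
    encoder_mb = total_mb * encoder_fraction
    decoder_mb = total_mb - encoder_mb  # left at 16-bit precision
    bits_per_weight = bits + overhead_bits / group_size
    return decoder_mb + encoder_mb * bits_per_weight / source_bits


# 1,254 MB unquantized v3 checkpoint, ~85% of parameters in the encoder.
for bits in (8, 4):
    print(f"INT{bits}: ~{estimate_size_mb(1254, 0.85, bits):.0f} MB")
```

Under these assumptions the estimates land within a few MB of the published 755 MB (INT8) and 489 MB (INT4) checkpoints.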

## Quick Start

  ```python
  # ASR
  from parakeet import from_pretrained
  model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8")

  # TTS
  from sonic_tts import SonicTTS
  tts = SonicTTS(voice="af_heart")
  ```

  All benchmarks: Apple M3 Max 64 GB, macOS Sequoia, MLX 0.30.4. Built by [flight505](https://huggingface.co/flight505).