---
title: Sonic Speech
emoji: π€
colorFrom: purple
colorTo: blue
sdk: static
pinned: false
---

# Sonic Speech

Optimized speech models for Apple Silicon, powering [Sonic](https://github.com/flight505/sonic-workspace), a local-first voice AI system. All models run entirely on-device using [MLX](https://github.com/ml-explore/mlx). No cloud, no API keys, no data leaves your Mac.

## ASR: Parakeet TDT (NVIDIA, ported to MLX)

State-of-the-art English speech recognition with encoder-only mixed-precision quantization.

| Model | Size | WER (LibriSpeech) | WER (TED-LIUM) | RTFx | Peak Memory |
|-------|------|-------------------|-----------------|------|-------------|
| [parakeet-tdt-0.6b-v3](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3) | 1,254 MB | 0.82% | 15.1% | 73x | 3,002 MB |
| [parakeet-tdt-0.6b-v3-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int8) | 755 MB | 0.82% | 15.1% | 95x | 1,268 MB |
| [parakeet-tdt-0.6b-v3-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v3-int4) | 489 MB | 0.82% | 15.5% | 98x | 1,003 MB |
| [parakeet-tdt-0.6b-v2](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2) | 1,222 MB | - | - | - | - |
| [parakeet-tdt-0.6b-v2-int8](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int8) | 736 MB | - | - | - | - |
| [parakeet-tdt-0.6b-v2-int4](https://huggingface.co/sonic-speech/parakeet-tdt-0.6b-v2-int4) | 470 MB | - | - | - | - |

**v3** supports 25 languages. **v2** is English-only. **INT8 is recommended**: no WER loss on either benchmark, 40% smaller, and 30% faster than the unquantized model.

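The README does not define RTFx. The sketch below assumes the usual reading, the inverse real-time factor (seconds of audio processed per second of compute), and converts the v3 figures from the table into estimated wall-clock time. The helper name `transcription_seconds` is hypothetical and not part of any Sonic package.

```python
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock processing time, assuming RTFx = audio time / compute time."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# RTFx values copied from the v3 rows of the benchmark table.
RTFX = {"v3": 73, "v3-int8": 95, "v3-int4": 98}

ONE_HOUR = 3600.0  # seconds of audio
estimates = {name: transcription_seconds(ONE_HOUR, r) for name, r in RTFX.items()}

for name, secs in estimates.items():
    print(f"{name}: ~{secs:.1f} s to process one hour of audio")
```

Under that assumption, every v3 variant processes an hour of audio in well under a minute on the benchmark machine. Real throughput will depend on hardware and audio length.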
## TTS: Kokoro 82M (MLX)

Fast text-to-speech with 32+ voices (American, British, Japanese, Chinese).

| Model | Size | Short Text | Medium Text | TTFC (streaming) | RTFx |
|-------|------|------------|-------------|------------------|------|
| [kokoro-82m-bf16](https://huggingface.co/sonic-speech/kokoro-82m-bf16) | ~170 MB | 47 ms | 224 ms | 126 ms | 41x |

## Quantization Strategy

Only the Conformer encoder (~85% of parameters) is quantized; the decoder stays in BF16 to preserve token precision. The figures below are relative to the unquantized v3 model.

| Variant | Size | Speed | Memory | WER Impact |
|---------|------|-------|--------|------------|
| INT8 | -40% | +30% | -58% | None |
| INT4 | -61% | +34% | -67% | +0.4 pp on real speech |

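The relative figures in this table can be reproduced from the v3 rows of the ASR benchmark table. The sketch below does that arithmetic. It assumes "Speed" refers to RTFx, "Memory" to peak memory, and "WER Impact" to the TED-LIUM column. The dictionary keys and helper names are illustrative only.

```python
# Absolute numbers copied from the v3 rows of the ASR benchmark table.
BASE = {"size_mb": 1254, "rtfx": 73, "peak_mb": 3002, "wer_tedlium": 15.1}
VARIANTS = {
    "INT8": {"size_mb": 755, "rtfx": 95, "peak_mb": 1268, "wer_tedlium": 15.1},
    "INT4": {"size_mb": 489, "rtfx": 98, "peak_mb": 1003, "wer_tedlium": 15.5},
}


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def summarize(variant: dict) -> dict:
    """Relative size, speed, and memory (rounded %) plus the WER delta in percentage points."""
    return {
        "size_pct": round(pct_change(variant["size_mb"], BASE["size_mb"])),
        "speed_pct": round(pct_change(variant["rtfx"], BASE["rtfx"])),
        "memory_pct": round(pct_change(variant["peak_mb"], BASE["peak_mb"])),
        "wer_pp": round(variant["wer_tedlium"] - BASE["wer_tedlium"], 1),
    }


summary = {name: summarize(v) for name, v in VARIANTS.items()}
for name, row in summary.items():
    print(name, row)
```

The rounded results match the table: INT8 leaves both WER columns unchanged, and INT4's 0.4 pp increase appears only on TED-LIUM.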
## Quick Start

```python
# ASR: load the recommended INT8 Parakeet v3 checkpoint
from parakeet import from_pretrained
model = from_pretrained("sonic-speech/parakeet-tdt-0.6b-v3-int8")

# TTS: create a Kokoro synthesizer using the "af_heart" voice
from sonic_tts import SonicTTS
tts = SonicTTS(voice="af_heart")
```

All benchmarks: Apple M3 Max 64 GB, macOS Sequoia, MLX 0.30.4. Built by https://huggingface.co/flight505.