MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-7B model for on-device inference on Apple Silicon; an M3 Pro or M4 Pro with 16+ GB of unified memory is recommended. It trades ~1 GB of extra disk space versus the CTC-3B 4-bit variant for measurably better accuracy on low-resource languages, per Meta's published FLEURS results.
Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time.
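Because the model ends in a linear CTC head, transcripts can be recovered from frame-level predictions with plain greedy CTC decoding: collapse consecutive repeats, then drop blanks. A minimal sketch (the blank id and token ids below are illustrative, not taken from this model's tokenizer):

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Greedy CTC decoding: collapse repeated frame predictions, drop blanks.

    frame_ids: per-frame argmax token ids from the CTC head.
    blank_id: illustrative; check the actual blank index for this tokenizer.
    """
    out = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the CTC blank symbol.
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# e.g. frames [5, 5, 0, 5, 7, 7, 0] -> tokens [5, 5, 7]
```

The decoded ids would then be mapped back to text with the SentencePiece tokenizer shipped in this repo.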
| Property | Value |
|---|---|
| Parameters | ~7 B |
| Format | MLX safetensors (quantized linear layers + fp16 features) |
| Quantization | 4-bit per-group min-max, group size 64 |
| Sample rate | 16 kHz (raw waveform input) |
| Frame rate | 50 fps |
| Max duration | 40 s |
| Languages | 1,600+ |
| Vocabulary | 10,288 SentencePiece tokens |
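The quantization scheme above can be illustrated with a NumPy sketch: each group of 64 consecutive weights is mapped to 4-bit codes using that group's min and max. This shows the generic min-max recipe only; MLX's actual packed storage layout and kernel-side dequantization differ.

```python
import numpy as np

def quantize_minmax_4bit(w, group_size=64):
    """Per-group min-max quantization to 4-bit codes (0..15).

    Returns integer codes plus per-group scale and bias so that
    w_hat = codes * scale + bias approximates w.
    """
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                  # 2**4 - 1 quantization levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    codes = np.round((g - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, bias):
    return codes * scale + bias

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
codes, scale, bias = quantize_minmax_4bit(w.ravel())
w_hat = dequantize(codes, scale, bias).reshape(w.shape)
# Rounding bounds the per-element error by half a step (scale / 2).
```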
Full architecture details (`num_layers` / `model_dim` / `ffn_dim`) are in `config.json`.
| File | Description |
|---|---|
| `model.safetensors` | 4-bit quantized transformer weights + fp16 conv frontend |
| `tokenizer.model` | SentencePiece tokenizer |
| `config.json` | Architecture + quantization metadata |
```python
import mlx.core as mx
from safetensors import safe_open

# Load the quantized weights as MLX arrays.
weights = {}
with safe_open("model.safetensors", framework="mlx") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)
```
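Since the model expects raw 16 kHz waveforms capped at 40 s (see the table above), longer recordings must be split before inference. A hypothetical NumPy helper; the non-overlapping chunking strategy here is an assumption, not part of this repo:

```python
import numpy as np

SAMPLE_RATE = 16_000  # model input rate, per the table above
MAX_SECONDS = 40      # model's maximum input duration

def chunk_waveform(wav, max_seconds=MAX_SECONDS, sample_rate=SAMPLE_RATE):
    """Split a 1-D waveform into consecutive chunks of at most max_seconds."""
    step = max_seconds * sample_rate
    return [wav[i:i + step] for i in range(0, len(wav), step)]

# 90 s of audio -> three chunks: 40 s, 40 s, 10 s
wav = np.zeros(90 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(wav)
```

Each chunk can then be fed through the encoder and the per-chunk transcripts concatenated; overlap-based stitching would reduce boundary errors but is out of scope here.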
Swift inference is provided by speech-swift.
License: Apache 2.0 (inherited from the upstream model).
Quantized from the base model facebook/omniASR-CTC-7B.