# Omnilingual ASR – CTC 300M (MLX 4-bit)
MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-300M model, targeting on-device inference on Apple Silicon (M1/M2/M3/M4).
Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint needed).
## Model

| | |
|---|---|
| Parameters | 326 M |
| Format | MLX safetensors (quantized linear layers + fp16 features) |
| Quantization | 4-bit per-group min-max, group size 64 |
| Sample rate | 16 kHz (raw waveform input) |
| Frame rate | 50 fps (320× downsampling in the CNN frontend) |
| Max duration | 40 s |
| Languages | 1,600+ |
| Vocabulary | 10,288 SentencePiece tokens |
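The frame geometry above follows directly from the 16 kHz input and the 320× frontend stride; a quick sanity check:

```python
SAMPLE_RATE = 16_000   # Hz, raw waveform input
DOWNSAMPLE = 320       # total stride of the CNN frontend
MAX_SECONDS = 40       # maximum supported clip length

frames_per_second = SAMPLE_RATE / DOWNSAMPLE           # 50.0 fps
max_frames = MAX_SECONDS * SAMPLE_RATE // DOWNSAMPLE   # 2000 frames
print(frames_per_second, max_frames)                   # 50.0 2000
```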
## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 193 MB | 4-bit quantized transformer weights + fp16 conv frontend |
| `tokenizer.model` | 1.2 MB | SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0) |
| `config.json` | <1 KB | Architecture + quantization metadata |
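For inspection, MLX can expand the 4-bit tensors back to full precision with its built-in dequantizer. A minimal sketch, assuming the checkpoint follows MLX's usual `<layer>.weight` / `<layer>.scales` / `<layer>.biases` triplet layout (the layer name below is hypothetical; check config.json for the real key names):

```python
import mlx.core as mx

weights = mx.load("model.safetensors")  # mx.load reads safetensors natively

# Hypothetical key prefix; substitute a real quantized layer from the checkpoint.
name = "encoder.layers.0.self_attn.q_proj"
w_fp = mx.dequantize(
    weights[f"{name}.weight"],        # packed 4-bit values
    scales=weights[f"{name}.scales"],
    biases=weights[f"{name}.biases"],
    group_size=64,                    # matches the quantization metadata above
    bits=4,
)
```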
## Architecture

```
Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, 320× total stride)
  → Linear 512 → 1024
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 24 × StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  → LayerNorm
  → Linear 1024 → 10288 (CTC head)
  → logits [1, samples/320, 10288]
```

Decoding is greedy CTC: take the argmax token per frame, collapse consecutive duplicates, and drop blanks.
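Greedy CTC decoding is only a few lines. A minimal sketch, assuming the blank index is 0 (the actual blank id should be confirmed against config.json or the upstream tokenizer):

```python
import mlx.core as mx
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
CTC_BLANK_ID = 0  # assumption: verify the blank index against config.json

def ctc_greedy_decode(logits: mx.array) -> str:
    """Argmax per frame, collapse consecutive repeats, drop blanks, detokenize."""
    path = mx.argmax(logits[0], axis=-1).tolist()  # [T] best token id per frame
    ids, prev = [], None
    for tok in path:
        if tok != prev and tok != CTC_BLANK_ID:
            ids.append(tok)
        prev = tok
    return sp.decode(ids)
```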
## Performance

FLEURS test set, CTC-300M fp32 on CPU (Apple M series), 30 utterances per language; aggregate WER computed with an exact edit-distance scorer (no external text normalization):
| Language | WER | Audio | Inference | RTF |
|---|---|---|---|---|
| English (en_us) | 20.0% | 289 s | 16.3 s | 0.056 |
| French (fr_fr) | 23.2% | 334 s | 19.5 s | 0.059 |
| German (de_de) | 16.5% | 361 s | 20.8 s | 0.058 |
| Arabic (ar_eg) | 19.5% | 331 s | 17.0 s | 0.051 |
| Hindi (hi_in) | 22.5% | 364 s | 18.2 s | 0.050 |
Aggregate CPU RTF ≈ 0.05; on an M-series GPU via MLX, expect RTF < 0.02. (4-bit quantization typically adds <1% absolute WER on wav2vec2-class models; treat these figures as close upper bounds for the quantized variant.)
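RTF here is simply inference time divided by audio duration, e.g. for the English row:

```python
audio_s, inference_s = 289.0, 16.3   # English (en_us) row above
rtf = inference_s / audio_s          # ≈ 0.056: one second of audio in ~56 ms
```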
## Usage

```python
import mlx.core as mx
from safetensors import safe_open

# Load the 4-bit weights as MLX arrays
# (mx.load("model.safetensors") is an equivalent one-liner).
weights = {}
with safe_open("model.safetensors", framework="mlx") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)

# Your MLX wav2vec2 + CTC implementation consumes these keys.
# Expected input : float32 audio [1, samples] at 16 kHz, zero-mean unit-variance
# Expected output: logits [1, T, 10288], then CTC greedy decode via the
#                  tokenizer in tokenizer.model
```
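A preprocessing sketch matching the expected input above, assuming a 16 kHz mono WAV and the soundfile package (resampling, if needed, is out of scope here):

```python
import mlx.core as mx
import soundfile as sf

audio, sr = sf.read("sample.wav", dtype="float32")
assert sr == 16_000, "model expects 16 kHz input; resample first"
if audio.ndim > 1:
    audio = audio.mean(axis=1)                           # downmix to mono
audio = (audio - audio.mean()) / (audio.std() + 1e-7)    # zero-mean, unit-variance
batch = mx.array(audio)[None, :]                         # [1, samples]
```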
Swift inference is provided by speech-swift (see Sources/OmnilingualASR/).
## Source
- Upstream model: facebook/omniASR-CTC-300M
- Paper: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
- Meta blog: Omnilingual ASR announcement
## Links

- speech-swift – Apple SDK (GitHub: soniqo/speech-swift)
- soniqo.audio – website and docs
- Guide: soniqo.audio/guides/omnilingual
- blog
## License

Apache 2.0 (inherited from upstream).