Omnilingual ASR β€” CTC 300M (MLX 4-bit)

MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-300M model, targeting on-device inference on Apple Silicon (M1/M2/M3/M4).

Omnilingual ASR is a wav2vec 2.0–style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint needed).

Model

Parameters     326 M
Format         MLX safetensors (quantized linear layers + fp16 features)
Quantization   4-bit per-group min-max, group size 64
Sample rate    16 kHz (raw waveform input)
Frame rate     50 fps (320Γ— downsampling in CNN frontend)
Max duration   40 s
Languages      1,600+
Vocabulary     10,288 SentencePiece tokens
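
The 4-bit per-group min-max scheme above can be sketched in NumPy. This is a hedged illustration of the general technique (affine quantization per group of 64 weights), not the exact MLX kernel; the function names are my own:

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Per-group min-max (affine) 4-bit quantization of a flat weight vector."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                     # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)     # guard constant groups
    q = np.round((groups - lo) / scale).astype(np.uint8)  # codes in [0, 15]
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q * scale + lo

w = np.random.randn(128).astype(np.float32)
q, scale, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, lo).reshape(-1)
# rounding error is bounded by half a quantization step per group
assert np.max(np.abs(w - w_hat)) <= scale.max() / 2 + 1e-6
```

Each group stores its codes plus an fp16 scale and offset, which is why the on-disk size is slightly above the raw 4 bits per parameter.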

Files

File               Size    Description
model.safetensors  193 MB  4-bit quantized transformer weights + fp16 conv frontend
tokenizer.model    1.2 MB  SentencePiece tokenizer (bos=0, pad=1, eos=2, unk=3)
config.json        <1 KB   Architecture + quantization metadata

Architecture

Raw audio [1, samples]
  β†’ Wav2Vec2FeatureExtractor (7-layer 1D conv, stride 320Γ—)
  β†’ Linear 512 β†’ 1024
  β†’ Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  β†’ 24 Γ— StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  β†’ LayerNorm
  β†’ Linear 1024 β†’ 10288   (CTC head)
  β†’ logits [1, T/320, 10288]
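
The 320Γ— cumulative stride of the conv frontend fixes the frame geometry: 16,000 samples/s Γ· 320 = 50 frames/s, so the 40 s cap corresponds to 2,000 encoder frames. A quick sanity check:

```python
SAMPLE_RATE = 16_000
STRIDE = 320  # cumulative downsampling of the 7-layer conv frontend

def num_frames(duration_s):
    """Encoder frames produced for a clip of the given duration."""
    return int(duration_s * SAMPLE_RATE) // STRIDE

assert SAMPLE_RATE // STRIDE == 50   # 50 fps frame rate
assert num_frames(40) == 2_000       # frames at the 40 s cap
```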

CTC greedy decoding: take the argmax token per frame, collapse consecutive duplicates, then drop blank tokens.
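
The decode step can be sketched in a few lines. The blank id of 0 below is an assumption for illustration; use the blank/pad id of the actual tokenizer:

```python
import numpy as np

def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decoding: argmax path, collapse runs, remove blanks.

    logits: [T, vocab] array of per-frame scores.
    """
    path = logits.argmax(axis=-1)                 # best id per frame
    prev = np.r_[-1, path[:-1]]                   # shifted path for run detection
    collapsed = [int(i) for i, p in zip(path, prev) if i != p]
    return [i for i in collapsed if i != blank_id]

# Toy logits whose argmax path is [0, 5, 5, 0, 7, 7, 7, 0] -> tokens [5, 7]
frames = [0, 5, 5, 0, 7, 7, 7, 0]
logits = np.zeros((len(frames), 10), dtype=np.float32)
for t, tok in enumerate(frames):
    logits[t, tok] = 1.0
assert ctc_greedy_decode(logits) == [5, 7]
```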

Performance

FLEURS test set, CTC-300M fp32 on CPU (Apple M-series), 30 utterances/language, aggregate WER via exact-edit-distance scorer (no external text normalization):

Language          WER     Audio   Inference  RTF
English (en_us)   20.0%   289 s   16.3 s     0.056
French (fr_fr)    23.2%   334 s   19.5 s     0.059
German (de_de)    16.5%   361 s   20.8 s     0.058
Arabic (ar_eg)    19.5%   331 s   17.0 s     0.051
Hindi (hi_in)     22.5%   364 s   18.2 s     0.050

Aggregate CPU RTF β‰ˆ 0.05; on M-series GPU via MLX, expect RTF < 0.02. (4-bit quantization typically adds <1% absolute WER on wav2vec2-class models; treat these as close upper bounds for the quantized variant.)
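
RTF (real-time factor) here is inference time divided by audio duration, so lower is better and anything below 1 is faster than real time. For example, the English row:

```python
def rtf(inference_s, audio_s):
    """Real-time factor: seconds of compute per second of audio."""
    return inference_s / audio_s

# English row from the table above: 16.3 s of compute for 289 s of audio
assert round(rtf(16.3, 289), 3) == 0.056
```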

Usage

import mlx.core as mx

# mx.load reads a .safetensors file directly into a dict of MLX arrays
weights = mx.load("model.safetensors")

# Your MLX wav2vec2 + CTC implementation consumes these keys.
# Expected input : float32 audio [1, samples] at 16 kHz, zero-mean unit-var
# Expected output: logits [1, T, 10288], then CTC greedy decode via the
#                  tokenizer in tokenizer.model
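
The expected zero-mean, unit-variance input can be produced with a simple per-utterance normalization. A minimal sketch, assuming a 1-D float waveform already resampled to 16 kHz:

```python
import numpy as np

def normalize(audio, eps=1e-7):
    """Per-utterance zero-mean, unit-variance normalization of a 1-D waveform."""
    audio = np.asarray(audio, dtype=np.float32)
    return (audio - audio.mean()) / (audio.std() + eps)

x = normalize(np.random.uniform(-1, 1, 16_000))  # 1 s of audio at 16 kHz
assert abs(x.mean()) < 1e-3 and abs(x.std() - 1.0) < 1e-2
```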

Swift inference is provided by speech-swift (see Sources/OmnilingualASR/).

License

Apache 2.0 (inherited from upstream).

