Omnilingual ASR – CTC 1B (MLX 4-bit)

MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-1B model for on-device inference on Apple Silicon (M1/M2/M3/M4). The 1B variant costs ~360 MB more on disk than the 300M build in exchange for meaningfully better accuracy on low-resource languages (per Meta's published CER/WER curves on FLEURS).

Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time.

Model

Parameters        1.01 B
Format            MLX safetensors (quantized linear layers + fp16 features)
Quantization      4-bit per-group min-max, group size 64
Encoder layers    48
Encoder dim       1280
Attention heads   20
FFN dim           5120
Sample rate       16 kHz (raw waveform input)
Frame rate        50 frames/s
Max duration      40 s
Languages         1,600+
Vocabulary        10,288 SentencePiece tokens
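
The timing rows are mutually consistent: the conv frontend's 320× stride fixes both the frame rate and the frame budget at the 40 s cap. A quick arithmetic check:

sr, stride, max_seconds = 16000, 320, 40
fps = sr // stride              # 16000 / 320 = 50 frames/s
max_frames = fps * max_seconds  # 2000 encoder frames at the 40 s cap
print(fps, max_frames)          # 50 2000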

Files

File               Size    Description
model.safetensors  549 MB  4-bit quantized transformer weights + fp16 conv frontend
tokenizer.model    1.2 MB  SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0)
config.json        <1 KB   Architecture + quantization metadata
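
The special-token ids in the tokenizer row can be checked directly with the sentencepiece package (pad_id() returns -1 if the id was never set, so treat a mismatch as a cue to inspect config.json):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.vocab_size())                                     # 10288
print(sp.bos_id(), sp.pad_id(), sp.eos_id(), sp.unk_id())  # expect 0, 1, 2, 3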

Architecture

Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, stride 320×)
  → Linear 512 → 1280
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 48 × StandardTransformerEncoderLayer (pre-norm, dim 1280, heads 20, ffn 5120)
  → LayerNorm
  → Linear 1280 → 10288   (CTC head)
  → logits [1, T/320, 10288]
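
In mlx.nn, one pre-norm block at these dimensions looks roughly like the sketch below. Class and attribute names are illustrative only and do not match the checkpoint's key names:

import mlx.core as mx
import mlx.nn as nn

class EncoderLayer(nn.Module):
    # Pre-norm transformer block at the card's dims: width 1280, 20 heads, FFN 5120.
    def __init__(self, dim: int = 1280, heads: int = 20, ffn: int = 5120):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiHeadAttention(dim, heads)
        self.ffn_norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, ffn)
        self.fc2 = nn.Linear(ffn, dim)

    def __call__(self, x: mx.array) -> mx.array:
        h = self.attn_norm(x)       # normalize before attention (pre-norm)
        x = x + self.attn(h, h, h)  # self-attention with residual
        h = self.ffn_norm(x)
        return x + self.fc2(nn.gelu(self.fc1(h)))  # feed-forward with residual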

Decoding is greedy CTC: take the argmax path over the logits, collapse consecutive duplicates, then drop blanks.
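
A minimal sketch of that decode; the blank id below is an assumption (CTC checkpoints commonly reserve a special token such as index 0 or the pad token for blank), so verify it against this checkpoint's convention:

import mlx.core as mx
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
BLANK_ID = 0  # assumption: confirm which id this checkpoint uses as the CTC blank

def ctc_greedy_decode(logits: mx.array) -> str:
    # logits: [1, T, vocab] -> best token per frame (the argmax path)
    path = mx.argmax(logits[0], axis=-1).tolist()
    ids, prev = [], None
    for tok in path:
        if tok != prev and tok != BLANK_ID:  # collapse repeats, drop blanks
            ids.append(tok)
        prev = tok
    return sp.decode(ids)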

Performance

Meta's upstream omniASR-CTC-1B model card reports substantial improvements over CTC-300M on low-resource languages. Our MLX 4-bit export preserves the architecture exactly and, consistent with typical group-wise 4-bit behaviour on wav2vec2-class encoders, should land within ~1% absolute WER of the fp32 checkpoint.
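
To get a feel for the weight-level error locally, round-trip a matrix through MLX's built-in affine quantizer at the same settings (group size 64, 4 bits). This assumes the export used MLX's standard grouped quantization scheme:

import mlx.core as mx

w = mx.random.normal((1280, 5120))  # an FFN-sized weight, as in this model
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
rel_err = (mx.abs(w - w_hat).max() / mx.abs(w).max()).item()
print(f"max relative round-trip error: {rel_err:.4f}")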

For a direct comparison with CTC-300M on FLEURS, see the 300M 4-bit card.

Usage

import mlx.core as mx

# mx.load reads safetensors natively and returns a {name: mx.array} dict
weights = mx.load("model.safetensors")

# Your MLX wav2vec2 + CTC implementation consumes these keys.
# Input : float32 audio [1, samples] at 16 kHz, zero-mean unit-variance
# Output: logits [1, T, 10288], then CTC greedy decode via tokenizer.model
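
Input normalization matters for wav2vec2-style frontends. A sketch of the preprocessing, assuming soundfile for I/O (any reader that yields 16 kHz mono float32 works):

import mlx.core as mx
import soundfile as sf

audio, sr = sf.read("sample.wav", dtype="float32")
assert sr == 16000, "resample to 16 kHz before inference"
audio = (audio - audio.mean()) / (audio.std() + 1e-7)  # zero-mean, unit variance
x = mx.array(audio)[None, :]  # [1, samples], ready for the encoder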

Swift inference is provided by speech-swift.


License

Apache 2.0 (inherited from upstream).

