Nemotron ASR MLX

NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.

Native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding.

93 minutes of audio transcribed in under a minute on an M-series Mac.
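A quick arithmetic sanity check on that claim, assuming the ~112x realtime factor reported in the benchmarks below:

```python
# Hedged arithmetic check: at ~112x realtime, 93 minutes of audio
# should take roughly 93 * 60 / 112 seconds to transcribe.
audio_seconds = 93 * 60          # 93 minutes of audio
rtfx = 112                       # realtime factor reported for test-clean
inference_seconds = audio_seconds / rtfx
print(f"{inference_seconds:.1f} s")  # ~49.8 s, i.e. under a minute
```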

Quick Start

pip install nemotron-asr-mlx
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)

# Beam search + ILM subtraction for maximum accuracy
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)

Model downloads automatically on first run (~1.2 GB).

Official WER (Open ASR Leaderboard datasets)

Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.

Dataset                  WER     NVIDIA ref. WER   RTFx (MLX)
LibriSpeech test-clean   2.70%   2.31%             112x
LibriSpeech test-other   5.57%   4.75%             74.8x
TED-LIUM v3              6.25%   4.50%             75.2x

NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.

v0.2.0: mel-frontend parity fixes and a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and raised throughput from 76x to 112x realtime. Optional beam search decoding is available via the beam_size parameter.

Speed Benchmark

Measured on LibriSpeech test-clean, Apple M4 Max, 64 GB.

Content                                    Duration    Inference   Speed
Short utterances (~3s each, 50 samples)    4.1 min     2.7 s       90x RT
Sentences (~10s each, 50 samples)          9.7 min     4.6 s       128x RT
Paragraphs (~30s each, 20 samples)         10.0 min    5.3 s       113x RT
Long-form (14 min single file)             14.3 min    8.2 s       105x RT
Full test-clean (2,620 samples)            5.4 hours   173 s       112x RT
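The realtime factor (RTFx) in these tables is simply audio duration divided by inference time; a minimal sketch reproducing the full test-clean row from the measured values above:

```python
def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Realtime factor: seconds of audio processed per second of compute."""
    return audio_seconds / inference_seconds

# Full test-clean row: 5.4 hours of audio, 173 s of inference
print(round(rtfx(5.4 * 3600, 173)))  # → 112
```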

618.5M parameters. 3.4 GB peak GPU memory. Model loads in 0.1s after first download.

Why This Exists

Most "streaming" ASR on Mac is either Whisper with overlapping windows reprocessing the same audio, or cloud APIs adding network latency. Nemotron's cache-aware conformer is architecturally different:

  • Each frame processed once — state carried forward in fixed-size ring buffers
  • Constant memory — no growing KV caches, no memory spikes on long recordings
  • Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
  • 75x+ realtime — 13.4 hours of benchmarks in under 11 minutes
  • 2.70% WER on LibriSpeech test-clean (as of v0.2.0)
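The "fixed-size ring buffers" point can be illustrated with a toy sketch (not the actual implementation — the real model keeps per-layer attention and convolution caches): a cache that retains only the last N frames of state, so memory stays constant however long the recording runs.

```python
from collections import deque

class RingCache:
    """Toy fixed-size streaming cache: holds at most `context` past frames.
    Illustrative only; names and shapes here are hypothetical."""
    def __init__(self, context: int):
        self.frames = deque(maxlen=context)  # oldest frame evicted automatically

    def push(self, frame):
        self.frames.append(frame)

    def context_window(self):
        return list(self.frames)

cache = RingCache(context=4)
for t in range(10):                  # stream 10 frames through the cache
    cache.push(t)
print(cache.context_window())        # → [6, 7, 8, 9]
```

Memory use is bounded by `context`, which is the architectural reason the encoder never needs to rewind or reprocess audio.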

Architecture

FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows for progressive causal restriction. Greedy decoding with blank suppression.
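The greedy-with-blank-suppression idea can be sketched as a toy loop over precomputed per-frame logits (hypothetical shapes and a stand-in for the predictor update — the real RNNT decoder conditions the joint network on LSTM predictor state):

```python
import numpy as np

BLANK = 0  # conventional blank index in RNNT vocabularies

def greedy_decode(frame_logits: np.ndarray, max_symbols_per_frame: int = 5):
    """Toy greedy RNNT decode with blank-frame skipping.
    If a frame's best label is blank, the frame is skipped immediately,
    spending no predictor/joint work on it."""
    hypothesis = []
    for logits in frame_logits:          # one vocab-sized vector per frame
        if logits.argmax() == BLANK:
            continue                     # blank frame: advance time, emit nothing
        emitted = 0
        while emitted < max_symbols_per_frame:
            label = int(logits.argmax())
            if label == BLANK:
                break
            hypothesis.append(label)
            emitted += 1
            logits = logits.copy()
            logits[label] = -np.inf      # crude stand-in for a predictor update
    return hypothesis

frames = np.array([[5.0, 1.0, 0.0],      # blank wins: skipped
                   [0.0, 3.0, 4.0],      # emits labels 2 then 1
                   [9.0, 0.0, 1.0]])     # blank wins: skipped
print(greedy_decode(frames))             # → [2, 1]
```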

Based on Cache-aware Streaming Conformer and the NeMo toolkit.

Author

Boris Djordjevic / 199 Biotechnologies / @longevityboris

License

Apache 2.0
