Nemotron ASR MLX

NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.

Native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding.

93 minutes of audio transcribed in under a minute on an M-series Mac.
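A quick arithmetic sanity check on that claim, assuming the ~112x realtime factor reported in the benchmarks below:

```python
# Hedged arithmetic check: at ~112x realtime, 93 minutes of audio
# should take roughly 93 * 60 / 112 seconds to transcribe.
audio_seconds = 93 * 60          # 93 minutes of audio
rtfx = 112                       # realtime factor reported for test-clean
inference_seconds = audio_seconds / rtfx
print(f"{inference_seconds:.1f} s")  # ~49.8 s, i.e. under a minute
```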

Quick Start

pip install nemotron-asr-mlx
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)

# Beam search + ILM subtraction for maximum accuracy
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)

Model downloads automatically on first run (~1.2 GB).

Official WER (Open ASR Leaderboard datasets)

Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.

Dataset                  WER     NVIDIA ref. WER   RTFx (MLX)
LibriSpeech test-clean   2.70%   2.31%             112x
LibriSpeech test-other   5.57%   4.75%             74.8x
TED-LIUM v3              6.25%   4.50%             75.2x

NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.

v0.2.0: mel-frontend parity fixes and a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and raised throughput from 76x to 112x realtime. Optional beam search decoding is available via the beam_size parameter.

Speed Benchmark

Measured on LibriSpeech test-clean, Apple M4 Max, 64 GB.

Content                                    Duration    Inference   Speed
Short utterances (~3s each, 50 samples)    4.1 min     2.7 s       90x RT
Sentences (~10s each, 50 samples)          9.7 min     4.6 s       128x RT
Paragraphs (~30s each, 20 samples)         10.0 min    5.3 s       113x RT
Long-form (14 min single file)             14.3 min    8.2 s       105x RT
Full test-clean (2,620 samples)            5.4 hours   173 s       112x RT
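The realtime factor (RTFx) in these tables is simply audio duration divided by inference time; a minimal sketch reproducing the full test-clean row from the measured values above:

```python
def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Realtime factor: seconds of audio processed per second of compute."""
    return audio_seconds / inference_seconds

# Full test-clean row: 5.4 hours of audio, 173 s of inference
print(round(rtfx(5.4 * 3600, 173)))  # → 112
```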

618.5M parameters. 3.4 GB peak GPU memory. Model loads in 0.1s after first download.

Why This Exists

Most "streaming" ASR on Mac is either Whisper with overlapping windows reprocessing the same audio, or cloud APIs adding network latency. Nemotron's cache-aware conformer is architecturally different:

  • Each frame processed once — state carried forward in fixed-size ring buffers
  • Constant memory — no growing KV caches, no memory spikes on long recordings
  • Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
  • 75x+ realtime — 13.4 hours of benchmarks in under 11 minutes
  • 2.70% WER on LibriSpeech test-clean (as of v0.2.0)
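The "fixed-size ring buffers" point can be illustrated with a toy sketch (not the actual implementation — the real model keeps per-layer attention and convolution caches): a cache that retains only the last N frames of state, so memory stays constant however long the recording runs.

```python
from collections import deque

class RingCache:
    """Toy fixed-size streaming cache: holds at most `context` past frames.
    Illustrative only; names and shapes here are hypothetical."""
    def __init__(self, context: int):
        self.frames = deque(maxlen=context)  # oldest frame evicted automatically

    def push(self, frame):
        self.frames.append(frame)

    def context_window(self):
        return list(self.frames)

cache = RingCache(context=4)
for t in range(10):                  # stream 10 frames through the cache
    cache.push(t)
print(cache.context_window())        # → [6, 7, 8, 9]
```

Memory use is bounded by `context`, which is the architectural reason the encoder never needs to rewind or reprocess audio.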

Architecture

FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows for progressive causal restriction. Greedy decoding with blank suppression.
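The greedy-with-blank-suppression idea can be sketched as a toy loop over precomputed per-frame logits (hypothetical shapes and a stand-in for the predictor update — the real RNNT decoder conditions the joint network on LSTM predictor state):

```python
import numpy as np

BLANK = 0  # conventional blank index in RNNT vocabularies

def greedy_decode(frame_logits: np.ndarray, max_symbols_per_frame: int = 5):
    """Toy greedy RNNT decode with blank-frame skipping.
    If a frame's best label is blank, the frame is skipped immediately,
    spending no predictor/joint work on it."""
    hypothesis = []
    for logits in frame_logits:          # one vocab-sized vector per frame
        if logits.argmax() == BLANK:
            continue                     # blank frame: advance time, emit nothing
        emitted = 0
        while emitted < max_symbols_per_frame:
            label = int(logits.argmax())
            if label == BLANK:
                break
            hypothesis.append(label)
            emitted += 1
            logits = logits.copy()
            logits[label] = -np.inf      # crude stand-in for a predictor update
    return hypothesis

frames = np.array([[5.0, 1.0, 0.0],      # blank wins: skipped
                   [0.0, 3.0, 4.0],      # emits labels 2 then 1
                   [9.0, 0.0, 1.0]])     # blank wins: skipped
print(greedy_decode(frames))             # → [2, 1]
```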

Based on Cache-aware Streaming Conformer and the NeMo toolkit.

Author

Boris Djordjevic / 199 Biotechnologies / @longevityboris

License

Apache 2.0
