# Nemotron ASR MLX
NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.
Native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding.
93 minutes of audio transcribed in under a minute on an M-series Mac.
## Quick Start
```bash
pip install nemotron-asr-mlx
```

```python
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)

# Beam search + ILM subtraction for maximum accuracy
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)
```
Model downloads automatically on first run (~1.2 GB).
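The same `transcribe` call works in a loop for batch jobs. A minimal sketch using only the API shown above (the directory and file names are illustrative):

```python
from pathlib import Path

from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")

# Transcribe every WAV in a folder; "recordings/" is a placeholder path.
for wav in sorted(Path("recordings").glob("*.wav")):
    result = model.transcribe(str(wav))
    print(f"{wav.name}: {result.text}")
```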
## Official WER (Open ASR Leaderboard datasets)
Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.
| Dataset | WER (MLX) | WER (NVIDIA ref.) | RTFx |
|---|---|---|---|
| LibriSpeech test-clean | 2.70% | 2.31% | 112x |
| LibriSpeech test-other | 5.57% | 4.75% | 74.8x |
| TED-LIUM v3 | 6.25% | 4.50% | 75.2x |
NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.
v0.2.0: mel-frontend parity fixes plus a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and increased speed from 76x to 112x realtime. Optional beam search decoding is available via the `beam_size` parameter.
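For reference, WER is the word-level edit distance between hypothesis and reference transcripts divided by the reference word count. A minimal sketch of computing it with the `jiwer` library (the library choice and the toy strings are ours; the leaderboard's official normalization pipeline differs):

```python
# Sketch: word error rate with jiwer (pip install jiwer).
# Toy strings for illustration; leaderboard scoring applies its own
# text normalization before computing WER.
import jiwer

refs = ["the quick brown fox jumps over the lazy dog"]
hyps = ["the quick brown fox jumped over the lazy dog"]

print(f"WER: {jiwer.wer(refs, hyps):.2%}")  # 1 substitution / 9 words = 11.11%
```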
## Speed Benchmark
Measured on LibriSpeech test-clean, Apple M4 Max, 64 GB.
| Content | Audio duration | Inference time | Speed |
|---|---|---|---|
| Short utterances (~3s each, 50 samples) | 4.1 min | 2.7s | 90x RT |
| Sentences (~10s each, 50 samples) | 9.7 min | 4.6s | 128x RT |
| Paragraphs (~30s each, 20 samples) | 10.0 min | 5.3s | 113x RT |
| Long-form (14 min single file) | 14.3 min | 8.2s | 105x RT |
| Full test-clean (2,620 samples) | 5.4 hours | 173s | 112x RT |
618.5M parameters. 3.4 GB peak GPU memory. Model loads in 0.1s after first download.
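The realtime factor in these tables is audio duration divided by wall-clock inference time. A worked check against the full test-clean row (not part of the benchmark harness):

```python
# Realtime factor (RTFx) = audio duration / inference wall-clock time.
# Values taken from the full test-clean row above.
audio_seconds = 5.4 * 3600   # 5.4 hours of audio
inference_seconds = 173      # measured wall-clock time

rtfx = audio_seconds / inference_seconds
print(f"{rtfx:.0f}x realtime")  # ~112x
```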
## Why This Exists
Most "streaming" ASR on Mac is either Whisper with overlapping windows reprocessing the same audio, or cloud APIs adding network latency. Nemotron's cache-aware conformer is architecturally different:
- Each frame processed once — state carried forward in fixed-size ring buffers (see the sketch after this list)
- Constant memory — no growing KV caches, no memory spikes on long recordings
- Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
- 75x+ realtime — 13.4 hours of benchmarks in under 11 minutes
- 2.70% WER on LibriSpeech test-clean
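A toy illustration of the fixed-size-cache idea behind the first two points (the class name, sizes, and layout are invented for the sketch; the model's real per-layer caches are more involved):

```python
import numpy as np

class RingCache:
    """Fixed-size left-context cache: old frames are overwritten,
    so memory stays constant no matter how long the audio runs."""

    def __init__(self, context: int, dim: int):
        self.buf = np.zeros((context, dim), dtype=np.float32)
        self.pos = 0       # next write slot (oldest frame once full)
        self.filled = 0

    def push(self, frame: np.ndarray) -> np.ndarray:
        """Store one frame and return the current left context, oldest first."""
        self.buf[self.pos] = frame
        self.pos = (self.pos + 1) % len(self.buf)
        self.filled = min(self.filled + 1, len(self.buf))
        # Rotate so the oldest cached frame comes first.
        ordered = np.roll(self.buf, -self.pos, axis=0)
        return ordered[len(self.buf) - self.filled:]

cache = RingCache(context=4, dim=8)
for t in range(10):
    ctx = cache.push(np.full(8, t, dtype=np.float32))
print(ctx.shape)  # (4, 8): constant-size context after any number of frames
```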
## Architecture
FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows for progressive causal restriction. Greedy decoding with blank suppression.
Based on the Cache-aware Streaming Conformer architecture and NVIDIA's NeMo toolkit.
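For intuition, a toy version of greedy RNNT decoding with a per-frame symbol cap (random matrices stand in for the trained encoder, prediction, and joint networks; this is not the package's decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
BLANK, VOCAB, DIM = 0, 32, 16
MAX_SYMBOLS = 5  # cap on emissions per frame, as real greedy decoders enforce

# Toy stand-ins for the trained networks: fixed random projections.
enc_frames = rng.normal(size=(50, DIM))       # encoder output, one row per frame
W_pred = rng.normal(size=(VOCAB, DIM)) * 0.1  # "prediction network": token embedding
W_joint = rng.normal(size=(DIM, VOCAB)) * 0.1 # "joint network": projection to vocab

def joint(enc_t, pred_state):
    return (enc_t + pred_state) @ W_joint     # logits over the vocabulary

tokens = []
pred_state = np.zeros(DIM)
for enc_t in enc_frames:
    # Emit symbols at this frame until the argmax is blank, then move on.
    # A frame whose first argmax is blank contributes nothing and leaves the
    # predictor state untouched — the cheap path blank-frame skipping exploits.
    for _ in range(MAX_SYMBOLS):
        k = int(np.argmax(joint(enc_t, pred_state)))
        if k == BLANK:
            break
        tokens.append(k)
        pred_state = W_pred[k]                # update predictor with emitted token

print(tokens)
```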
## Links
- GitHub: 199-biotechnologies/nemotron-asr-mlx
- PyPI: nemotron-asr-mlx
- Original model: nvidia/nemotron-asr-speech-streaming-en-0.6b
## Author
Boris Djordjevic / 199 Biotechnologies / @longevityboris
## License
Apache 2.0