---
license: apache-2.0
language:
  - en
library_name: mlx
tags:
  - speech
  - asr
  - speech-recognition
  - streaming
  - apple-silicon
  - mlx
  - nemotron
  - nvidia
  - conformer
  - rnnt
  - cache-aware
pipeline_tag: automatic-speech-recognition
datasets:
  - librispeech_asr
  - LIUM/tedlium
metrics:
  - wer
model-index:
  - name: nemotron-asr-mlx
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (test-clean)
          split: test.clean
        metrics:
          - type: wer
            value: 2.7
            name: WER
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (test-other)
          split: test.other
        metrics:
          - type: wer
            value: 5.52
            name: WER
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          type: LIUM/tedlium
          name: TED-LIUM v3
          split: test
        metrics:
          - type: wer
            value: 6.27
            name: WER
---

# Nemotron ASR MLX

NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.

Native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding.

93 minutes of audio transcribed in under a minute on an M-series Mac.

## Quick Start

```shell
pip install nemotron-asr-mlx
```

```python
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)

# Beam search + ILM subtraction for maximum accuracy
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)
```

Model downloads automatically on first run (~1.2 GB).

## Official WER (Open ASR Leaderboard datasets)

Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.

| Dataset                | WER   | NVIDIA ref | RTFx  |
|------------------------|-------|------------|-------|
| LibriSpeech test-clean | 2.70% | 2.31%      | 112x  |
| LibriSpeech test-other | 5.52% | 4.75%      | 74.8x |
| TED-LIUM v3            | 6.27% | 4.50%      | 75.2x |

NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.

v0.2.0: mel frontend parity fixes plus a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and increased speed from 76x to 112x realtime. Optional beam search decoding is available via the `beam_size` parameter.
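
The blank-frame-skipping idea can be sketched in plain Python: before running the expensive prediction + joint step, a cheap check drops frames whose blank score clearly dominates. Everything below (per-frame score vectors, the `joint_fn` interface, the margin) is a toy assumption for illustration, not this package's actual decoder:

```python
import numpy as np

BLANK = 0  # blank symbol index, by convention

def greedy_decode_skip_blanks(frames, joint_fn, blank_margin=3.0, max_symbols=3):
    """Greedy RNNT-style decode with blank-frame skipping (toy sketch).

    frames       : per-frame score vectors (stand-in for encoder output)
    joint_fn     : joint_fn(frame, last_token) -> logits (hypothetical interface)
    blank_margin : skip a frame outright when its blank score beats every
                   other score by at least this margin
    """
    tokens, last, joint_calls = [], BLANK, 0
    for frame in frames:
        # Cheap pre-check: clearly-blank frames never touch the joint network.
        if frame[BLANK] - frame[1:].max() > blank_margin:
            continue
        for _ in range(max_symbols):  # cap non-blank emissions per frame
            joint_calls += 1
            sym = int(np.argmax(joint_fn(frame, last)))
            if sym == BLANK:
                break
            tokens.append(sym)
            last = sym
    return tokens, joint_calls
```

On speech, most frames are blank, so skipping them means the joint network runs on only a fraction of the encoder output — the kind of saving behind a decoder-side speedup.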

## Speed Benchmark

Measured on LibriSpeech test-clean, Apple M4 Max, 64 GB.

| Content                                  | Duration  | Inference | Speed   |
|------------------------------------------|-----------|-----------|---------|
| Short utterances (~3s each, 50 samples)  | 4.1 min   | 2.7s      | 90x RT  |
| Sentences (~10s each, 50 samples)        | 9.7 min   | 4.6s      | 128x RT |
| Paragraphs (~30s each, 20 samples)       | 10.0 min  | 5.3s      | 113x RT |
| Long-form (14 min single file)           | 14.3 min  | 8.2s      | 105x RT |
| Full test-clean (2,620 samples)          | 5.4 hours | 173s      | 112x RT |

618.5M parameters. 3.4 GB peak GPU memory. Model loads in 0.1s after first download.
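
For reference, the "x RT" figures above are simply audio duration divided by wall-clock inference time:

```python
def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Realtime factor: seconds of audio transcribed per second of compute."""
    return audio_seconds / inference_seconds

# Full test-clean row: 5.4 hours of audio in 173 s.
print(round(rtfx(5.4 * 3600, 173)))  # 112
# Long-form row: 14.3 minutes in 8.2 s.
print(round(rtfx(14.3 * 60, 8.2)))   # 105
```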

## Why This Exists

Most "streaming" ASR on Mac is either Whisper with overlapping windows reprocessing the same audio, or cloud APIs adding network latency. Nemotron's cache-aware conformer is architecturally different:

- Each frame processed once — state carried forward in fixed-size ring buffers
- Constant memory — no growing KV caches, no memory spikes on long recordings
- Native Metal — no PyTorch, no ONNX, no bridge layers; direct MLX on the Apple GPU
- 75x+ realtime — 13.4 hours of benchmarks in under 11 minutes
- 2.70% WER on LibriSpeech test-clean
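
The fixed-size ring-buffer idea behind "each frame processed once" can be sketched in a few lines: a left-context cache holds a constant number of frames no matter how long the stream runs. The sizes and the numpy representation here are illustrative, not this package's internals:

```python
import numpy as np

class RingCache:
    """Fixed-size left-context cache for streaming attention (sketch)."""

    def __init__(self, context_frames: int, dim: int):
        # Starts as zero-padding; memory stays O(context_frames) forever.
        self.buf = np.zeros((context_frames, dim))

    def push(self, new_frames: np.ndarray) -> np.ndarray:
        # Shift the oldest context out, append the newest frames.
        k = min(len(new_frames), len(self.buf))
        self.buf = np.concatenate([self.buf[k:], new_frames[-k:]])
        return self.buf
```

Because the cache never grows, a ten-hour recording uses the same attention-state memory as a ten-second one.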

## Architecture

FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows for progressive causal restriction. Greedy decoding with blank suppression.
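
Concretely, 8x subsampling means each encoder frame covers 8 mel frames. Assuming the common 10 ms mel hop (an assumption — the exact frontend hop is not stated here), that is 80 ms of audio per encoder step:

```python
MEL_HOP_MS = 10   # typical mel frontend hop (assumption, not from this card)
SUBSAMPLING = 8   # 8x depthwise striding, per the architecture above

def encoder_steps(audio_seconds: float) -> int:
    """Number of encoder frames the RNNT decoder sees for a clip."""
    mel_frames = int(audio_seconds * 1000 / MEL_HOP_MS)
    return mel_frames // SUBSAMPLING

print(encoder_steps(60))  # one minute of audio -> 750 encoder steps
```

This short sequence length is one reason the RNNT decode loop stays cheap relative to the encoder.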

Based on Cache-aware Streaming Conformer and the NeMo toolkit.

## Author

Boris Djordjevic / 199 Biotechnologies / @longevityboris

## License

Apache 2.0