# Nemotron Speech 600M - CoreML (Streaming)

Native CoreML conversion of nvidia/nemotron-speech-streaming-en-0.6b, a 600M-parameter streaming ASR model using FastConformer encoder + RNNT decoder. Optimized for Apple Neural Engine (ANE) on Apple Silicon.

Converted directly from the original NeMo checkpoint via coremltools, with 3-level numerical validation against the PyTorch reference.

Sibling project: danielbodart/nemotron-speech-600m-onnx, an ONNX Runtime version for Linux (CUDA + CPU).

## Performance

| Metric | Value |
|---|---|
| Encoder ANE utilization | 93% (1486/1597 ops on Neural Engine) |
| Inference speed | ~15.5x realtime on Apple Silicon |
| Encoder CPU ops | 7% (softmax, attention masking; see config.json for breakdown) |
| Decoder | 100% CPU (CPU_ONLY compute units) |

No manual ANE optimizations applied; the coremltools compiler routes ops automatically.

## Available Precisions

| Variant | Encoder | Decoder | Total | Compute Units | Notes |
|---|---|---|---|---|---|
| fp16/ | 1.1 GB | 17 MB | ~1.1 GB | Encoder: CPU+ANE, Decoder: CPU | Recommended |

Future: INT8 quantization via coremltools (can halve encoder size on ANE).

## Model Architecture

Two CoreML models (the decoder and joint network are fused into one):

| Model | Input | Output | Compute |
|---|---|---|---|
| Encoder | mel [1, 128, 65] + caches (FP32 in, FP16 out) | encoded [1, 1024, 7] + caches | CPU_AND_NE |
| Fused Decoder+Joint | enc_frame [1, 1024, 1] + token [1, 1] + LSTM h,c (FP32 in, FP16 out) | logits [1, 1025] + LSTM h,c | CPU_ONLY |

Mel spectrogram preprocessing runs on the host (not in CoreML).

### Important: ANE stride padding

CoreML output MLMultiArrays may have non-contiguous strides due to ANE alignment padding. For example, the encoder output [1, 1024, 7] may have physical strides [32768, 32, 1] instead of C-contiguous [7168, 7, 1]. Callers must use stride-aware copy, not flat memcpy.
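A minimal numpy sketch of why the flat copy fails, simulating the padded encoder output buffer described above (the padded width of 32 matches the example strides; the numpy array is a stand-in for the MLMultiArray's backing memory):

```python
import numpy as np

# Simulate an ANE-padded encoder output: logical shape [1, 1024, 7], but the
# innermost dimension is physically padded to 32 elements, giving element
# strides [32768, 32, 1] instead of the C-contiguous [7168, 7, 1].
padded = np.zeros((1, 1024, 32), dtype=np.float16)
logical = np.arange(1 * 1024 * 7, dtype=np.float16).reshape(1, 1024, 7)
padded[:, :, :7] = logical  # valid data occupies the first 7 of 32 slots

# WRONG: flat memcpy of the first 1*1024*7 elements pulls in padding.
flat_wrong = padded.ravel()[: 1 * 1024 * 7].reshape(1, 1024, 7)

# RIGHT: a stride-aware view honoring [32768, 32, 1] element strides,
# followed by a contiguous copy.
itemsize = padded.itemsize
view = np.lib.stride_tricks.as_strided(
    padded, shape=(1, 1024, 7),
    strides=(32768 * itemsize, 32 * itemsize, 1 * itemsize))
contiguous = np.ascontiguousarray(view)
```

In Swift the same idea applies: iterate using the MLMultiArray's reported strides rather than assuming a dense layout.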

## Runtime Configuration

All parameters needed to run the model are documented in config.json, including I/O specs, cache shapes, the streaming protocol, and ANE profiling results.

## Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10 ms) |
| Window length | 400 samples (25 ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |
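A numpy sketch of a front-end matching the table above. This is illustrative only: it assumes a symmetric Hann window, no frame centering, a power spectrum, and omits any log/normalization step; check NeMo's reference featurizer for the exact details.

```python
import numpy as np

def hz_to_mel(f):
    # Slaney mel scale: linear below 1 kHz, logarithmic above.
    f = np.asarray(f, dtype=np.float64)
    lin = f / (200.0 / 3.0)
    log = 15.0 + np.log(np.maximum(f, 1e-10) / 1000.0) / (np.log(6.4) / 27.0)
    return np.where(f >= 1000.0, log, lin)

def mel_to_hz(m):
    m = np.asarray(m, dtype=np.float64)
    lin = m * (200.0 / 3.0)
    log = 1000.0 * np.exp((m - 15.0) * (np.log(6.4) / 27.0))
    return np.where(m >= 15.0, log, lin)

def slaney_mel_filterbank(sr=16000, n_fft=512, n_mels=128):
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
        fb[i] *= 2.0 / (hi - lo)  # Slaney area normalization
    return fb

def mel_features(pcm_s16, sr=16000, n_fft=512, hop=160, win=400, n_mels=128):
    x = pcm_s16.astype(np.float32) / 32768.0          # S16_LE -> float in [-1, 1)
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])        # pre-emphasis 0.97
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop              # 160-sample hop, 400-sample window
    frames = np.stack([x[i * hop:i * hop + win] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # power spectrum, 257 bins
    return slaney_mel_filterbank(sr, n_fft, n_mels) @ spec.T  # band-major [n_mels, n_frames]
```

One second of 16 kHz audio yields 98 frames here; the streaming protocol then slices these into 56-frame chunks.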

## Encoder Streaming

| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560 ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| cache_last_channel shape | [1, 24, 70, 1024], FP32 in, FP16 out (init zeros) |
| cache_last_time shape | [1, 24, 1024, 8], FP32 in, FP16 out (init zeros) |
| cache_last_channel_len | [1] int32 (init zero) |

Feed cache outputs back as next chunk's cache inputs. Convert FP16 outputs to FP32 before feeding back (model expects FP32 inputs).
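The cache-feedback loop can be sketched as follows. `run_encoder` is a hypothetical stand-in for the real CoreML predict call, and the `*_out` output names are assumptions; consult config.json for the actual I/O names.

```python
import numpy as np

# Stand-in (assumption) for the CoreML encoder's predict call; like the real
# model, it takes FP32 inputs and returns FP16 outputs.
def run_encoder(inputs):
    return {
        "encoded": np.zeros((1, 1024, 7), dtype=np.float16),
        "cache_last_channel_out": inputs["cache_last_channel"].astype(np.float16),
        "cache_last_time_out": inputs["cache_last_time"].astype(np.float16),
        "cache_last_channel_len_out": inputs["cache_last_channel_len"],
    }

# Initialize caches to zeros with the documented shapes (FP32 inputs).
state = {
    "cache_last_channel": np.zeros((1, 24, 70, 1024), dtype=np.float32),
    "cache_last_time": np.zeros((1, 24, 1024, 8), dtype=np.float32),
    "cache_last_channel_len": np.zeros((1,), dtype=np.int32),
}

encoded_frames = []
for chunk in range(3):                                  # e.g. 3 x 560 ms of audio
    mel = np.zeros((1, 128, 65), dtype=np.float32)      # 56 new + 9 cached mel frames
    out = run_encoder({"mel": mel, **state})
    encoded_frames.append(out["encoded"])
    # Feed caches back, converting FP16 outputs to the FP32 the model expects.
    state = {
        "cache_last_channel": out["cache_last_channel_out"].astype(np.float32),
        "cache_last_time": out["cache_last_time_out"].astype(np.float32),
        "cache_last_channel_len": out["cache_last_channel_len_out"],
    }
```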

## RNNT Decoder

| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden size | 640 |
| input_states_1/2 shape | [2, 1, 640], FP32 in, FP16 out (init zeros) |

For each encoder output frame: feed the single frame [1, 1024, 1] and the last token to the decoder, then take the argmax of the logits over the 1025-entry vocab. If the result is not blank, emit the token and decode again on the same frame (up to 10 symbols); if it is blank, move to the next frame. Feed the decoder states back for the next symbol/frame.
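The greedy loop above can be sketched like this. `run_decoder` is a hypothetical stand-in for the fused decoder+joint model (here a stub that always predicts blank), and starting from the blank token as a start-of-sequence symbol is an assumption:

```python
import numpy as np

BLANK_ID = 1024        # blank token ID
MAX_SYMBOLS = 10       # max symbols emitted per encoder frame

# Stand-in (assumption) for the fused decoder+joint predict call.
# This stub always predicts blank, so decoding emits nothing.
def run_decoder(enc_frame, token, h, c):
    logits = np.zeros((1, 1025), dtype=np.float16)
    logits[0, BLANK_ID] = 1.0
    return logits, h, c

def greedy_decode(encoded):                           # encoded: [1, 1024, T]
    h = np.zeros((2, 1, 640), dtype=np.float32)       # input_states_1
    c = np.zeros((2, 1, 640), dtype=np.float32)       # input_states_2
    token = np.array([[BLANK_ID]], dtype=np.int32)    # assumed SOS convention
    tokens = []
    for t in range(encoded.shape[2]):
        enc_frame = encoded[:, :, t:t + 1]            # single frame [1, 1024, 1]
        for _ in range(MAX_SYMBOLS):
            logits, h_new, c_new = run_decoder(enc_frame, token, h, c)
            pred = int(np.argmax(logits[0]))
            if pred == BLANK_ID:
                break                                 # advance to the next frame
            tokens.append(pred)                       # emit and stay on this frame
            token = np.array([[pred]], dtype=np.int32)
            # LSTM states advance only on non-blank emissions (FP16 -> FP32).
            h, c = h_new.astype(np.float32), c_new.astype(np.float32)
    return tokens
```

Emitted token IDs would then be detokenized with the model's SentencePiece-style vocabulary (see metadata.json).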

## Files

```
config.json                       # Machine-readable runtime parameters + ANE profile

fp16/
├── encoder.mlmodelc/             # Pre-compiled encoder (load directly with MLModel)
│   ├── model.mil
│   ├── coremldata.bin
│   └── weights/weight.bin
├── encoder.mlpackage/            # Source encoder (for runtime compilation fallback)
│   └── Data/com.apple.CoreML/
│       ├── model.mlmodel
│       └── weights/weight.bin
├── decoder.mlmodelc/             # Pre-compiled fused decoder+joint
├── decoder.mlpackage/            # Source fused decoder+joint
└── metadata.json                 # Cache shapes, vocab size, model parameters
```

## Usage

Download:

```bash
# Download compiled models (recommended)
hf download danielbodart/nemotron-speech-600m-coreml fp16/ config.json --local-dir ./model

# Download only .mlmodelc (skip .mlpackage to save space)
hf download danielbodart/nemotron-speech-600m-coreml fp16/encoder.mlmodelc/ fp16/decoder.mlmodelc/ fp16/metadata.json config.json --local-dir ./model
```

Load with CoreML (Swift):

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let encoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/encoder.mlmodelc"),
                          configuration: config)

let decConfig = MLModelConfiguration()
decConfig.computeUnits = .cpuOnly
let decoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/decoder.mlmodelc"),
                          configuration: decConfig)
```

Load with CoreML (Python/coremltools):

```python
import coremltools as ct

encoder = ct.models.MLModel("model/fp16/encoder.mlpackage")
decoder = ct.models.MLModel("model/fp16/decoder.mlpackage")
```

## Conversion Reproducibility

All conversion and validation scripts are in the companion GitHub repo: danielbodart/nemotron-speech-600m-coreml

  • convert.py β€” NeMo β†’ CoreML conversion (wrap, trace, convert, compile)
  • validate.py β€” 3-level validation (wrapper equiv, CoreML vs PyTorch, end-to-end transcript)
  • wrappers.py β€” PyTorch wrappers (EncoderWrapper, FusedDecoderJointWrapper)

Requires macOS on Apple Silicon, Python 3.10, and coremltools 9.0b1.

## License

The original model is licensed under CC-BY-4.0 by NVIDIA.
