# Nemotron Speech 600M - CoreML (Streaming)

Native CoreML conversion of nvidia/nemotron-speech-streaming-en-0.6b, a 600M-parameter streaming ASR model using FastConformer encoder + RNNT decoder. Optimized for Apple Neural Engine (ANE) on Apple Silicon.

Converted directly from the original NeMo checkpoint via coremltools, with 3-level numerical validation against the PyTorch reference.

Sibling project: danielbodart/nemotron-speech-600m-onnx, an ONNX Runtime version for Linux (CUDA + CPU).

## Performance

| Metric | Value |
|---|---|
| Encoder ANE utilization | 93% (1486/1597 ops on Neural Engine) |
| Inference speed | ~15.5x realtime on Apple Silicon |
| Encoder CPU ops | 7% (softmax, attention masking; see config.json for breakdown) |
| Decoder | 100% CPU (CPU_ONLY compute units) |

No manual ANE optimizations applied; the coremltools compiler routes ops automatically.

## Available Precisions

| Variant | Encoder | Decoder | Total | Compute Units | Notes |
|---|---|---|---|---|---|
| fp16/ | 1.1 GB | 17 MB | ~1.1 GB | Encoder: CPU+ANE, Decoder: CPU | Recommended |

Future: INT8 quantization via coremltools (can halve encoder size on ANE).

## Model Architecture

Two CoreML models (the decoder and joint network are fused into one):

| Model | Input | Output | Compute |
|---|---|---|---|
| Encoder | mel [1, 128, 65] + caches (FP32 in, FP16 out) | encoded [1, 1024, 7] + caches | CPU_AND_NE |
| Fused Decoder+Joint | enc_frame [1, 1024, 1] + token [1, 1] + LSTM h,c (FP32 in, FP16 out) | logits [1, 1025] + LSTM h,c | CPU_ONLY |

Mel spectrogram preprocessing runs on the host (not in CoreML).

### Important: ANE stride padding

CoreML output MLMultiArrays may have non-contiguous strides due to ANE alignment padding. For example, the encoder output [1, 1024, 7] may have physical strides [32768, 32, 1] instead of C-contiguous [7168, 7, 1]. Callers must use stride-aware copy, not flat memcpy.
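A minimal numpy sketch of why the flat copy fails, simulating the padded encoder output buffer described above (the padded width of 32 matches the example strides; the numpy array is a stand-in for the MLMultiArray's backing memory):

```python
import numpy as np

# Simulate an ANE-padded encoder output: logical shape [1, 1024, 7], but the
# innermost dimension is physically padded to 32 elements, giving element
# strides [32768, 32, 1] instead of the C-contiguous [7168, 7, 1].
padded = np.zeros((1, 1024, 32), dtype=np.float16)
logical = np.arange(1 * 1024 * 7, dtype=np.float16).reshape(1, 1024, 7)
padded[:, :, :7] = logical  # valid data occupies the first 7 of 32 slots

# WRONG: flat memcpy of the first 1*1024*7 elements pulls in padding.
flat_wrong = padded.ravel()[: 1 * 1024 * 7].reshape(1, 1024, 7)

# RIGHT: a stride-aware view honoring [32768, 32, 1] element strides,
# followed by a contiguous copy.
itemsize = padded.itemsize
view = np.lib.stride_tricks.as_strided(
    padded, shape=(1, 1024, 7),
    strides=(32768 * itemsize, 32 * itemsize, 1 * itemsize))
contiguous = np.ascontiguousarray(view)
```

In Swift the same idea applies: iterate using the MLMultiArray's reported strides rather than assuming a dense layout.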

## Runtime Configuration

All parameters needed to run the model are documented in config.json, including I/O specs, cache shapes, the streaming protocol, and ANE profiling results.

## Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10 ms) |
| Window length | 400 samples (25 ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |
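A numpy sketch of a front-end matching the table above. This is illustrative only: it assumes a symmetric Hann window, no frame centering, a power spectrum, and omits any log/normalization step; check NeMo's reference featurizer for the exact details.

```python
import numpy as np

def hz_to_mel(f):
    # Slaney mel scale: linear below 1 kHz, logarithmic above.
    f = np.asarray(f, dtype=np.float64)
    lin = f / (200.0 / 3.0)
    log = 15.0 + np.log(np.maximum(f, 1e-10) / 1000.0) / (np.log(6.4) / 27.0)
    return np.where(f >= 1000.0, log, lin)

def mel_to_hz(m):
    m = np.asarray(m, dtype=np.float64)
    lin = m * (200.0 / 3.0)
    log = 1000.0 * np.exp((m - 15.0) * (np.log(6.4) / 27.0))
    return np.where(m >= 15.0, log, lin)

def slaney_mel_filterbank(sr=16000, n_fft=512, n_mels=128):
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
        fb[i] *= 2.0 / (hi - lo)  # Slaney area normalization
    return fb

def mel_features(pcm_s16, sr=16000, n_fft=512, hop=160, win=400, n_mels=128):
    x = pcm_s16.astype(np.float32) / 32768.0          # S16_LE -> float in [-1, 1)
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])        # pre-emphasis 0.97
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop              # 160-sample hop, 400-sample window
    frames = np.stack([x[i * hop:i * hop + win] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # power spectrum, 257 bins
    return slaney_mel_filterbank(sr, n_fft, n_mels) @ spec.T  # band-major [n_mels, n_frames]
```

One second of 16 kHz audio yields 98 frames here; the streaming protocol then slices these into 56-frame chunks.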

## Encoder Streaming

| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560 ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| cache_last_channel shape | [1, 24, 70, 1024], FP32 in, FP16 out (init zeros) |
| cache_last_time shape | [1, 24, 1024, 8], FP32 in, FP16 out (init zeros) |
| cache_last_channel_len | [1] int32 (init zero) |

Feed cache outputs back as next chunk's cache inputs. Convert FP16 outputs to FP32 before feeding back (model expects FP32 inputs).
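The cache-feedback loop can be sketched as follows. `run_encoder` is a hypothetical stand-in for the real CoreML predict call, and the `*_out` output names are assumptions; consult config.json for the actual I/O names.

```python
import numpy as np

# Stand-in (assumption) for the CoreML encoder's predict call; like the real
# model, it takes FP32 inputs and returns FP16 outputs.
def run_encoder(inputs):
    return {
        "encoded": np.zeros((1, 1024, 7), dtype=np.float16),
        "cache_last_channel_out": inputs["cache_last_channel"].astype(np.float16),
        "cache_last_time_out": inputs["cache_last_time"].astype(np.float16),
        "cache_last_channel_len_out": inputs["cache_last_channel_len"],
    }

# Initialize caches to zeros with the documented shapes (FP32 inputs).
state = {
    "cache_last_channel": np.zeros((1, 24, 70, 1024), dtype=np.float32),
    "cache_last_time": np.zeros((1, 24, 1024, 8), dtype=np.float32),
    "cache_last_channel_len": np.zeros((1,), dtype=np.int32),
}

encoded_frames = []
for chunk in range(3):                                  # e.g. 3 x 560 ms of audio
    mel = np.zeros((1, 128, 65), dtype=np.float32)      # 56 new + 9 cached mel frames
    out = run_encoder({"mel": mel, **state})
    encoded_frames.append(out["encoded"])
    # Feed caches back, converting FP16 outputs to the FP32 the model expects.
    state = {
        "cache_last_channel": out["cache_last_channel_out"].astype(np.float32),
        "cache_last_time": out["cache_last_time_out"].astype(np.float32),
        "cache_last_channel_len": out["cache_last_channel_len_out"],
    }
```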

## RNNT Decoder

| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden size | 640 |
| input_states_1/2 shape | [2, 1, 640], FP32 in, FP16 out (init zeros) |

For each encoder output frame: feed the single frame [1, 1024, 1] and the last token to the decoder, then take the argmax of the logits over the 1025-entry vocab. If the result is not blank, emit the token and decode again on the same frame (up to 10 symbols); if it is blank, move to the next frame. Feed the decoder states back for the next symbol/frame.
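The greedy loop above can be sketched like this. `run_decoder` is a hypothetical stand-in for the fused decoder+joint model (here a stub that always predicts blank), and starting from the blank token as a start-of-sequence symbol is an assumption:

```python
import numpy as np

BLANK_ID = 1024        # blank token ID
MAX_SYMBOLS = 10       # max symbols emitted per encoder frame

# Stand-in (assumption) for the fused decoder+joint predict call.
# This stub always predicts blank, so decoding emits nothing.
def run_decoder(enc_frame, token, h, c):
    logits = np.zeros((1, 1025), dtype=np.float16)
    logits[0, BLANK_ID] = 1.0
    return logits, h, c

def greedy_decode(encoded):                           # encoded: [1, 1024, T]
    h = np.zeros((2, 1, 640), dtype=np.float32)       # input_states_1
    c = np.zeros((2, 1, 640), dtype=np.float32)       # input_states_2
    token = np.array([[BLANK_ID]], dtype=np.int32)    # assumed SOS convention
    tokens = []
    for t in range(encoded.shape[2]):
        enc_frame = encoded[:, :, t:t + 1]            # single frame [1, 1024, 1]
        for _ in range(MAX_SYMBOLS):
            logits, h_new, c_new = run_decoder(enc_frame, token, h, c)
            pred = int(np.argmax(logits[0]))
            if pred == BLANK_ID:
                break                                 # advance to the next frame
            tokens.append(pred)                       # emit and stay on this frame
            token = np.array([[pred]], dtype=np.int32)
            # LSTM states advance only on non-blank emissions (FP16 -> FP32).
            h, c = h_new.astype(np.float32), c_new.astype(np.float32)
    return tokens
```

Emitted token IDs would then be detokenized with the model's SentencePiece-style vocabulary (see metadata.json).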

## Files

```
config.json                       # Machine-readable runtime parameters + ANE profile

fp16/
├── encoder.mlmodelc/             # Pre-compiled encoder (load directly with MLModel)
│   ├── model.mil
│   ├── coremldata.bin
│   └── weights/weight.bin
├── encoder.mlpackage/            # Source encoder (for runtime compilation fallback)
│   └── Data/com.apple.CoreML/
│       ├── model.mlmodel
│       └── weights/weight.bin
├── decoder.mlmodelc/             # Pre-compiled fused decoder+joint
├── decoder.mlpackage/            # Source fused decoder+joint
└── metadata.json                 # Cache shapes, vocab size, model parameters
```

## Usage

Download:

```bash
# Download compiled models (recommended)
hf download danielbodart/nemotron-speech-600m-coreml fp16/ config.json --local-dir ./model

# Download only .mlmodelc (skip .mlpackage to save space)
hf download danielbodart/nemotron-speech-600m-coreml fp16/encoder.mlmodelc/ fp16/decoder.mlmodelc/ fp16/metadata.json config.json --local-dir ./model
```

Load with CoreML (Swift):

```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let encoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/encoder.mlmodelc"),
                          configuration: config)

let decConfig = MLModelConfiguration()
decConfig.computeUnits = .cpuOnly
let decoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/decoder.mlmodelc"),
                          configuration: decConfig)
```

Load with CoreML (Python/coremltools):

```python
import coremltools as ct

encoder = ct.models.MLModel("model/fp16/encoder.mlpackage")
decoder = ct.models.MLModel("model/fp16/decoder.mlpackage")
```

## Conversion Reproducibility

All conversion and validation scripts are in the companion GitHub repo: danielbodart/nemotron-speech-600m-coreml

  • convert.py β€” NeMo β†’ CoreML conversion (wrap, trace, convert, compile)
  • validate.py β€” 3-level validation (wrapper equiv, CoreML vs PyTorch, end-to-end transcript)
  • wrappers.py β€” PyTorch wrappers (EncoderWrapper, FusedDecoderJointWrapper)

Requires macOS on Apple Silicon, Python 3.10, and coremltools 9.0b1.

## License

The original model is licensed under CC-BY-4.0 by NVIDIA.
