# Nemotron Speech 600M – CoreML (Streaming)
Native CoreML conversion of nvidia/nemotron-speech-streaming-en-0.6b, a 600M-parameter streaming ASR model using FastConformer encoder + RNNT decoder. Optimized for Apple Neural Engine (ANE) on Apple Silicon.
Converted directly from the original NeMo checkpoint via coremltools, with 3-level numerical validation against the PyTorch reference.
Sibling project: danielbodart/nemotron-speech-600m-onnx – ONNX Runtime version for Linux (CUDA + CPU).
## Performance
| Metric | Value |
|---|---|
| Encoder ANE utilization | 93% (1486/1597 ops on Neural Engine) |
| Inference speed | ~15.5x realtime on Apple Silicon |
| Encoder CPU ops | 7% (softmax, attention masking – see config.json for breakdown) |
| Decoder | 100% CPU (CPU_ONLY compute units) |

No manual ANE optimizations applied – the coremltools compiler routes ops automatically.
## Available Precisions
| Variant | Encoder | Decoder | Total | Compute Units | Notes |
|---|---|---|---|---|---|
| `fp16/` | 1.1 GB | 17 MB | ~1.1 GB | Encoder: CPU+ANE, Decoder: CPU | Recommended |
Future: INT8 quantization via coremltools (can halve encoder size on ANE).
## Model Architecture
Two CoreML models (decoder and joint network are fused into one):
| Model | Input | Output | Compute |
|---|---|---|---|
| Encoder | `mel [1, 128, 65]` + caches (FP32 in, FP16 out) | `encoded [1, 1024, 7]` + caches | CPU_AND_NE |
| Fused Decoder+Joint | `enc_frame [1, 1024, 1]` + `token [1, 1]` + LSTM h, c (FP32 in, FP16 out) | `logits [1, 1025]` + LSTM h, c | CPU_ONLY |
Mel spectrogram preprocessing runs on the host (not in CoreML).
## Important: ANE stride padding
CoreML output MLMultiArrays may have non-contiguous strides due to ANE alignment padding. For example, the encoder output [1, 1024, 7] may have physical strides [32768, 32, 1] instead of C-contiguous [7168, 7, 1]. Callers must use stride-aware copy, not flat memcpy.
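As an illustration, here is a minimal numpy sketch of the difference. `as_strided` simulates the padded backing buffer a real MLMultiArray might expose; the shapes and strides are the example values from the paragraph above.

```python
import numpy as np

# Illustrative shapes/strides from the example above: a [1, 1024, 7] output
# padded by ANE alignment to element strides [32768, 32, 1] instead of the
# C-contiguous [7168, 7, 1].
shape = (1, 1024, 7)
padded_strides_elems = (32768, 32, 1)

# Simulate the padded backing buffer CoreML might hand back.
buf = np.arange(32768, dtype=np.float16)
view = np.lib.stride_tricks.as_strided(
    buf, shape=shape,
    strides=tuple(s * buf.itemsize for s in padded_strides_elems))

# Stride-aware copy: np.ascontiguousarray walks the strides correctly.
# A flat memcpy of 1*1024*7 elements from buf would interleave padding bytes
# into the data and produce garbage.
dense = np.ascontiguousarray(view)
assert dense.strides == (14336, 14, 2)  # C-contiguous, in bytes (itemsize 2)
```

In Swift the equivalent is to honor `MLMultiArray.strides` when reading elements rather than assuming a dense layout.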
## Runtime Configuration
All parameters needed to run the model are documented in config.json, including I/O specs, cache shapes, the streaming protocol, and ANE profiling results.
## Audio Preprocessing
| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10ms) |
| Window length | 400 samples (25ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |
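The parameters above can be sketched as a host-side preprocessor. This is a minimal numpy sketch, not the NeMo reference featurizer: window symmetry, the absence of center-padding, and the log floor are assumptions here, so validate against the original preprocessing before relying on it.

```python
import numpy as np

SR, N_FFT, HOP, WIN, N_MELS = 16000, 512, 160, 400, 128

def hz_to_mel(f):
    # Slaney scale: linear below 1 kHz, logarithmic above.
    f = np.asarray(f, dtype=np.float64)
    logstep = np.log(6.4) / 27.0
    return np.where(f >= 1000.0,
                    15.0 + np.log(np.maximum(f, 1e-10) / 1000.0) / logstep,
                    f / (200.0 / 3))

def mel_to_hz(m):
    m = np.asarray(m, dtype=np.float64)
    logstep = np.log(6.4) / 27.0
    return np.where(m >= 15.0, 1000.0 * np.exp((m - 15.0) * logstep),
                    m * (200.0 / 3))

def mel_filterbank():
    # n_mels + 2 triangle edge points, with Slaney area normalization.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), N_MELS + 2))
    bins = np.fft.rfftfreq(N_FFT, 1.0 / SR)
    fb = np.zeros((N_MELS, len(bins)))
    for i in range(N_MELS):
        lo, ctr, hi = pts[i], pts[i + 1], pts[i + 2]
        up = (bins - lo) / (ctr - lo)
        down = (hi - bins) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down)) * 2.0 / (hi - lo)
    return fb

def log_mel(audio):
    # audio: float32 samples in [-1, 1] at 16 kHz (S16_LE scaled by 1/32768).
    pre = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])  # pre-emphasis
    n_frames = 1 + (len(pre) - WIN) // HOP
    window = np.hanning(WIN)  # assumption: symmetric Hann; NeMo may use periodic
    frames = np.stack([pre[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=N_FFT)) ** 2  # power spectrum
    mel = mel_filterbank() @ spec.T                   # band-major [n_mels, n_frames]
    return np.log(mel + 1e-9)                         # assumption: 1e-9 floor
```

Note the return layout: band-major `[n_mels, n_frames]`, matching the table above.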
## Encoder Streaming
| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| `cache_last_channel` shape | `[1, 24, 70, 1024]` FP32 in, FP16 out (init zeros) |
| `cache_last_time` shape | `[1, 24, 1024, 8]` FP32 in, FP16 out (init zeros) |
| `cache_last_channel_len` | `[1]` int32 (init zero) |
Feed cache outputs back as next chunk's cache inputs. Convert FP16 outputs to FP32 before feeding back (model expects FP32 inputs).
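The cache-feedback protocol can be sketched as follows. `encoder_predict` is a stand-in for the actual CoreML call, the input/output names are illustrative (the authoritative names are in config.json), and the zero pre-encode cache for the first chunk is an assumption.

```python
import numpy as np

def encoder_predict(inputs):
    # Stub standing in for the CoreML encoder; a real call (MLModel.predict)
    # returns FP16 outputs with these shapes.
    return {
        "encoded": np.zeros((1, 1024, 7), dtype=np.float16),
        "cache_last_channel_out": np.zeros((1, 24, 70, 1024), dtype=np.float16),
        "cache_last_time_out": np.zeros((1, 24, 1024, 8), dtype=np.float16),
        "cache_last_channel_len_out": np.zeros((1,), dtype=np.int32),
    }

def stream_encoder(mel_chunks):
    # Zero-initialized caches, FP32 as the model expects on input.
    ch = np.zeros((1, 24, 70, 1024), dtype=np.float32)
    tm = np.zeros((1, 24, 1024, 8), dtype=np.float32)
    ln = np.zeros((1,), dtype=np.int32)
    pre_cache = np.zeros((1, 128, 9), dtype=np.float32)  # 9-frame pre-encode cache
    for chunk in mel_chunks:                  # each chunk: [1, 128, 56]
        mel = np.concatenate([pre_cache, chunk], axis=2)  # [1, 128, 65]
        out = encoder_predict({"mel": mel, "cache_last_channel": ch,
                               "cache_last_time": tm,
                               "cache_last_channel_len": ln})
        pre_cache = chunk[:, :, -9:]          # last 9 mel frames, prepended next time
        # Convert FP16 cache outputs back to FP32 before the next call.
        ch = out["cache_last_channel_out"].astype(np.float32)
        tm = out["cache_last_time_out"].astype(np.float32)
        ln = out["cache_last_channel_len_out"].astype(np.int32)
        yield out["encoded"].astype(np.float32)
```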
## RNNT Decoder
| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden | 640 |
| `input_states_1/2` shape | `[2, 1, 640]` FP32 in, FP16 out (init zeros) |
For each encoder output frame:

1. Feed the single frame `[1, 1024, 1]`, the last emitted token, and the LSTM states to the decoder.
2. Argmax the logits over the 1025-entry vocab.
3. If not blank: emit the token, feed the updated decoder states back, and repeat on the same frame (up to 10 symbols).
4. If blank: advance to the next encoder frame.
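The greedy loop can be sketched like this. `decoder_predict` stands in for the fused decoder+joint CoreML call; the feature names and the use of blank as the initial (SOS) token are assumptions here, so check config.json and metadata.json for the real protocol.

```python
import numpy as np

BLANK_ID, VOCAB, MAX_SYMBOLS = 1024, 1025, 10

def greedy_decode(encoded, decoder_predict):
    # encoded: [1, 1024, T] encoder output for one chunk.
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    token = np.array([[BLANK_ID]], dtype=np.int32)  # assumption: blank as SOS
    tokens = []
    for t in range(encoded.shape[2]):
        frame = encoded[:, :, t:t + 1]               # single frame [1, 1024, 1]
        for _ in range(MAX_SYMBOLS):                 # cap symbols per frame
            out = decoder_predict({"enc_frame": frame, "token": token,
                                   "h_in": h, "c_in": c})
            pred = int(np.argmax(out["logits"][0]))  # argmax over 1025 vocab
            if pred == BLANK_ID:
                break                                # blank: next encoder frame
            tokens.append(pred)
            token = np.array([[pred]], dtype=np.int32)
            # States advance only on non-blank emissions; FP16 out -> FP32 in.
            h = out["h_out"].astype(np.float32)
            c = out["c_out"].astype(np.float32)
    return tokens
```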
## Files
```
config.json                  # Machine-readable runtime parameters + ANE profile
fp16/
├── encoder.mlmodelc/        # Pre-compiled encoder (load directly with MLModel)
│   ├── model.mil
│   ├── coremldata.bin
│   └── weights/weight.bin
├── encoder.mlpackage/       # Source encoder (for runtime compilation fallback)
│   └── Data/com.apple.CoreML/
│       ├── model.mlmodel
│       └── weights/weight.bin
├── decoder.mlmodelc/        # Pre-compiled fused decoder+joint
├── decoder.mlpackage/       # Source fused decoder+joint
└── metadata.json            # Cache shapes, vocab size, model parameters
```
## Usage
Download:
```shell
# Download compiled models (recommended)
hf download danielbodart/nemotron-speech-600m-coreml fp16/ config.json --local-dir ./model

# Download only .mlmodelc (skip .mlpackage to save space)
hf download danielbodart/nemotron-speech-600m-coreml fp16/encoder.mlmodelc/ fp16/decoder.mlmodelc/ fp16/metadata.json config.json --local-dir ./model
```
Load with CoreML (Swift):
```swift
import CoreML

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let encoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/encoder.mlmodelc"),
                          configuration: config)

let decConfig = MLModelConfiguration()
decConfig.computeUnits = .cpuOnly
let decoder = try MLModel(contentsOf: URL(fileURLWithPath: "model/fp16/decoder.mlmodelc"),
                          configuration: decConfig)
```
Load with CoreML (Python/coremltools):
```python
import coremltools as ct

encoder = ct.models.MLModel("model/fp16/encoder.mlpackage")
decoder = ct.models.MLModel("model/fp16/decoder.mlpackage")
```
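For a first prediction, all caches start at zero. The sketch below builds the zero-initialized input dict per the shapes documented above; the feature names are illustrative (the authoritative names are in config.json), and the `predict` call itself is commented out since it needs macOS with the models downloaded.

```python
import numpy as np

# Zero-initialized first-chunk encoder inputs (names illustrative;
# see config.json for the authoritative I/O spec).
first_chunk_inputs = {
    "mel": np.zeros((1, 128, 65), dtype=np.float32),
    "cache_last_channel": np.zeros((1, 24, 70, 1024), dtype=np.float32),
    "cache_last_time": np.zeros((1, 24, 1024, 8), dtype=np.float32),
    "cache_last_channel_len": np.zeros((1,), dtype=np.int32),
}

# On macOS, with `encoder` loaded as in the snippet above:
# out = encoder.predict(first_chunk_inputs)
```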
## Conversion Reproducibility
All conversion and validation scripts are in the companion GitHub repo: danielbodart/nemotron-speech-600m-coreml
- `convert.py` – NeMo → CoreML conversion (wrap, trace, convert, compile)
- `validate.py` – 3-level validation (wrapper equivalence, CoreML vs PyTorch, end-to-end transcript)
- `wrappers.py` – PyTorch wrappers (`EncoderWrapper`, `FusedDecoderJointWrapper`)
Requires macOS with Apple Silicon, Python 3.10, coremltools 9.0b1.
## Related

- danielbodart/nemotron-speech-600m-onnx – ONNX Runtime version for Linux (CUDA + CPU, FP16/INT8)
- nvidia/nemotron-speech-streaming-en-0.6b – Original NeMo model
## License
The original model is licensed under CC-BY-4.0 by NVIDIA.