CoreML conversion of NVIDIA's nvidia/nemotron-speech-streaming-en-0.6b for real-time streaming ASR on Apple devices.
Four chunk-size variants optimized for different latency/accuracy trade-offs:
| Variant | Chunk Duration | Latency | Use Case |
|---|---|---|---|
| nemotron_coreml_1120ms | 1.12s | High | Best accuracy |
| nemotron_coreml_560ms | 0.56s | Medium | Balanced |
| nemotron_coreml_160ms | 0.16s | Low | Real-time feedback |
| nemotron_coreml_80ms | 0.08s | Ultra-low | Experimental |
All variants include the same set of CoreML models and support files; only the chunk size differs.
Tested on Apple M2 with FluidAudio:
| Chunk Size | WER | RTFx | Files |
|---|---|---|---|
| 1120ms | 1.99% | 9.6x | 100 |
| 560ms | 2.12% | 8.5x | 100 |
| 160ms | ~10% | 3.5x | 20 |
| 80ms | ~60% | 1.9x | 20 |
The 160ms and 80ms variants were tested on only 20 files, so their WER figures are approximate.
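RTFx in the table above is the inverse real-time factor: seconds of audio transcribed per second of wall-clock compute, so 9.6x means roughly 9.6 seconds of audio processed per second. A quick illustration (the durations below are hypothetical, not from the benchmark):

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed
    per second of wall-clock compute. Values > 1 are faster
    than real time."""
    return audio_seconds / processing_seconds

# Hypothetical example: 96 s of audio processed in 10 s of compute
print(rtfx(96.0, 10.0))  # 9.6
```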
| Property | Value |
|---|---|
| Source Model | nvidia/nemotron-speech-streaming-en-0.6b |
| Architecture | FastConformer RNNT (Streaming) |
| Parameters | 0.6B |
| Sample Rate | 16kHz |
| Mel Features | 128 bins |
| Quantization | Int8 (encoder) |
| Model | Size | Function |
|---|---|---|
| preprocessor.mlmodelc | ~1MB | audio → 128-dim mel spectrogram |
| encoder/encoder_int8.mlmodelc | ~564MB | mel + cache → encoded + new_cache |
| decoder.mlmodelc | ~28MB | token + LSTM state → decoder_out + new_state |
| joint.mlmodelc | ~7MB | encoder + decoder → logits |
Plus:
- `metadata.json` - Model configuration (chunk size, mel frames, etc.)
- `tokenizer.json` - Vocabulary (1024 tokens)

```
nemotron-speech-streaming-en-0.6b-coreml/
├── nemotron_coreml_1120ms/    # 1.12s chunks (best accuracy)
│   ├── encoder/
│   │   └── encoder_int8.mlmodelc
│   ├── preprocessor.mlmodelc
│   ├── decoder.mlmodelc
│   ├── joint.mlmodelc
│   ├── metadata.json
│   └── tokenizer.json
├── nemotron_coreml_560ms/     # 0.56s chunks (balanced)
│   └── ...
├── nemotron_coreml_160ms/     # 0.16s chunks (low latency)
│   └── ...
└── nemotron_coreml_80ms/      # 0.08s chunks (experimental)
    └── ...
```
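The exact schema of `metadata.json` is not spelled out here beyond "chunk size, mel frames, etc.", so the key names below are assumptions for illustration only; a loader might select a variant like this:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata.json contents -- the real key names may differ;
# only "chunk size, mel frames, etc." is documented above.
sample = {"chunk_mel_frames": 56, "pre_encode_cache": 9, "sample_rate": 16000}

with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "metadata.json"
    path.write_text(json.dumps(sample))

    meta = json.loads(path.read_text())
    chunk_ms = meta["chunk_mel_frames"] * 10  # each mel frame covers 10 ms
    print(chunk_ms)  # 560
```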
Each variant has different mel frame counts:
| Variant | chunk_mel_frames | pre_encode_cache | total_mel_frames |
|---|---|---|---|
| 1120ms | 112 | 9 | 121 |
| 560ms | 56 | 9 | 65 |
| 160ms | 16 | 9 | 25 |
| 80ms | 8 | 9 | 17 |
Formula: chunk_ms = chunk_mel_frames × 10ms
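The formula can be checked against the mel-frame table above (the dictionary below just restates those numbers):

```python
# Mel-frame counts copied from the table above; each mel frame covers 10 ms.
variants = {
    "1120ms": {"chunk_mel_frames": 112, "pre_encode_cache": 9},
    "560ms":  {"chunk_mel_frames": 56,  "pre_encode_cache": 9},
    "160ms":  {"chunk_mel_frames": 16,  "pre_encode_cache": 9},
    "80ms":   {"chunk_mel_frames": 8,   "pre_encode_cache": 9},
}

for name, v in variants.items():
    chunk_ms = v["chunk_mel_frames"] * 10                   # chunk_ms = chunk_mel_frames x 10ms
    total = v["chunk_mel_frames"] + v["pre_encode_cache"]   # total_mel_frames
    assert name == f"{chunk_ms}ms"                          # variant name encodes the duration
    print(name, chunk_ms, total)
```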
| Cache | Shape | Description |
|---|---|---|
| cache_channel | [1, 24, 70, 1024] | Attention context cache |
| cache_time | [1, 24, 1024, 8] | Convolution time cache |
| cache_len | [1] | Cache fill level |
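FluidAudio manages these caches internally; as a sketch of what the shapes above mean, a fresh stream would start with zero-filled caches (zero-initialization and the int32 dtype for `cache_len` are assumptions):

```python
import numpy as np

def fresh_cache(layers: int = 24, att_ctx: int = 70,
                hidden: int = 1024, conv_ctx: int = 8) -> dict:
    """Zero-initialized streaming caches with the shapes from the
    table above. Zero-filling at stream start is an assumption;
    the real runtime handles this internally."""
    return {
        "cache_channel": np.zeros((1, layers, att_ctx, hidden), dtype=np.float32),
        "cache_time":    np.zeros((1, layers, hidden, conv_ctx), dtype=np.float32),
        "cache_len":     np.zeros((1,), dtype=np.int32),
    }

cache = fresh_cache()
print(cache["cache_channel"].shape)  # (1, 24, 70, 1024)
print(cache["cache_time"].shape)     # (1, 24, 1024, 8)
```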
```swift
import FluidAudio

// Load with a specific chunk size
let manager = NemotronStreamingAsrManager()
let modelDir = URL(fileURLWithPath: "/path/to/nemotron_coreml_560ms")
try await manager.loadModels(modelDir: modelDir)

// Process audio
let result = try await manager.process(audioBuffer: buffer)
let transcript = try await manager.finish()
```
```bash
# Install FluidAudio CLI
git clone https://github.com/FluidInference/FluidAudio
cd FluidAudio

# Run benchmark with specific chunk size
swift run -c release fluidaudiocli nemotron-benchmark --chunk 560 --max-files 100
```
```
┌──────────────────────────────────────────────────────────────────┐
│                     STREAMING RNNT PIPELINE                      │
└──────────────────────────────────────────────────────────────────┘

1. PREPROCESSOR (per audio chunk)
   audio [1, samples] → mel [1, 128, chunk_mel_frames]

2. ENCODER (with cache)
   mel [1, 128, total_mel_frames] + cache → encoded [1, 1024, T] + new_cache
   (total_mel_frames = pre_encode_cache + chunk_mel_frames)

3. DECODER + JOINT (greedy loop per encoder frame)
   For each encoder frame:
     token → DECODER → decoder_out
     encoder_step + decoder_out → JOINT → logits
     argmax → predicted token
     if token == BLANK: next encoder frame
     else: emit token, update decoder state
```
The encoder is quantized to int8 using CoreMLTools:
| Metric | Float32 | Int8 |
|---|---|---|
| Size | ~2.2GB | ~564MB |
| Compression | 1x | 3.9x |
| WER Impact | Baseline | Negligible |
Other models (preprocessor, decoder, joint) remain in float32 as they are already small.
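As a rough illustration of what linear symmetric int8 weight quantization does (a numpy simulation, not the CoreMLTools API): each weight tensor is scaled so its largest magnitude maps to ±127, rounded to int8, and dequantized by the same scale at load time.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor linear symmetric int8 quantization: the largest
    |weight| maps to 127, everything else scales proportionally."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32 per weight; the ~3.9x figure
# above is lower because scales and unquantized layers add overhead.
print(q.dtype, w.nbytes / q.nbytes)  # int8 4.0
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6  # rounding error bound
```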
License: Apache 2.0 (following NVIDIA's original license)

Base model: nvidia/nemotron-speech-streaming-en-0.6b