# Nemotron Speech Streaming 0.6B - CoreML

CoreML conversion of NVIDIA's `nvidia/nemotron-speech-streaming-en-0.6b` for real-time streaming ASR on Apple devices.

## Model Variants

Four chunk-size variants optimized for different latency/accuracy trade-offs:

| Variant | Chunk Duration | Latency | Use Case |
|---|---|---|---|
| `nemotron_coreml_1120ms` | 1.12 s | High | Best accuracy |
| `nemotron_coreml_560ms` | 0.56 s | Medium | Balanced |
| `nemotron_coreml_160ms` | 0.16 s | Low | Real-time feedback |
| `nemotron_coreml_80ms` | 0.08 s | Ultra-low | Experimental |

All variants include:

- Int8-quantized encoder (~564 MB, ~4x smaller than float32)
- Compiled `.mlmodelc` format (ready for deployment)

## Benchmark Results (LibriSpeech test-clean)

Tested on Apple M2 with FluidAudio:

| Chunk Size | WER | RTFx | Files |
|---|---|---|---|
| 1120 ms | 1.99% | 9.6x | 100 |
| 560 ms | 2.12% | 8.5x | 100 |
| 160 ms | ~10% | 3.5x | 20 |
| 80 ms | ~60% | 1.9x | 20 |

The 160 ms and 80 ms variants were tested on only 20 files, so their WER figures are rough estimates.
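WER here is the standard word error rate: word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (not FluidAudio's implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count.

    Assumes a non-empty, whitespace-tokenized reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution (free on match)
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion over four reference words.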

## Model Overview

| Property | Value |
|---|---|
| Source Model | `nvidia/nemotron-speech-streaming-en-0.6b` |
| Architecture | FastConformer RNNT (streaming) |
| Parameters | 0.6B |
| Sample Rate | 16 kHz |
| Mel Features | 128 bins |
| Quantization | Int8 (encoder) |

## CoreML Models (per variant)

| Model | Size | Function |
|---|---|---|
| `preprocessor.mlmodelc` | ~1 MB | audio → 128-dim mel spectrogram |
| `encoder/encoder_int8.mlmodelc` | ~564 MB | mel + cache → encoded + new_cache |
| `decoder.mlmodelc` | ~28 MB | token + LSTM state → decoder_out + new_state |
| `joint.mlmodelc` | ~7 MB | encoder + decoder → logits |

Plus:

- `metadata.json` - model configuration (chunk size, mel frames, etc.)
- `tokenizer.json` - vocabulary (1024 tokens)

## Directory Structure

```
nemotron-speech-streaming-en-0.6b-coreml/
├── nemotron_coreml_1120ms/      # 1.12s chunks (best accuracy)
│   ├── encoder/
│   │   └── encoder_int8.mlmodelc
│   ├── preprocessor.mlmodelc
│   ├── decoder.mlmodelc
│   ├── joint.mlmodelc
│   ├── metadata.json
│   └── tokenizer.json
├── nemotron_coreml_560ms/       # 0.56s chunks (balanced)
│   └── ...
├── nemotron_coreml_160ms/       # 0.16s chunks (low latency)
│   └── ...
└── nemotron_coreml_80ms/        # 0.08s chunks (experimental)
    └── ...
```

## Chunk Configuration

Each variant uses a different number of mel frames per chunk:

| Variant | chunk_mel_frames | pre_encode_cache | total_mel_frames |
|---|---|---|---|
| 1120 ms | 112 | 9 | 121 |
| 560 ms | 56 | 9 | 65 |
| 160 ms | 16 | 9 | 25 |
| 80 ms | 8 | 9 | 17 |

Formula: `chunk_ms = chunk_mel_frames × 10 ms` (one mel frame per 10 ms hop).
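The table rows all follow the same arithmetic: a chunk's mel frame count is its duration divided by the 10 ms hop, and the encoder input prepends the 9-frame pre-encode cache. A small sketch:

```python
MS_PER_MEL_FRAME = 10    # hop size implied by chunk_ms = chunk_mel_frames x 10 ms
PRE_ENCODE_CACHE = 9     # frames carried over from the previous chunk

def mel_frames(chunk_ms: int) -> tuple:
    """Return (chunk_mel_frames, total_mel_frames) for a chunk duration in ms."""
    chunk = chunk_ms // MS_PER_MEL_FRAME
    return chunk, chunk + PRE_ENCODE_CACHE
```

For example, `mel_frames(560)` gives `(56, 65)`, matching the 560 ms row above.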

## Cache Shapes

| Cache | Shape | Description |
|---|---|---|
| `cache_channel` | [1, 24, 70, 1024] | Attention context cache |
| `cache_time` | [1, 24, 1024, 8] | Convolution time cache |
| `cache_len` | [1] | Cache fill level |
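At the start of a stream these caches are zero-filled and then threaded through every encoder call. A dependency-free sketch of the initialization, using the shapes from the table (the dictionary keys are illustrative, not necessarily the CoreML input names):

```python
# Shapes from the table above: [batch, layers, context, channels] etc.
CACHE_SHAPES = {
    "cache_channel": (1, 24, 70, 1024),  # attention context per layer
    "cache_time":    (1, 24, 1024, 8),   # convolution history per layer
    "cache_len":     (1,),               # how many cache slots are filled
}

def zeros(shape):
    """Build a zero-filled nested list with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]

def init_caches():
    """Fresh caches for the start of a new audio stream."""
    return {name: zeros(shape) for name, shape in CACHE_SHAPES.items()}
```

In practice these would be multi-arrays handed to the encoder model, with each call's `new_cache` outputs fed back in for the next chunk.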

## Usage with FluidAudio

```swift
import FluidAudio

// Load models from a specific chunk-size variant
let manager = NemotronStreamingAsrManager()
let modelDir = URL(fileURLWithPath: "/path/to/nemotron_coreml_560ms")
try await manager.loadModels(modelDir: modelDir)

// Stream audio, then finalize the transcript
let result = try await manager.process(audioBuffer: buffer)
let transcript = try await manager.finish()
```
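A caller has to slice its audio into fixed-size chunks before feeding them in. At 16 kHz, a 560 ms chunk is 8,960 samples; a language-agnostic sketch of the chunking (the padding behavior for the final partial chunk is an assumption, not something this model card specifies):

```python
SAMPLE_RATE = 16_000  # model's expected sample rate

def chunked(samples, chunk_ms=560):
    """Split a mono 16 kHz signal into fixed-size chunks, zero-padding the last."""
    n = SAMPLE_RATE * chunk_ms // 1000  # 8960 samples for 560 ms
    for start in range(0, len(samples), n):
        chunk = samples[start:start + n]
        yield chunk + [0.0] * (n - len(chunk))  # pad the final partial chunk
```

Each yielded chunk would be handed to the preprocessor/encoder in order, with the encoder caches carried across calls.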

## CLI Benchmark

```bash
# Install the FluidAudio CLI
git clone https://github.com/FluidInference/FluidAudio
cd FluidAudio

# Run the benchmark with a specific chunk size
swift run -c release fluidaudiocli nemotron-benchmark --chunk 560 --max-files 100
```

## Inference Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                     STREAMING RNNT PIPELINE                      │
└──────────────────────────────────────────────────────────────────┘

1. PREPROCESSOR (per audio chunk)
   audio [1, samples] → mel [1, 128, chunk_mel_frames]

2. ENCODER (with cache)
   mel [1, 128, total_mel_frames] + cache → encoded [1, 1024, T] + new_cache
   (total_mel_frames = pre_encode_cache + chunk_mel_frames)

3. DECODER + JOINT (greedy loop per encoder frame)
   For each encoder frame:
     token → DECODER → decoder_out
     encoder_step + decoder_out → JOINT → logits
     argmax → predicted token
     if token == BLANK: next encoder frame
     else: emit token, update decoder state
```
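Step 3 can be sketched as a plain greedy loop, with the decoder and joint left as stubs (the symbols-per-frame guard is a common safety measure, not something this model card specifies; `BLANK = 1024` follows the Notes section):

```python
BLANK = 1024                # RNNT blank index == vocab_size (see Notes)
MAX_SYMBOLS_PER_FRAME = 10  # guard against runaway emission on one frame

def greedy_rnnt_decode(encoder_frames, decoder_step, joint):
    """Greedy RNNT decoding over one chunk's encoder output.

    decoder_step(token, state) -> (decoder_out, new_state)
    joint(enc_frame, decoder_out) -> logits over vocab_size + 1 symbols
    """
    tokens, state, last = [], None, BLANK
    for enc in encoder_frames:
        for _ in range(MAX_SYMBOLS_PER_FRAME):
            dec_out, new_state = decoder_step(last, state)
            logits = joint(enc, dec_out)
            tok = max(range(len(logits)), key=logits.__getitem__)  # argmax
            if tok == BLANK:
                break  # blank: advance to the next encoder frame
            tokens.append(tok)          # non-blank: emit token and
            last, state = tok, new_state  # update the decoder state
    return tokens
```

Note that the decoder state advances only on non-blank emissions, which is what lets the model emit zero or several tokens per encoder frame.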

## Quantization Details

The encoder is quantized to int8 using coremltools:

| Metric | Float32 | Int8 |
|---|---|---|
| Size | ~2.2 GB | ~564 MB |
| Compression | 1x | 3.9x |
| WER Impact | Baseline | Negligible |

The other components (preprocessor, decoder, joint) remain in float32, since they are already small.
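The ~3.9x ratio follows directly from the per-weight storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight, with the small shortfall from 4.0x accounted for by quantization scales and model metadata. A back-of-the-envelope sketch (the ~550M encoder parameter count is inferred from the file sizes above, not stated in this card):

```python
def weight_storage_bytes(n_params: int, bits: int) -> int:
    """Ideal weight storage size, ignoring quantization scales and metadata."""
    return n_params * bits // 8

# Assuming ~550M encoder parameters (an inference from the sizes above):
fp32_bytes = weight_storage_bytes(550_000_000, 32)  # ~2.2 GB
int8_bytes = weight_storage_bytes(550_000_000, 8)   # ~0.55 GB
ideal_ratio = fp32_bytes / int8_bytes               # 4.0x ideal vs 3.9x measured
```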

## Notes

- The encoder is the largest component, with 24 Conformer layers
- The model uses 128 mel bins (rather than the more common 80)
- The RNNT blank token index is 1024 (equal to the vocabulary size)
- The decoder is a 2-layer LSTM with 640 hidden units
- The pre-encode cache (9 frames = 90 ms) bridges chunk boundaries

## License

Apache 2.0, following NVIDIA's original license.
