Parakeet Realtime EOU 120M — CoreML

CoreML conversion of [nvidia/parakeet-realtime-eou-120m-v1](https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1) for streaming speech recognition with end-of-utterance detection on Apple Silicon.

Used by FluidAudio for real-time transcription.

Models

The RNNT pipeline is split into three CoreML models, exported at two chunk sizes:

Model	Description
`streaming_encoder.mlmodelc`	FastConformer encoder with loopback state caching
`decoder.mlmodelc`	1-layer LSTM decoder (640 hidden units)
`joint_decision.mlmodelc`	Joint network for token prediction + EOU detection
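
To illustrate how an encoder, decoder, and joint network typically cooperate in a transducer, here is a minimal sketch of a generic greedy RNNT loop. It is not FluidAudio's implementation: `greedy_transducer_decode`, `decoder_step`, `joint`, and the toy stand-ins are hypothetical names chosen for this example, and no CoreML model is loaded.

```python
def greedy_transducer_decode(encoder_frames, decoder_step, joint, blank_id, max_symbols=10):
    """Generic greedy RNNT loop; the callables are hypothetical stand-ins."""
    tokens = []
    dec_out, state = decoder_step(None, None)    # prime the decoder before any token
    for frame in encoder_frames:                 # one encoder output frame at a time
        for _ in range(max_symbols):             # cap symbols emitted per frame
            token = joint(frame, dec_out)
            if token == blank_id:
                break                            # blank: move on to the next frame
            tokens.append(token)
            dec_out, state = decoder_step(token, state)  # decoder advances only on emit
    return tokens

# Toy stand-ins so the loop can run without any model files.
BLANK = -1

def toy_decoder_step(token, state):
    return token, state          # "decoder output" is just the last emitted token

def toy_joint(frame, dec_out):
    return BLANK if frame == dec_out else frame

print(greedy_transducer_decode([3, 3, 5], toy_decoder_step, toy_joint, BLANK))
```

The point of the sketch is the control flow: the encoder runs once per frame, while the decoder only advances when the joint emits a non-blank token.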

Chunk Size Variants

Variant	Latency	WER (test-clean)	RTFx (M2)
`160ms/`	160ms	8.29%	4.78x
`320ms/`	320ms	4.87%	12.48x

Benchmarked on LibriSpeech test-clean (2620 files, 5.40h audio) on Apple M2.
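
As a rough reading of the table, the sketch below converts each chunk size into samples and each RTFx figure into wall-clock processing time for the 5.40 h test set. Two things are assumed rather than stated in this README: a 16 kHz input sample rate, and RTFx defined as audio duration divided by processing time.

```python
AUDIO_HOURS = 5.40          # from the benchmark note above
SAMPLE_RATE_HZ = 16_000     # assumption: input rate is not stated in this README

VARIANTS = {
    "160ms": {"chunk_ms": 160, "rtfx": 4.78},
    "320ms": {"chunk_ms": 320, "rtfx": 12.48},
}

def processing_seconds(audio_hours: float, rtfx: float) -> float:
    """Wall-clock seconds, assuming RTFx = audio duration / processing time."""
    return audio_hours * 3600 / rtfx

def samples_per_chunk(chunk_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one chunk at the assumed sample rate."""
    return chunk_ms * sample_rate_hz // 1000

for name, v in VARIANTS.items():
    secs = processing_seconds(AUDIO_HOURS, v["rtfx"])
    print(f"{name}: {samples_per_chunk(v['chunk_ms'])} samples/chunk, "
          f"~{secs / 60:.1f} min to process the test set")
```

Under these assumptions the 320ms variant would finish the test set in roughly 26 minutes, against roughly 68 minutes for the 160ms variant.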

Usage with FluidAudio

import FluidAudio

let manager = StreamingEouAsrManager()
await manager.initialize()

// Transcribe with EOU detection
await manager.startStreaming(
    eouCallback: { transcript in
        print("Utterance complete: \(transcript)")
    },
    partialCallback: { partial in
        print("Partial: \(partial)")
    }
)

// Feed audio chunks as they arrive
await manager.feedAudio(samples)

CLI

Transcribe a file

swift run fluidaudio parakeet-eou --input audio.wav

Benchmark

swift run -c release fluidaudio parakeet-eou --benchmark --chunk-size 320

Architecture

120M-parameter RNNT (Recurrent Neural Network Transducer) with:

Encoder: 17-layer FastConformer with cache-aware streaming
Decoder: 1-layer LSTM, 640 hidden size
Joint: Linear projection with 1027 output classes (1024 tokens + EOU token + SOS + blank)
EOU token: ID 1024 signals end-of-utterance
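
The role of the EOU token can be sketched as a small post-processing step: emitted token IDs accumulate into a partial transcript until ID 1024 appears, at which point the utterance is closed. Only the EOU ID comes from this README. The blank ID used here (1026) and the helper name `split_utterances` are assumptions for illustration, and mapping IDs back to text is left out.

```python
EOU_ID = 1024      # from the architecture notes above
BLANK_ID = 1026    # assumption: the README does not give the blank ID

def split_utterances(token_ids):
    """Group emitted token IDs into utterances, closing one at each EOU token."""
    finished, current = [], []
    for tid in token_ids:
        if tid == BLANK_ID:
            continue                  # blank: nothing was emitted at this step
        if tid == EOU_ID:
            if current:
                finished.append(current)
            current = []              # start collecting the next utterance
            continue
        current.append(tid)
    return finished, current          # completed utterances, pending partial

done, partial = split_utterances([12, 7, 1026, 1024, 5, 1026, 9])
print(done, partial)   # [[12, 7]] [5, 9]
```

In terms of the FluidAudio usage shown earlier, `done` loosely corresponds to what an EOU callback would receive and `partial` to what a partial callback would receive.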

Streaming State

The encoder maintains loopback state between chunks:

State	Shape	Description
preCache	[1, 128, N]	Mel-level context
cacheLastChannel	[17, 1, 70, 512]	Conformer layer cache
cacheLastTime	[17, 1, 512, 8]	Temporal cache
cacheLastChannelLen	[1]	Cache length tracking

Export

Converted from PyTorch using coremltools. To re-export:

python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
    --output-dir Models/ParakeetEOU \
    --model-id nvidia/parakeet-realtime-eou-120m-v1

License

NVIDIA Open Model License — see https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/.

Original model: https://huggingface.co/nvidia/parakeet-realtime-eou-120m-v1
