---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
---

# Parakeet EOU Integration & Findings

This directory contains the scripts and documentation for integrating the NVIDIA Parakeet Realtime EOU 120M model into FluidAudio.

## Executive Summary

* **Goal:** Enable low-latency, streaming speech recognition with End-of-Utterance (EOU) detection on Apple Silicon.
* **Result:** The "Authentic Streaming" mode of the `nvidia/parakeet-realtime-eou-120m-v1` model is **fundamentally broken** (it produces garbage output).
* **Solution:** We implemented a **"Short Batch" strategy**: we use the model's **Batch Encoder** (which works correctly) with small, fixed-size input chunks (1.28 s). This yields far better accuracy than streaming (~40% vs. ~76% WER) with streaming-like latency (~1.3 s).

## Directory Structure

* `Conversion/`: Scripts to export the PyTorch model to CoreML.
  * `convert_split_encoder.py`: **(Primary)** Exports the "Short Batch" model (1.28 s chunks).
  * `convert_parakeet_eou.py`: Original export script.
  * `individual_components.py`: Shared model definitions.
* `Inference/`: Scripts to test and verify the model in Python.
  * `test_full_pytorch_streaming.py`: **(Proof)** Demonstrates that the original PyTorch model fails in streaming mode.
  * `debug_nemo_streaming.py`: Debug script for streaming logic.

## The Journey & Findings

### 1. The Streaming Failure

We initially attempted to use the model's native streaming encoder (`CacheAwareStreamingConfig`).

* **Observation:** The model produced garbage output (e.g., "z", "znions", "arsith") regardless of the input audio.
* **Investigation:**
  * We verified the CoreML export numerically against PyTorch (it matched).
  * We implemented audio buffering (NeMo-style) to fix edge artifacts.
  * We tested various chunk sizes (160 ms, 320 ms, 640 ms, 1280 ms).
* **Root Cause:** We ran `test_full_pytorch_streaming.py` using the *original* NeMo library and model. It *also* produced garbage. This confirmed that the **model weights themselves** are likely untrained or incompatible with the streaming configuration exposed in the checkpoint.

### 2. The "Short Batch" Solution

Since the **Batch Encoder** (FastConformer) works correctly (~3-4% WER on clean audio), we pivoted to using it for pseudo-streaming.

* **Method:** We re-exported the Batch Encoder to accept a fixed input size of **128 Mel frames (1.28 seconds)**.
* **Implementation:** `BatchEouAsrManager.swift` accumulates audio, feeds 1.28 s chunks to the encoder, and preserves the RNNT Decoder's state (LSTM hidden/cell states) between chunks to maintain context (see the sketch after this list).
* **Results:**
  * **Accuracy:** ~40% WER on `test-clean` (100 files). Far better than streaming (76% WER), though worse than the full-context batch mode due to chunking.
  * **Latency:** ~1.3 s (chunk size) + processing time.
  * **Performance:** ~23x real-time factor (RTFx) on M2.
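In outline, the chunking loop looks like the following. This is a minimal sketch only: `ShortBatchTranscriber`, `DecoderState`, `runEncoder`, and `runRnntDecoder` are hypothetical placeholders, not the FluidAudio API; the real implementation lives in `BatchEouAsrManager.swift`.

```swift
import Foundation

// Hypothetical stand-in for the RNNT decoder's recurrent state; the real
// CoreML-backed types live in BatchEouAsrManager.swift.
struct DecoderState {
    var hidden: [Float]  // LSTM hidden state, carried across chunks
    var cell: [Float]    // LSTM cell state, carried across chunks
}

final class ShortBatchTranscriber {
    /// 1.28 s of 16 kHz audio = 20,480 samples = 128 Mel frames.
    static let chunkSamples = 20_480

    private var buffer: [Float] = []
    private var decoderState = DecoderState(hidden: [], cell: [])
    private(set) var transcript = ""

    /// Feed incoming audio; run the batch encoder whenever a full chunk is buffered.
    func append(_ samples: [Float]) {
        buffer.append(contentsOf: samples)
        while buffer.count >= Self.chunkSamples {
            let chunk = Array(buffer.prefix(Self.chunkSamples))
            buffer.removeFirst(Self.chunkSamples)

            // The encoder sees only this 1.28 s chunk; the decoder state is
            // what preserves linguistic context across chunk boundaries.
            let encoded = runEncoder(chunk)
            let (text, newState) = runRnntDecoder(encoded, state: decoderState)
            decoderState = newState
            transcript += text
        }
    }

    // Placeholders for the CoreML encoder/decoder invocations.
    private func runEncoder(_ chunk: [Float]) -> [Float] { chunk }
    private func runRnntDecoder(_ encoded: [Float], state: DecoderState) -> (String, DecoderState) {
        ("", state)
    }
}
```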
## Usage

### Swift (Production)

Use `BatchEouAsrManager` for all transcription.

```swift
let manager = BatchEouAsrManager()
await manager.initialize()
let result = try await manager.transcribe(audioSamples)
```

### Benchmarking

* **Short Batch (Working):**

  ```bash
  swift run -c release fluidaudio batch-eou-benchmark --subset test-clean --max-files 100
  ```

* **Authentic Streaming (Broken - for demo only):**

  ```bash
  swift run -c release fluidaudio eou-benchmark --streaming --chunk-duration 160
  ```

## Model Export

To re-export the Short Batch model:

```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
  --output-dir Models/ParakeetEOU/ShortBatch \
  --model-id nvidia/parakeet-realtime-eou-120m-v1
```
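As a quick sanity check after export, the model can be loaded from Swift with plain CoreML. A minimal sketch, assuming the exported package is named `Encoder.mlpackage` under the `--output-dir` above (the filename is an assumption):

```swift
import CoreML

// The .mlpackage filename is an assumption; use whatever
// convert_split_encoder.py actually writes under Models/ParakeetEOU/ShortBatch.
let packageURL = URL(fileURLWithPath: "Models/ParakeetEOU/ShortBatch/Encoder.mlpackage")

do {
    // .mlpackage sources must be compiled to .mlmodelc before loading.
    let compiledURL = try MLModel.compileModel(at: packageURL)

    let config = MLModelConfiguration()
    config.computeUnits = .all  // let CoreML use the Neural Engine on Apple Silicon

    let encoder = try MLModel(contentsOf: compiledURL, configuration: config)
    print(encoder.modelDescription.inputDescriptionsByFeatureName.keys)
} catch {
    print("Failed to load model: \(error)")
}
```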