---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
pipeline_tag: automatic-speech-recognition
---


# Parakeet EOU Integration & Findings

This directory contains the scripts and documentation for integrating the NVIDIA Parakeet Realtime EOU 120M model into FluidAudio.

## Executive Summary

*   **Goal:** Enable low-latency, streaming speech recognition with End-of-Utterance (EOU) detection on Apple Silicon.
*   **Result:** The "Authentic Streaming" mode of the `nvidia/parakeet-realtime-eou-120m-v1` model is **fundamentally broken** (produces garbage output).
*   **Solution:** We implemented a **"Short Batch" strategy**: we run the model's **Batch Encoder** (which works correctly) on small, fixed-size input chunks (1.28s). This yields far better accuracy than authentic streaming (~40% WER vs. ~76%) with streaming-like latency (~1.3s).

## Directory Structure

*   `Conversion/`: Scripts to export the PyTorch model to CoreML.
    *   `convert_split_encoder.py`: **(Primary)** Exports the "Short Batch" model (1.28s chunks).
    *   `convert_parakeet_eou.py`: Original export script.
    *   `individual_components.py`: Shared model definitions.
*   `Inference/`: Scripts to test and verify the model in Python.
    *   `test_full_pytorch_streaming.py`: **(Proof)** Demonstrates that the original PyTorch model fails in streaming mode.
    *   `debug_nemo_streaming.py`: Debug script for streaming logic.

## The Journey & Findings

### 1. The Streaming Failure
We initially attempted to use the model's native streaming encoder (`CacheAwareStreamingConfig`).
*   **Observation:** The model produced garbage output (e.g., "z", "znions", "arsith") regardless of the input audio.
*   **Investigation:**
    *   We verified the CoreML export numerically against PyTorch (it matched).
    *   We implemented audio buffering (NeMo-style) to fix edge artifacts.
    *   We tested various chunk sizes (160ms, 320ms, 640ms, 1280ms).
*   **Root Cause:** We ran `test_full_pytorch_streaming.py` using the *original* NeMo library and model. It *also* produced garbage. This confirmed that the **model weights themselves** are likely untrained or incompatible with the streaming configuration exposed in the checkpoint.
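For reference, the tested chunk durations map to sample and mel-frame counts as follows. This is an illustrative sketch assuming 16 kHz input audio and a 10 ms mel hop (typical for FastConformer preprocessing); verify the exact values against the model's preprocessor config.

```python
# Convert the tested chunk durations to sample and mel-frame counts.
# Assumes 16 kHz audio and a 10 ms mel hop; these constants are
# illustrative, not read from the export scripts.
SAMPLE_RATE = 16_000   # Hz
HOP_MS = 10            # mel hop in milliseconds

def chunk_geometry(chunk_ms: int) -> tuple[int, int]:
    """Return (samples, mel_frames) for a chunk of `chunk_ms` milliseconds."""
    samples = SAMPLE_RATE * chunk_ms // 1000
    mel_frames = chunk_ms // HOP_MS
    return samples, mel_frames

for ms in (160, 320, 640, 1280):
    samples, frames = chunk_geometry(ms)
    print(f"{ms:>5} ms -> {samples:>6} samples, {frames:>4} mel frames")
```

Note that the 1280 ms chunk corresponds to 128 mel frames, the fixed input size used by the Short Batch export.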

### 2. The "Short Batch" Solution
Since the **Batch Encoder** (FastConformer) works correctly (WER ~3-4% on clean audio), we pivoted to using it for pseudo-streaming.
*   **Method:** We re-exported the Batch Encoder to accept a fixed input size of **128 Mel frames (1.28 seconds)**.
*   **Implementation:** `BatchEouAsrManager.swift` accumulates audio, feeds 1.28s chunks to the encoder, and preserves the RNNT Decoder's state (LSTM hidden/cell states) between chunks to maintain context.
*   **Results:**
    *   **Accuracy:** ~40% WER on `test-clean` (100 files). Much better than authentic streaming (~76% WER), though worse than full-context batch inference (~3-4% WER) because of chunking.
    *   **Latency:** ~1.3s (chunk size) + processing time.
    *   **Performance:** ~23x Real-Time Factor (RTFx) on M2.
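The Short Batch loop can be sketched as follows. This is a minimal illustration of the accumulate/encode/decode-with-state pattern; `run_encoder` and `run_decoder` are hypothetical stand-ins for the exported CoreML models, and the real implementation lives in `BatchEouAsrManager.swift`.

```python
import numpy as np

CHUNK_SAMPLES = 20_480  # 1.28 s at 16 kHz, matching the fixed export size

def run_encoder(chunk: np.ndarray) -> np.ndarray:
    """Placeholder for the fixed-size (1.28 s) CoreML batch encoder."""
    return chunk.reshape(1, -1)  # stand-in encoding

def run_decoder(encoded: np.ndarray, state):
    """Placeholder for an RNNT decoder step; returns (tokens, new_state).
    `state` stands in for the LSTM hidden/cell states carried between chunks."""
    new_state = (state or 0) + 1
    return [f"tok{new_state}"], new_state

def transcribe_pseudo_streaming(audio: np.ndarray) -> list[str]:
    """Accumulate audio, feed fixed 1.28 s chunks to the batch encoder,
    and carry decoder state across chunks to preserve context.
    (Handling of the trailing partial chunk is omitted here.)"""
    tokens, state = [], None
    for start in range(0, len(audio) - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        encoded = run_encoder(chunk)
        new_tokens, state = run_decoder(encoded, state)  # state carries context
        tokens.extend(new_tokens)
    return tokens
```

The key design choice is that only the decoder state crosses chunk boundaries; the encoder sees each 1.28 s window in isolation, which is what costs accuracy relative to full-context batch inference.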

## Usage

### Swift (Production)
Use `BatchEouAsrManager` for all transcription.

```swift
let manager = BatchEouAsrManager()
await manager.initialize()                               // load the CoreML models
let result = try await manager.transcribe(audioSamples)  // audioSamples: mono 16 kHz audio
```

### Benchmarking
*   **Short Batch (Working):**
    ```bash
    swift run -c release fluidaudio batch-eou-benchmark --subset test-clean --max-files 100
    ```
*   **Authentic Streaming (Broken - for demo only):**
    ```bash
    swift run -c release fluidaudio eou-benchmark --streaming --chunk-duration 160
    ```

## Model Export
To re-export the Short Batch model:

```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
  --output-dir Models/ParakeetEOU/ShortBatch \
  --model-id nvidia/parakeet-realtime-eou-120m-v1
```