---
license: other
license_name: nvidia-open-model-license
license_link: >-
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
---
# Parakeet EOU Integration & Findings
This directory contains the scripts and documentation for integrating the NVIDIA Parakeet Realtime EOU 120M model into FluidAudio.
## Executive Summary
* **Goal:** Enable low-latency, streaming speech recognition with End-of-Utterance (EOU) detection on Apple Silicon.
* **Result:** The "Authentic Streaming" mode of the `nvidia/parakeet-realtime-eou-120m-v1` model is **fundamentally broken** (produces garbage output).
* **Solution:** We implemented a **"Short Batch" strategy**. We use the model's **Batch Encoder** (which works correctly) with small, fixed-size input chunks (1.28s). This yields usable accuracy (~40% WER, versus 76% for streaming) with streaming-like latency (~1.3s).
## Directory Structure
* `Conversion/`: Scripts to export the PyTorch model to CoreML.
* `convert_split_encoder.py`: **(Primary)** Exports the "Short Batch" model (1.28s chunks).
* `convert_parakeet_eou.py`: Original export script.
* `individual_components.py`: Shared model definitions.
* `Inference/`: Scripts to test and verify the model in Python.
* `test_full_pytorch_streaming.py`: **(Proof)** Demonstrates that the original PyTorch model fails in streaming mode.
* `debug_nemo_streaming.py`: Debug script for streaming logic.
## The Journey & Findings
### 1. The Streaming Failure
We initially attempted to use the model's native streaming encoder (`CacheAwareStreamingConfig`).
* **Observation:** The model produced garbage output (e.g., "z", "znions", "arsith") regardless of the input audio.
* **Investigation:**
* We verified the CoreML export numerically against PyTorch (it matched).
* We implemented audio buffering (NeMo-style) to fix edge artifacts.
* We tested various chunk sizes (160ms, 320ms, 640ms, 1280ms).
* **Root Cause:** We ran `test_full_pytorch_streaming.py` using the *original* NeMo library and model. It *also* produced garbage. This confirmed that the **model weights themselves** are likely untrained or incompatible with the streaming configuration exposed in the checkpoint.
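The "verified the CoreML export numerically against PyTorch" step above amounts to a tolerance comparison on encoder outputs. A minimal sketch, with simulated stand-in arrays in place of real CoreML and PyTorch forward passes:

```python
import numpy as np

# Stand-ins for the two backends' encoder outputs on the same Mel input;
# in practice these would come from a coremltools prediction and a PyTorch
# forward pass over identical features.
pytorch_out = np.random.default_rng(0).standard_normal((1, 512, 128)).astype(np.float32)
coreml_out = pytorch_out + np.float32(1e-6)  # simulated near-identical export

# Parity check: the max absolute difference should sit within float32 noise.
max_abs_diff = float(np.abs(pytorch_out - coreml_out).max())
print(f"max |diff| = {max_abs_diff:.2e}")
assert max_abs_diff < 1e-3, "export diverges from PyTorch"
```

When a check like this passes but the end-to-end output is still garbage, the export is exonerated and suspicion shifts to the weights or configuration, which is exactly the conclusion reached here.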
### 2. The "Short Batch" Solution
Since the **Batch Encoder** (FastConformer) works correctly (WER ~3-4% on clean audio), we pivoted to using it for pseudo-streaming.
* **Method:** We re-exported the Batch Encoder to accept a fixed input size of **128 Mel frames (1.28 seconds)**.
* **Implementation:** `BatchEouAsrManager.swift` accumulates audio, feeds 1.28s chunks to the encoder, and preserves the RNNT Decoder's state (LSTM hidden/cell states) between chunks to maintain context.
* **Results:**
* **Accuracy:** ~40% WER on `test-clean` (100 files). Far better than Authentic Streaming (76% WER), though worse than the full-context Batch Encoder (~3-4% WER) because chunking limits the encoder's context.
* **Latency:** ~1.3s (chunk size) + processing time.
* **Performance:** ~23x Real-Time Factor (RTFx) on M2.
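The accumulate-and-flush chunking that `BatchEouAsrManager.swift` performs can be sketched in Python. The class and numbers below are illustrative, assuming 16 kHz input: 128 Mel frames at a 10 ms hop is 1.28 s, i.e. 20,480 samples per chunk.

```python
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = int(1.28 * SAMPLE_RATE)  # 128 Mel frames @ 10 ms hop -> 20_480 samples

class ShortBatchChunker:
    """Accumulates incoming audio and emits fixed 1.28 s chunks.

    Mirrors the accumulate-and-flush behaviour described for
    BatchEouAsrManager; in the real pipeline each emitted chunk goes to the
    Batch Encoder, and the RNNT decoder's LSTM state is carried across chunks.
    """

    def __init__(self):
        self.buffer = []

    def feed(self, samples):
        self.buffer.extend(samples)
        while len(self.buffer) >= CHUNK_SAMPLES:
            chunk = self.buffer[:CHUNK_SAMPLES]
            self.buffer = self.buffer[CHUNK_SAMPLES:]
            yield chunk

chunker = ShortBatchChunker()
# Feed 0.5 s then 1.0 s of (silent) audio: 24_000 samples total.
chunks = [c for block in ([0.0] * 8_000, [0.0] * 16_000) for c in chunker.feed(block)]
print(len(chunks), len(chunker.buffer))  # one full chunk emitted, 3_520 samples pending
```

Note that RTFx here is audio duration divided by processing time, so ~23x means each 1.28 s chunk is processed in roughly 56 ms, well under the chunk interval.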
## Usage
### Swift (Production)
Use `BatchEouAsrManager` for all transcription.
```swift
let manager = BatchEouAsrManager()
await manager.initialize()
let result = try await manager.transcribe(audioSamples)
```
### Benchmarking
* **Short Batch (Working):**
```bash
swift run -c release fluidaudio batch-eou-benchmark --subset test-clean --max-files 100
```
* **Authentic Streaming (Broken - for demo only):**
```bash
swift run -c release fluidaudio eou-benchmark --streaming --chunk-duration 160
```
## Model Export
To re-export the Short Batch model:
```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
--output-dir Models/ParakeetEOU/ShortBatch \
--model-id nvidia/parakeet-realtime-eou-120m-v1
```