---
license: other
license_name: nvidia-open-model-license
license_link: >-
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
---
# Parakeet EOU Integration & Findings
This directory contains the scripts and documentation for integrating the NVIDIA Parakeet Realtime EOU 120M model into FluidAudio.
## Executive Summary
* **Goal:** Enable low-latency, streaming speech recognition with End-of-Utterance (EOU) detection on Apple Silicon.
* **Result:** The "Authentic Streaming" mode of the `nvidia/parakeet-realtime-eou-120m-v1` model is **fundamentally broken** (produces garbage output).
* **Solution:** We implemented a **"Short Batch" strategy**. We use the model's **Batch Encoder** (which works correctly) with small, fixed-size input chunks (1.28s). This yields usable accuracy (~40% WER, versus 76% for streaming) with streaming-like latency (~1.3s).
## Directory Structure
* `Conversion/`: Scripts to export the PyTorch model to CoreML.
* `convert_split_encoder.py`: **(Primary)** Exports the "Short Batch" model (1.28s chunks).
* `convert_parakeet_eou.py`: Original export script.
* `individual_components.py`: Shared model definitions.
* `Inference/`: Scripts to test and verify the model in Python.
* `test_full_pytorch_streaming.py`: **(Proof)** Demonstrates that the original PyTorch model fails in streaming mode.
* `debug_nemo_streaming.py`: Debug script for streaming logic.
## The Journey & Findings
### 1. The Streaming Failure
We initially attempted to use the model's native streaming encoder (`CacheAwareStreamingConfig`).
* **Observation:** The model produced garbage output (e.g., "z", "znions", "arsith") regardless of the input audio.
* **Investigation:**
* We verified the CoreML export numerically against PyTorch (it matched).
* We implemented audio buffering (NeMo-style) to fix edge artifacts.
* We tested various chunk sizes (160ms, 320ms, 640ms, 1280ms).
* **Root Cause:** We ran `test_full_pytorch_streaming.py` using the *original* NeMo library and model. It *also* produced garbage. This confirmed that the **model weights themselves** are likely untrained or incompatible with the streaming configuration exposed in the checkpoint.
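The "verified the CoreML export numerically against PyTorch" step above amounts to a tolerance comparison on encoder outputs. A minimal sketch, with simulated stand-in arrays in place of real CoreML and PyTorch forward passes:

```python
import numpy as np

# Stand-ins for the two backends' encoder outputs on the same Mel input;
# in practice these would come from a coremltools prediction and a PyTorch
# forward pass over identical features.
pytorch_out = np.random.default_rng(0).standard_normal((1, 512, 128)).astype(np.float32)
coreml_out = pytorch_out + np.float32(1e-6)  # simulated near-identical export

# Parity check: the max absolute difference should sit within float32 noise.
max_abs_diff = float(np.abs(pytorch_out - coreml_out).max())
print(f"max |diff| = {max_abs_diff:.2e}")
assert max_abs_diff < 1e-3, "export diverges from PyTorch"
```

When a check like this passes but the end-to-end output is still garbage, the export is exonerated and suspicion shifts to the weights or configuration, which is exactly the conclusion reached here.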
### 2. The "Short Batch" Solution
Since the **Batch Encoder** (FastConformer) works correctly (WER ~3-4% on clean audio), we pivoted to using it for pseudo-streaming.
* **Method:** We re-exported the Batch Encoder to accept a fixed input size of **128 Mel frames (1.28 seconds)**.
* **Implementation:** `BatchEouAsrManager.swift` accumulates audio, feeds 1.28s chunks to the encoder, and preserves the RNNT Decoder's state (LSTM hidden/cell states) between chunks to maintain context.
* **Results:**
* **Accuracy:** ~40% WER on `test-clean` (100 files). Far better than Authentic Streaming (76% WER), though worse than the full-context Batch Encoder (~3-4% WER) because chunking limits the encoder's context.
* **Latency:** ~1.3s (chunk size) + processing time.
* **Performance:** ~23x Real-Time Factor (RTFx) on M2.
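The accumulate-and-flush chunking that `BatchEouAsrManager.swift` performs can be sketched in Python. The class and numbers below are illustrative, assuming 16 kHz input: 128 Mel frames at a 10 ms hop is 1.28 s, i.e. 20,480 samples per chunk.

```python
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = int(1.28 * SAMPLE_RATE)  # 128 Mel frames @ 10 ms hop -> 20_480 samples

class ShortBatchChunker:
    """Accumulates incoming audio and emits fixed 1.28 s chunks.

    Mirrors the accumulate-and-flush behaviour described for
    BatchEouAsrManager; in the real pipeline each emitted chunk goes to the
    Batch Encoder, and the RNNT decoder's LSTM state is carried across chunks.
    """

    def __init__(self):
        self.buffer = []

    def feed(self, samples):
        self.buffer.extend(samples)
        while len(self.buffer) >= CHUNK_SAMPLES:
            chunk = self.buffer[:CHUNK_SAMPLES]
            self.buffer = self.buffer[CHUNK_SAMPLES:]
            yield chunk

chunker = ShortBatchChunker()
# Feed 0.5 s then 1.0 s of (silent) audio: 24_000 samples total.
chunks = [c for block in ([0.0] * 8_000, [0.0] * 16_000) for c in chunker.feed(block)]
print(len(chunks), len(chunker.buffer))  # one full chunk emitted, 3_520 samples pending
```

Note that RTFx here is audio duration divided by processing time, so ~23x means each 1.28 s chunk is processed in roughly 56 ms, well under the chunk interval.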
## Usage
### Swift (Production)
Use `BatchEouAsrManager` for all transcription.
```swift
let manager = BatchEouAsrManager()
await manager.initialize()
let result = try await manager.transcribe(audioSamples)
```
### Benchmarking
* **Short Batch (Working):**
```bash
swift run -c release fluidaudio batch-eou-benchmark --subset test-clean --max-files 100
```
* **Authentic Streaming (Broken - for demo only):**
```bash
swift run -c release fluidaudio eou-benchmark --streaming --chunk-duration 160
```
## Model Export
To re-export the Short Batch model:
```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
--output-dir Models/ParakeetEOU/ShortBatch \
--model-id nvidia/parakeet-realtime-eou-120m-v1
```