---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
metrics:
- wer
library_name: nemo
tags:
- speech-recognition
- FastConformer
- end-of-utterance
- voice agent
---

# Parakeet EOU Integration & Findings

This directory contains the scripts and documentation for integrating the NVIDIA Parakeet Realtime EOU 120M model into FluidAudio.

## Executive Summary

* **Goal:** Enable low-latency, streaming speech recognition with End-of-Utterance (EOU) detection on Apple Silicon.
* **Result:** The "Authentic Streaming" mode of the `nvidia/parakeet-realtime-eou-120m-v1` model is **fundamentally broken**: it produces garbage output regardless of input.
* **Solution:** We implemented a **"Short Batch"** strategy: the model's **Batch Encoder** (which works correctly) is fed small, fixed-size input chunks (1.28 s). This yields far better accuracy (~40% WER vs. ~76% for streaming) with streaming-like latency (~1.3 s).
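
The chunk size is also the latency floor: 128 Mel frames at a 10 ms hop span 1.28 s, i.e. 20,480 samples at 16 kHz. A minimal sketch of a fixed-size chunker (illustrative only, assuming 16 kHz mono float input; this is not the FluidAudio implementation):

```python
import numpy as np

SAMPLE_RATE = 16_000                              # assumed input rate (Hz)
CHUNK_SECONDS = 1.28                              # 128 Mel frames x 10 ms hop
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_SECONDS)  # 20480 samples

def chunks(audio: np.ndarray):
    """Yield fixed 1.28 s chunks, zero-padding the final partial chunk."""
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        yield chunk

three_seconds = np.zeros(SAMPLE_RATE * 3, dtype=np.float32)
print([len(c) for c in chunks(three_seconds)])  # [20480, 20480, 20480]
```

Because every chunk has the same shape, the CoreML encoder can be compiled for a single fixed input size, which is what makes the "Short Batch" export possible.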

## Directory Structure

* `Conversion/`: Scripts to export the PyTorch model to CoreML.
  * `convert_split_encoder.py`: **(Primary)** Exports the "Short Batch" model (1.28 s chunks).
  * `convert_parakeet_eou.py`: Original export script.
  * `individual_components.py`: Shared model definitions.
* `Inference/`: Scripts to test and verify the model in Python.
  * `test_full_pytorch_streaming.py`: **(Proof)** Demonstrates that the original PyTorch model fails in streaming mode.
  * `debug_nemo_streaming.py`: Debug script for streaming logic.

## The Journey & Findings

### 1. The Streaming Failure

We initially attempted to use the model's native streaming encoder (`CacheAwareStreamingConfig`).

* **Observation:** The model produced garbage output (e.g., "z", "znions", "arsith") regardless of the input audio.
* **Investigation:**
  * We verified the CoreML export numerically against PyTorch (the outputs matched).
  * We implemented NeMo-style audio buffering to eliminate chunk-edge artifacts.
  * We tested various chunk sizes (160 ms, 320 ms, 640 ms, 1280 ms).
* **Root Cause:** We ran `test_full_pytorch_streaming.py` using the *original* NeMo library and model, and it *also* produced garbage. This confirmed that the **model weights themselves** are likely untrained for, or incompatible with, the streaming configuration exposed in the checkpoint.
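
The numerical verification mentioned above boils down to an element-wise comparison of encoder outputs. A sketch of that check (the placeholder arrays stand in for the real CoreML and PyTorch outputs, and the tolerance is illustrative):

```python
import numpy as np

def outputs_match(coreml_out: np.ndarray, torch_out: np.ndarray,
                  atol: float = 1e-3) -> bool:
    """Return True if the two encoder outputs agree within tolerance."""
    diff = float(np.max(np.abs(coreml_out - torch_out)))
    print(f"max |diff| = {diff:.6g}")
    return diff <= atol

# Placeholders for the two encoders' outputs on identical input.
ref = np.random.default_rng(0).standard_normal((1, 128, 512)).astype(np.float32)
exported = ref + 1e-5  # tiny float32 round-off, as expected from a faithful export
print(outputs_match(exported, ref))  # True: export matches within tolerance
```

A check like this passing, while transcriptions remain garbage, is what shifted suspicion away from the export pipeline and onto the checkpoint itself.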

### 2. The "Short Batch" Solution

Since the **Batch Encoder** (FastConformer) works correctly (~3-4% WER on clean audio), we pivoted to using it for pseudo-streaming.

* **Method:** We re-exported the Batch Encoder to accept a fixed input size of **128 Mel frames (1.28 seconds)**.
* **Implementation:** `BatchEouAsrManager.swift` accumulates audio, feeds 1.28 s chunks to the encoder, and preserves the RNNT Decoder's state (LSTM hidden/cell states) between chunks to maintain context.
* **Results:**
  * **Accuracy:** ~40% WER on `test-clean` (100 files). Much better than streaming (~76% WER), though worse than full-context batch decoding due to chunking.
  * **Latency:** ~1.3 s (one chunk) plus processing time.
  * **Performance:** ~23x real-time factor (RTFx) on M2.
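
The implementation described above can be sketched as follows, with stub `encode`/`decode` functions standing in for the CoreML encoder and RNNT decoder (names, shapes, and the 160x stub downsampling are illustrative, not the FluidAudio API); the key point is that the LSTM state is threaded through every chunk:

```python
import numpy as np

CHUNK = 20480  # 1.28 s at 16 kHz

def encode(chunk: np.ndarray) -> np.ndarray:
    """Stub for the fixed-size Batch Encoder (real model runs under CoreML)."""
    return chunk.reshape(-1, 160).mean(axis=1)  # 20480 samples -> 128 fake frames

def decode(frames: np.ndarray, state):
    """Stub RNNT decode step that consumes frames and threads (h, c) state."""
    h, c = state
    h = h + frames.sum()  # pretend LSTM state update
    return f"<{len(frames)} frames>", (h, c)

def transcribe(audio: np.ndarray) -> str:
    state = (np.zeros(1), np.zeros(1))  # LSTM hidden/cell, kept across chunks
    pieces = []
    for start in range(0, len(audio), CHUNK):
        chunk = audio[start:start + CHUNK]
        if len(chunk) < CHUNK:
            chunk = np.pad(chunk, (0, CHUNK - len(chunk)))
        text, state = decode(encode(chunk), state)  # state carries context over edges
        pieces.append(text)
    return " ".join(pieces)

print(transcribe(np.zeros(40000, dtype=np.float32)))  # <128 frames> <128 frames>
```

Only the encoder sees hard 1.28 s boundaries; the carried decoder state maintains context across them, while the encoder's truncated context remains the likely source of the WER gap versus full-context batch decoding.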

## Usage

### Swift (Production)

Use `BatchEouAsrManager` for all transcription.

```swift
let manager = BatchEouAsrManager()
await manager.initialize()
let result = try await manager.transcribe(audioSamples)
```

### Benchmarking

* **Short Batch (working):**

  ```bash
  swift run -c release fluidaudio batch-eou-benchmark --subset test-clean --max-files 100
  ```

* **Authentic Streaming (broken; for demonstration only):**

  ```bash
  swift run -c release fluidaudio eou-benchmark --streaming --chunk-duration 160
  ```

## Model Export

To re-export the Short Batch model:

```bash
python3 Scripts/ParakeetEOU/Conversion/convert_split_encoder.py \
  --output-dir Models/ParakeetEOU/ShortBatch \
  --model-id nvidia/parakeet-realtime-eou-120m-v1
```