| --- |
| title: Parakeet STT Progressive Transcription |
| emoji: π€ |
| colorFrom: blue |
| colorTo: purple |
| sdk: static |
| pinned: false |
| custom_headers: |
| cross-origin-embedder-policy: credentialless |
| cross-origin-opener-policy: same-origin |
| cross-origin-resource-policy: cross-origin |
| --- |
| |
| # Parakeet STT Progressive Transcription Demo |
|
|
| Real-time speech recognition with smart progressive streaming, powered by **Parakeet TDT 0.6B v3** (ONNX) via [parakeet.js](https://github.com/ysdede/parakeet.js) and WebGPU acceleration. |
|
|
|
|
| ## Features |
|
|
| - **π€ Parakeet TDT 0.6B v3**: NVIDIA's multilingual speech recognition model |
| - 25 European languages supported |
| - Word-level timestamps and confidence scores |
| - WebGPU accelerated inference |
|
|
| - **β‘ Smart Progressive Streaming**: Intelligent window management with sentence-aware boundaries |
| - Growing window (0-15s) for accuracy |
| - Sentence-aware sliding window (>15s) to maintain context |
| - Real-time updates every 500ms |
|
|
| - **π Privacy-First**: All processing happens locally in your browser - no data sent to servers |
|
|
| - **π¨ Visual Feedback**: |
| - Yellow text: Fixed sentences (completed, won't change) |
| - Cyan text: Active transcription (in-progress) |
|
|
| - **π Developer Metrics**: Real-time performance monitoring |
| - Latency and Real-time Factor (RTF) |
| - Window state visualization |
| - Memory usage tracking |
| - Confidence scores |
|
|
| ## Tech Stack |
|
|
| - **Model**: [Parakeet TDT 0.6B v3 (ONNX)](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx) |
| - **Inference**: [parakeet.js](https://www.npmjs.com/package/parakeet.js) + [ONNX Runtime Web](https://onnxruntime.ai/docs/tutorials/web/) |
| - **Framework**: React 18 + Vite |
| - **Styling**: Tailwind CSS |
|
|
| ## Usage |
|
|
| 1. **Load Model**: Click "Load Model" to download Parakeet (~2.5GB, one-time download) |
| 2. **Start Recording**: Click "Start Recording" and grant microphone permissions |
| 3. **Speak**: Watch real-time progressive transcriptions appear |
| 4. **Stop Recording**: Click "Stop Recording" to finalize the transcription |
|
|
| ## How It Works |
|
|
| ### Progressive Streaming Algorithm |
|
|
| This demo implements the smart progressive streaming algorithm from the [speech-to-speech repository](https://github.com/huggingface/speech-to-speech): |
|
|
| 1. **Growing Window (0-15s)**: |
| - Accumulates audio for better accuracy |
| - Re-transcribes entire buffer every 500ms |
|
|
| 2. **Sliding Window (>15s)**: |
| - Locks completed sentences as "fixed" |
| - Only re-transcribes active portion (last 2s) |
| - Prevents memory growth and maintains accuracy |
|
|
| ### Architecture |
|
|
| ``` |
| User Microphone |
| β |
| Web Audio API (16kHz) |
| β |
| Audio Processor (accumulate chunks) |
| β |
| Progressive Streaming Handler (500ms updates) |
| β |
| Web Worker β Parakeet ONNX Model (via parakeet.js + WebGPU) |
| β |
| Transcription Display (yellow fixed + cyan active) |
| ``` |
|
|
| ## Model Information |
|
|
| - **Model**: Parakeet TDT 0.6B v3 |
| - **Format**: ONNX (optimized for web via parakeet.js) |
| - **Size**: ~2.5GB |
| - **Languages**: 25 European languages (EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, CS, SK, HU, RO, BG, HR, SL, SR, DA, NO, SV, FI, ET, LV, LT) |
| - **Sample Rate**: 16kHz |
| - **Architecture**: Conformer encoder + RNN-Transducer decoder |
|
|
| ## Browser Compatibility |
|
|
| | Browser | WebGPU Support | Status | |
| |---------|----------------|--------| |
| | Chrome 113+ | β
Yes | Full support | |
| | Edge 113+ | β
Yes | Full support | |
| | Firefox | β οΈ Limited | WASM fallback | |
| | Safari | β οΈ Limited | WASM fallback | |
|
|
| ## Performance |
|
|
| - **First result**: <500ms latency |
| - **Progressive updates**: 500ms cadence |
| - **RTF (Real-time Factor)**: ~0.3-0.5x with WebGPU |
| - **Model loading**: 1-2 minutes (one-time, cached locally) |
|
|
| **Note**: Browser-based inference is inherently slower than native implementations. For comparison, the Python MLX implementation achieves ~60x faster performance on Apple Silicon. This is a fundamental limitation of running large models in browsers. |
|
|
| ## Credits |
|
|
| - **Progressive Streaming Algorithm**: [speech-to-speech/STT/smart_progressive_streaming.py](https://github.com/huggingface/speech-to-speech/blob/main/STT/smart_progressive_streaming.py) |
| - **Parakeet.js**: [ysdede/parakeet.js](https://github.com/ysdede/parakeet.js) |
| - **ONNX Model**: [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx) |
| - **Original Model**: NVIDIA Parakeet TDT 0.6B v3 |
|
|
| ## License |
|
|
| MIT |
|
|
| ## References |
|
|
| - [Parakeet.js Documentation](https://github.com/ysdede/parakeet.js) |
| - [Parakeet.js Live Demo](https://huggingface.co/spaces/ysdede/parakeet.js-demo) |
| - [Original Python Implementation](https://github.com/huggingface/speech-to-speech) |
|
|