title: Parakeet STT Progressive Transcription
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
custom_headers:
  cross-origin-embedder-policy: credentialless
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin
Parakeet STT Progressive Transcription Demo
Real-time speech recognition with smart progressive streaming, powered by Parakeet TDT 0.6B v3 (ONNX) via parakeet.js and WebGPU acceleration.
Features
🎤 Parakeet TDT 0.6B v3: NVIDIA's multilingual speech recognition model
- 25 European languages supported
- Word-level timestamps and confidence scores
- WebGPU accelerated inference
⚡ Smart Progressive Streaming: Intelligent window management with sentence-aware boundaries
- Growing window (0-15s) for accuracy
- Sentence-aware sliding window (>15s) to maintain context
- Real-time updates every 500ms
🔒 Privacy-First: All processing happens locally in your browser; no audio is sent to a server
🎨 Visual Feedback:
- Yellow text: Fixed sentences (completed, won't change)
- Cyan text: Active transcription (in-progress)
📊 Developer Metrics: Real-time performance monitoring
- Latency and Real-time Factor (RTF)
- Window state visualization
- Memory usage tracking
- Confidence scores
Tech Stack
- Model: Parakeet TDT 0.6B v3 (ONNX)
- Inference: parakeet.js + ONNX Runtime Web
- Framework: React 18 + Vite
- Styling: Tailwind CSS
Usage
- Load Model: Click "Load Model" to download Parakeet (~2.5GB, one-time download)
- Start Recording: Click "Start Recording" and grant microphone permissions
- Speak: Watch real-time progressive transcriptions appear
- Stop Recording: Click "Stop Recording" to finalize the transcription
How It Works
Progressive Streaming Algorithm
This demo implements the smart progressive streaming algorithm from the speech-to-speech repository:
Growing Window (0-15s):
- Accumulates audio for better accuracy
- Re-transcribes entire buffer every 500ms
Sliding Window (>15s):
- Locks completed sentences as "fixed"
- Only re-transcribes active portion (last 2s)
- Prevents memory growth and maintains accuracy
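The two modes above can be sketched as a single state update. This is a minimal illustration, not the code from the speech-to-speech repository: the names `advanceWindow`, `WindowState`, and `TimedSentence` are hypothetical, and treating "last 2s" as a tail that is never locked is an assumption about how the active portion is chosen.

```typescript
// Hypothetical sketch of growing vs. sliding window management.
// Thresholds mirror the numbers quoted in this README.
const GROWING_LIMIT_SEC = 15; // below this, the whole buffer is re-transcribed
const ACTIVE_TAIL_SEC = 2;    // assumption: audio this recent is never locked

interface TimedSentence {
  text: string;
  endSec: number; // end time on the recording timeline, in seconds
}

interface WindowState {
  fixed: string[];        // sentences locked as "fixed" (shown in yellow)
  windowStartSec: number; // where the active window starts in the recording
}

/**
 * Decide whether to keep growing the window or to slide it forward.
 * `sentences` are the sentences recognised in the current window,
 * assumed sorted by `endSec`.
 */
function advanceWindow(
  state: WindowState,
  sentences: TimedSentence[],
  totalSec: number,
): WindowState {
  // Growing mode: the window is still short enough to re-transcribe whole.
  if (totalSec - state.windowStartSec <= GROWING_LIMIT_SEC) return state;

  // Sliding mode: lock sentences that ended before the active tail.
  const cutoff = totalSec - ACTIVE_TAIL_SEC;
  const lockable = sentences.filter((s) => s.endSec <= cutoff);
  if (lockable.length === 0) return state; // no sentence boundary to cut at yet

  const lastLocked = lockable[lockable.length - 1];
  return {
    fixed: [...state.fixed, ...lockable.map((s) => s.text)],
    windowStartSec: lastLocked.endSec, // next window starts at a sentence boundary
  };
}
```

Cutting at a sentence end rather than at a fixed offset is what keeps the next window from starting mid-word.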
Architecture
User Microphone
↓
Web Audio API (16kHz)
↓
Audio Processor (accumulate chunks)
↓
Progressive Streaming Handler (500ms updates)
↓
Web Worker → Parakeet ONNX Model (via parakeet.js + WebGPU)
↓
Transcription Display (yellow fixed + cyan active)
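As a rough illustration of the "Audio Processor" and "Progressive Streaming Handler" stages, the sketch below accumulates 16kHz chunks and signals when another 500ms of audio has arrived. The class name `AudioAccumulator` and its methods are hypothetical and not taken from this demo's source. It also measures the cadence in audio samples, not with a wall-clock timer, which is an assumption.

```typescript
// Hypothetical sketch: accumulate microphone chunks and trigger 500ms updates.
const SAMPLE_RATE = 16000;       // Hz, as stated in the pipeline above
const UPDATE_INTERVAL_SEC = 0.5; // progressive update cadence

class AudioAccumulator {
  private chunks: Float32Array[] = [];
  private totalSamples = 0;
  private samplesAtLastUpdate = 0;

  /** Append a chunk. Returns true once 500ms of new audio has arrived since the last update. */
  push(chunk: Float32Array): boolean {
    this.chunks.push(chunk);
    this.totalSamples += chunk.length;
    const newSamples = this.totalSamples - this.samplesAtLastUpdate;
    const due = newSamples >= SAMPLE_RATE * UPDATE_INTERVAL_SEC;
    if (due) this.samplesAtLastUpdate = this.totalSamples;
    return due;
  }

  /** Merge all chunks and return the audio from `fromSec` onwards. */
  snapshot(fromSec = 0): Float32Array {
    const merged = new Float32Array(this.totalSamples);
    let offset = 0;
    for (const chunk of this.chunks) {
      merged.set(chunk, offset);
      offset += chunk.length;
    }
    return merged.subarray(Math.floor(fromSec * SAMPLE_RATE));
  }

  /** Total recorded duration in seconds. */
  get durationSec(): number {
    return this.totalSamples / SAMPLE_RATE;
  }
}
```

In the pipeline above, a `true` result from `push` is the point at which a snapshot would be posted to the Web Worker for transcription.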
Model Information
- Model: Parakeet TDT 0.6B v3
- Format: ONNX (optimized for web via parakeet.js)
- Size: ~2.5GB
- Languages: 25 European languages (EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, CS, SK, HU, RO, BG, HR, SL, EL, DA, MT, SV, FI, ET, LV, LT)
- Sample Rate: 16kHz
- Architecture: FastConformer encoder + TDT (Token-and-Duration Transducer) decoder
Browser Compatibility
| Browser | WebGPU Support | Status |
|---|---|---|
| Chrome 113+ | ✅ Yes | Full support |
| Edge 113+ | ✅ Yes | Full support |
| Firefox | ⚠️ Limited | WASM fallback |
| Safari | ⚠️ Limited | WASM fallback |
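The fallback in the table can be approximated with feature detection. The sketch below uses the standard `navigator.gpu.requestAdapter()` call from the WebGPU API. The function `pickBackend` and the `NavigatorLike` type are hypothetical names, and how the chosen backend is handed to parakeet.js is not shown because it depends on that library's API.

```typescript
// Hypothetical sketch: choose WebGPU when an adapter is available, else WASM.
type Backend = "webgpu" | "wasm";

// Minimal structural types so the function accepts `navigator` or a test stub.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  // Browsers without WebGPU do not expose `navigator.gpu` at all.
  if (!nav.gpu) return "wasm";
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a browser this would be called as `pickBackend(navigator)`. Checking for an adapter as well as for `navigator.gpu` covers browsers that expose the API but have no usable GPU.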
Performance
- First result: <500ms latency
- Progressive updates: 500ms cadence
- RTF (Real-time Factor): ~0.3-0.5x with WebGPU
- Model loading: 1-2 minutes (one-time, cached locally)
Note: Browser-based inference is inherently slower than native implementations. For comparison, the Python MLX implementation achieves ~60x faster performance on Apple Silicon. This is a fundamental limitation of running large models in browsers.
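For reference, the Real-time Factor is processing time divided by audio duration, so values below 1 mean transcription runs faster than real time. The helpers below are a minimal sketch with hypothetical names (`realTimeFactor`, `measureRtf`). The `transcribe` callback stands in for whatever call runs the model and is not a parakeet.js API.

```typescript
/** RTF = time spent processing / duration of the audio processed. */
function realTimeFactor(processingMs: number, audioMs: number): number {
  if (audioMs <= 0) throw new RangeError("audioMs must be positive");
  return processingMs / audioMs;
}

/**
 * Time one transcription call and report latency and RTF.
 * `transcribe` is a placeholder for the actual model invocation.
 */
async function measureRtf<T>(
  audio: Float32Array,
  sampleRate: number,
  transcribe: (audio: Float32Array) => Promise<T>,
): Promise<{ result: T; latencyMs: number; rtf: number }> {
  const start = Date.now();
  const result = await transcribe(audio);
  const latencyMs = Date.now() - start;
  const audioMs = (audio.length / sampleRate) * 1000;
  return { result, latencyMs, rtf: realTimeFactor(latencyMs, audioMs) };
}
```

For example, a 10-second window that takes 4 seconds to transcribe gives `realTimeFactor(4000, 10000)`, an RTF of 0.4, which falls inside the range quoted above.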
Credits
- Progressive Streaming Algorithm: speech-to-speech/STT/smart_progressive_streaming.py
- Parakeet.js: ysdede/parakeet.js
- ONNX Model: istupakov/parakeet-tdt-0.6b-v3-onnx
- Original Model: NVIDIA Parakeet TDT 0.6B v3
License
MIT