metadata
title: Parakeet STT Progressive Transcription
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
custom_headers:
  cross-origin-embedder-policy: credentialless
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin

Parakeet STT Progressive Transcription Demo

Real-time speech recognition with smart progressive streaming, powered by Parakeet TDT 0.6B v3 (ONNX) via parakeet.js and WebGPU acceleration.

Features

  • 🎤 Parakeet TDT 0.6B v3: NVIDIA's multilingual speech recognition model

    • 25 European languages supported
    • Word-level timestamps and confidence scores
    • WebGPU accelerated inference
  • ⚡ Smart Progressive Streaming: Intelligent window management with sentence-aware boundaries

    • Growing window (0-15s) for accuracy
    • Sentence-aware sliding window (>15s) to maintain context
    • Real-time updates every 500ms
  • 🔒 Privacy-First: All processing happens locally in your browser - no data is sent to servers

  • 🎨 Visual Feedback:

    • Yellow text: Fixed sentences (completed, won't change)
    • Cyan text: Active transcription (in-progress)
  • 📊 Developer Metrics: Real-time performance monitoring

    • Latency and Real-time Factor (RTF)
    • Window state visualization
    • Memory usage tracking
    • Confidence scores

Tech Stack

Usage

  1. Load Model: Click "Load Model" to download Parakeet (~2.5GB, one-time download)
  2. Start Recording: Click "Start Recording" and grant microphone permissions
  3. Speak: Watch real-time progressive transcriptions appear
  4. Stop Recording: Click "Stop Recording" to finalize the transcription

How It Works

Progressive Streaming Algorithm

This demo implements the smart progressive streaming algorithm from the speech-to-speech repository:

  1. Growing Window (0-15s):

    • Accumulates audio for better accuracy
    • Re-transcribes entire buffer every 500ms
  2. Sliding Window (>15s):

    • Locks completed sentences as "fixed"
    • Only re-transcribes active portion (last 2s)
    • Prevents memory growth and maintains accuracy
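
The two phases above can be sketched as a small window manager. This is an illustration under assumptions, not the demo's actual code: the `updateWindow` function and the `Word` and `WindowState` shapes are hypothetical, and the sentence-boundary rule (a word ending in `.`, `?` or `!`) is assumed. Only the 15 s threshold comes from the description above, and the sketch keeps everything after the last sentence boundary active rather than a fixed 2 s tail.

```typescript
// Hypothetical shapes: word timestamps are in seconds from the start of the recording.
interface Word { text: string; start: number; end: number; }

interface WindowState {
  fixedSentences: string[]; // locked text that will no longer change
  windowStart: number;      // recording time (s) where the active window begins
}

const MAX_WINDOW_SEC = 15; // growing-window limit from the description above

// Assumed rule: a sentence ends at a word that finishes with ".", "?" or "!".
const endsSentence = (w: Word): boolean => /[.?!]$/.test(w.text);

const joinWords = (ws: Word[]): string => ws.map((w) => w.text).join(" ");

function updateWindow(
  state: WindowState,
  words: Word[],        // words transcribed from the current window
  recordedSec: number,  // total audio recorded so far
): { state: WindowState; activeText: string } {
  // Growing phase: the whole window stays active and is re-transcribed.
  if (recordedSec - state.windowStart <= MAX_WINDOW_SEC) {
    return { state, activeText: joinWords(words) };
  }

  // Sliding phase: look for the last completed sentence in the window.
  let cut = -1;
  let cutEnd = state.windowStart;
  for (let i = words.length - 1; i >= 0; i--) {
    const w = words[i];
    if (w !== undefined && endsSentence(w)) {
      cut = i;
      cutEnd = w.end;
      break;
    }
  }
  // No boundary yet: keep the window as it is until a sentence completes.
  if (cut < 0) {
    return { state, activeText: joinWords(words) };
  }

  // Lock everything up to the boundary and start the window right after it.
  return {
    state: {
      fixedSentences: [...state.fixedSentences, joinWords(words.slice(0, cut + 1))],
      windowStart: cutEnd,
    },
    activeText: joinWords(words.slice(cut + 1)),
  };
}
```

After a lock, the caller would discard audio recorded before `windowStart`, which is what keeps the buffer from growing without bound.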

Architecture

User Microphone
     ↓
Web Audio API (16kHz)
     ↓
Audio Processor (accumulate chunks)
     ↓
Progressive Streaming Handler (500ms updates)
     ↓
Web Worker → Parakeet ONNX Model (via parakeet.js + WebGPU)
     ↓
Transcription Display (yellow fixed + cyan active)
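
The "accumulate chunks" and "500ms updates" stages can be illustrated with a small buffer that collects microphone chunks and signals when another 500 ms of audio is available. This is a sketch under assumptions: the `ChunkAccumulator` class and its method names are hypothetical, and in the demo the chunks would arrive from the Web Audio API rather than being pushed by hand.

```typescript
const SAMPLE_RATE = 16000;      // 16 kHz, as in the pipeline above
const UPDATE_INTERVAL_MS = 500; // progressive update cadence

class ChunkAccumulator {
  private chunks: Float32Array[] = [];
  private total = 0;       // samples buffered so far
  private lastUpdate = 0;  // sample count at the most recent update

  // Append a chunk; returns true once at least 500 ms of new audio
  // has arrived since the previous update.
  push(chunk: Float32Array): boolean {
    this.chunks.push(chunk);
    this.total += chunk.length;
    const needed = (SAMPLE_RATE * UPDATE_INTERVAL_MS) / 1000; // 8000 samples
    if (this.total - this.lastUpdate >= needed) {
      this.lastUpdate = this.total;
      return true;
    }
    return false;
  }

  // Concatenate all buffered chunks into one contiguous array for the model.
  snapshot(): Float32Array {
    const out = new Float32Array(this.total);
    let offset = 0;
    for (const c of this.chunks) {
      out.set(c, offset);
      offset += c.length;
    }
    return out;
  }

  get durationSec(): number {
    return this.total / SAMPLE_RATE;
  }
}
```

Counting samples instead of using a timer ties the update cadence to the audio that has actually been captured, so a stalled microphone does not trigger empty re-transcriptions.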

Model Information

  • Model: Parakeet TDT 0.6B v3
  • Format: ONNX (optimized for web via parakeet.js)
  • Size: ~2.5GB
  • Languages: 25 European languages (BG, HR, CS, DA, NL, EN, ET, FI, FR, DE, EL, HU, IT, LV, LT, MT, PL, PT, RO, SK, SL, ES, SV, RU, UK)
  • Sample Rate: 16kHz
  • Architecture: FastConformer encoder + TDT (Token-and-Duration Transducer) decoder

Browser Compatibility

| Browser     | WebGPU Support | Status        |
|-------------|----------------|---------------|
| Chrome 113+ | ✅ Yes         | Full support  |
| Edge 113+   | ✅ Yes         | Full support  |
| Firefox     | ⚠️ Limited     | WASM fallback |
| Safari      | ⚠️ Limited     | WASM fallback |
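
The WASM fallback can be expressed as a feature check. `navigator.gpu` is the standard WebGPU entry point; the `pickBackend` helper and the `"webgpu"` / `"wasm"` labels are hypothetical and are not taken from parakeet.js. The function takes a navigator-like object as a parameter so that it can be exercised outside a browser.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural type: only the property this check reads.
interface NavigatorLike {
  gpu?: unknown;
}

// Choose WebGPU when the browser exposes it, otherwise fall back to WASM.
function pickBackend(nav: NavigatorLike | undefined): Backend {
  const hasGpu = nav !== undefined && nav.gpu !== undefined && nav.gpu !== null;
  return hasGpu ? "webgpu" : "wasm";
}
```

In a browser this would be called with the global `navigator`. A stricter check would also await `navigator.gpu.requestAdapter()`, which can resolve to `null` even when the property exists.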

Performance

  • First result: <500ms latency
  • Progressive updates: 500ms cadence
  • RTF (Real-time Factor): ~0.3-0.5x with WebGPU
  • Model loading: 1-2 minutes (one-time, cached locally)

Note: Browser-based inference is inherently slower than native implementations. For comparison, the Python MLX implementation runs about 60x faster on Apple Silicon. This gap is a fundamental limitation of running large models in browsers.
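
Real-time factor is processing time divided by audio duration, so values below 1 mean transcription finishes faster than the audio plays. A small worked example, using a hypothetical `realTimeFactor` helper and illustrative numbers rather than measurements:

```typescript
// RTF = time spent transcribing / duration of the audio transcribed.
function realTimeFactor(processingMs: number, audioMs: number): number {
  if (audioMs <= 0) {
    throw new RangeError("audioMs must be positive");
  }
  return processingMs / audioMs;
}

// Illustrative numbers: a 10 s window transcribed in 4 s gives an RTF of 0.4,
// which sits inside the ~0.3-0.5 range quoted above.
const rtf = realTimeFactor(4000, 10000);
console.log(rtf.toFixed(2)); // prints "0.40"
```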

Credits

License

MIT

References