metadata
title: Granite Speech WebGPU
emoji: 🎙️
colorFrom: green
colorTo: gray
sdk: static
app_file: index.html
pinned: false

Granite Speech WebGPU

Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with Transformers.js and WebGPU acceleration.

Your audio and transcription never leave your device.

Features

  • Speech-to-Text: Transcribe audio in multiple languages
  • Translation: Translate speech to English, French, German, Spanish, Portuguese, or Japanese
  • Voice Activity Detection: Silero VAD for automatic speech segmentation
  • Punctuation & Capitalization: Automatic post-processing, with the transcript language auto-detected by tinyld
  • Audio Input: Record from microphone or upload/drag-and-drop audio files
  • Real-time Sync: Transcript appears synchronized with audio playback
  • Streaming Output: Partial results displayed as tokens are generated
  • Fully Client-Side: All processing happens in your browser using WebGPU
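As an illustration of the voice activity detection step, the sketch below groups per-frame speech probabilities (the kind of output a VAD model such as Silero produces) into speech segments. The function name, the 0.5 threshold, the 32 ms frame duration, and the silence tolerance are assumptions chosen for illustration; they are not taken from this app's vad.js.

```javascript
// Hypothetical helper: group per-frame speech probabilities into segments.
// All defaults are illustrative assumptions, not values from this project.
function probsToSegments(probs, { threshold = 0.5, frameMs = 32, minSilenceFrames = 3 } = {}) {
  const segments = [];
  let start = -1;   // index of the first frame of the open segment, or -1 if none
  let silence = 0;  // consecutive non-speech frames seen inside the open segment

  probs.forEach((p, i) => {
    if (p >= threshold) {
      if (start < 0) start = i; // open a new segment
      silence = 0;
    } else if (start >= 0) {
      silence += 1;
      if (silence >= minSilenceFrames) {
        // Close the segment right after its last speech frame.
        const end = i - silence + 1;
        segments.push({ startMs: start * frameMs, endMs: end * frameMs });
        start = -1;
        silence = 0;
      }
    }
  });

  // Flush a segment that is still open when the audio ends.
  if (start >= 0) {
    const end = probs.length - silence;
    segments.push({ startMs: start * frameMs, endMs: end * frameMs });
  }
  return segments;
}
```

Short pauses (fewer than `minSilenceFrames` frames) stay inside a segment, which keeps a sentence from being split at every brief hesitation.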

Browser Requirements

  • Chrome 113+ or Edge 113+ (required for WebGPU)
  • Firefox and Safari do not yet have stable WebGPU support
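A page can check for WebGPU before downloading any models. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; the function name `hasWebGPU` and the injectable `nav` parameter are illustrative assumptions, not part of this app.

```javascript
// Returns true when the given navigator-like object exposes a usable WebGPU adapter.
// `nav` defaults to the global navigator and is a parameter only so the check
// can be exercised outside a browser.
async function hasWebGPU(nav = globalThis.navigator) {
  // navigator.gpu is absent in unsupported browsers and in insecure contexts.
  if (!nav || !nav.gpu) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await nav.gpu.requestAdapter();
    return Boolean(adapter);
  } catch {
    return false;
  }
}
```

Calling `hasWebGPU()` with no argument checks the real `navigator`; a false result is a reasonable cue to show an unsupported-browser message instead of starting the model download.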

Quick Start

git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git
cd granite-speech-webgpu
python3 -m http.server 8080

Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser.

For non-localhost access, use the HTTPS server, since browsers expose WebGPU and microphone capture only in secure contexts (HTTPS or localhost):

python3 serve.py

Architecture

The app uses Transformers.js v4 to run the full inference pipeline in roughly 30 lines of code:

  1. AutoProcessor handles audio preprocessing (mel spectrogram, frame stacking, normalization)
  2. GraniteSpeechForConditionalGeneration manages encoder, embeddings, and autoregressive decoding with KV-cache
  3. TextStreamer provides streaming token output
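The three steps above might fit together roughly as follows. This is a sketch under stated assumptions, not the code in app.js: `tf` stands for the imported Transformers.js module (passed in as a parameter so the flow can be tried with a stub), the `device: "webgpu"` option, the `processor(prompt, audio)` call shape, and the 256-token cap are assumptions, and the prompt text (including whatever audio placeholder the model expects) is left to the caller.

```javascript
// Sketch only. `tf` stands for the imported Transformers.js module.
const MODEL_ID = "onnx-community/granite-4.0-1b-speech-ONNX"; // listed under Models

async function transcribe(tf, audio, prompt, onPartial = () => {}) {
  // 1. Processor: audio preprocessing plus tokenization of the prompt.
  const processor = await tf.AutoProcessor.from_pretrained(MODEL_ID);

  // 2. Model: encoder, embeddings, and autoregressive decoder.
  //    Assumption: WebGPU is selected through the `device` option.
  const model = await tf.GraniteSpeechForConditionalGeneration.from_pretrained(MODEL_ID, {
    device: "webgpu",
  });

  // Assumption: the processor accepts (prompt, audio) and returns model inputs.
  const inputs = await processor(prompt, audio);

  // 3. Streamer: forwards decoded text chunks while tokens are being generated.
  let text = "";
  const streamer = new tf.TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function: (chunk) => {
      text += chunk;
      onPartial(text); // partial transcript so far
    },
  });

  // 256 is an arbitrary cap chosen for this sketch.
  await model.generate({ ...inputs, max_new_tokens: 256, streamer });
  return text;
}
```

Accumulating text in the streamer callback gives both the partial results for the UI and the final transcript, without a separate decoding pass at the end.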

Models

| Component | Source | Size | Purpose |
|---|---|---|---|
| Granite Speech (q4f16) | onnx-community/granite-4.0-1b-speech-ONNX | ~1.4 GB | Speech recognition & translation |
| Silero VAD | Local | 2.1 MB | Voice activity detection |
| Punctuation (EN) | 1-800-BAD-CODE | ~200 MB | English punctuation & capitalization |

Dependencies (loaded from CDN)

  • Transformers.js 4.0.0-next.7: Model loading, processing, and inference
  • ONNX Runtime Web 1.24.3: VAD and punctuation models (WASM)
  • tinyld: Language detection for automatic punctuation
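One way language detection could gate punctuation is sketched below. Since the only punctuation model listed is English, the sketch applies it only when the detected language is "en"; that gating rule is an assumption, as are the names `postProcess` and `punctuateEn`. `detect` is expected to behave like tinyld's `detect(text)`, which returns an ISO 639-1 code.

```javascript
// Hypothetical glue code. `detect` is a tinyld-style language detector and
// `punctuateEn` stands in for the English punctuation model; both are injected.
function postProcess(text, detect, punctuateEn) {
  const lang = detect(text); // e.g. "en" or "fr"
  // Assumption: transcripts in other languages are passed through unchanged.
  return lang === "en" ? punctuateEn(text) : text;
}
```

If the punctuation model runs asynchronously, `punctuateEn` can return a promise and the caller can `await` the result of `postProcess`.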

Project Structure

granite-speech-webgpu/
├── index.html          # Main HTML page
├── app.js              # Main app (Transformers.js v4 inference + UI)
├── vad.js              # Silero VAD integration (ONNX/WASM)
├── punctuator.js       # Punctuation models (ONNX/WASM)
├── style.css           # Styling
├── pcs_vocab.json      # Punctuator vocabulary
├── silero_vad.onnx     # VAD model
├── punct_cap_seg_en.onnx  # English punctuator model
└── serve.py            # HTTPS development server

Acknowledgments