metadata
title: Granite Speech WebGPU
emoji: 🎙️
colorFrom: green
colorTo: gray
sdk: static
app_file: index.html
pinned: false
Granite Speech WebGPU
Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with Transformers.js and WebGPU acceleration.
Your audio and transcription never leave your device.
Features
- Speech-to-Text: Transcribe audio in multiple languages
- Translation: Translate speech to English, French, German, Spanish, Portuguese, or Japanese
- Voice Activity Detection: Silero VAD for automatic speech segmentation
- Punctuation & Capitalization: Automatic post-processing, with the transcript language detected by tinyld
- Audio Input: Record from microphone or upload/drag-and-drop audio files
- Real-time Sync: Transcript appears synchronized with audio playback
- Streaming Output: Partial results displayed as tokens are generated
- Fully Client-Side: All processing happens in your browser using WebGPU
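As a rough illustration of the speech segmentation step listed above, the following sketch turns per-frame speech probabilities (the kind of output a VAD model such as Silero produces) into time segments. It is not the code in vad.js: the function name `segmentSpeech`, the 0.5 threshold, the 32 ms frame duration, and the silence hangover are all assumptions chosen for the example.

```javascript
// Hypothetical helper, not taken from vad.js.
// probs: array of per-frame speech probabilities in [0, 1].
// Returns [{ start, end }] in seconds, where `end` is exclusive.
function segmentSpeech(
  probs,
  { threshold = 0.5, frameSec = 0.032, minSilenceFrames = 3 } = {} // assumed defaults
) {
  const segments = [];
  let start = -1;  // first speech frame of the open segment, -1 when none is open
  let silence = 0; // consecutive non-speech frames seen inside the open segment

  for (let i = 0; i < probs.length; i++) {
    if (probs[i] >= threshold) {
      if (start < 0) start = i; // open a new segment
      silence = 0;              // short gaps are bridged
    } else if (start >= 0) {
      silence++;
      if (silence >= minSilenceFrames) {
        // Close the segment right after the last speech frame.
        segments.push({ start: start * frameSec, end: (i - silence + 1) * frameSec });
        start = -1;
        silence = 0;
      }
    }
  }

  // Flush a segment that is still open at the end of the audio.
  if (start >= 0) {
    segments.push({ start: start * frameSec, end: (probs.length - silence) * frameSec });
  }
  return segments;
}
```

Each returned segment could then be passed to the recognizer on its own, and its `start` and `end` times reused to line the transcript up with audio playback.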
Browser Requirements
- Chrome 113+ or Edge 113+ (required for WebGPU)
- Firefox and Safari do not yet have stable WebGPU support
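A page can check for WebGPU at startup before downloading any models. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; the wrapper name `detectWebGPU` and the shape of its return value are assumptions for illustration, not part of this app.

```javascript
// Hypothetical helper: reports whether WebGPU looks usable.
// `nav` defaults to the global navigator; it is a parameter so the
// function can also be exercised outside a browser.
async function detectWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) {
    return { supported: false, reason: "navigator.gpu is not available" };
  }
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) {
      return { supported: false, reason: "no suitable GPU adapter found" };
    }
    return { supported: true, reason: "WebGPU adapter available" };
  } catch (err) {
    return { supported: false, reason: `requestAdapter failed: ${err.message}` };
  }
}
```

In a browser, `detectWebGPU().then(({ supported, reason }) => { if (!supported) console.warn(reason); })` would be enough to show a warning instead of starting the model download.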
Quick Start
git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git
cd granite-speech-webgpu
python3 -m http.server 8080
Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser.
For non-localhost access, use the HTTPS server (browsers expose WebGPU and microphone capture only in secure contexts):
python3 serve.py
Architecture
The app uses Transformers.js v4 to run the full inference pipeline in ~30 lines:
- AutoProcessor: handles audio preprocessing (mel spectrogram, frame stacking, normalization)
- GraniteSpeechForConditionalGeneration: manages encoder, embeddings, and autoregressive decoding with KV-cache
- TextStreamer: provides streaming token output
Models
| Component | Source | Size | Purpose |
|---|---|---|---|
| Granite Speech (q4f16) | onnx-community/granite-4.0-1b-speech-ONNX | ~1.4 GB | Speech recognition & translation |
| Silero VAD | Local | 2.1 MB | Voice activity detection |
| Punctuation (EN) | 1-800-BAD-CODE | ~200 MB | English punctuation & capitalization |
Dependencies (loaded from CDN)
- Transformers.js 4.0.0-next.7: Model loading, processing, and inference
- ONNX Runtime Web 1.24.3: VAD and punctuation models (WASM)
- tinyld: Language detection for automatic punctuation
Project Structure
granite-speech-webgpu/
├── index.html              # Main HTML page
├── app.js                  # Main app (Transformers.js v4 inference + UI)
├── vad.js                  # Silero VAD integration (ONNX/WASM)
├── punctuator.js           # Punctuation models (ONNX/WASM)
├── style.css               # Styling
├── pcs_vocab.json          # Punctuator vocabulary
├── silero_vad.onnx         # VAD model
├── punct_cap_seg_en.onnx   # English punctuator model
└── serve.py                # HTTPS development server