---
title: Granite Speech WebGPU
emoji: 🎙️
colorFrom: green
colorTo: gray
sdk: static
app_file: index.html
pinned: false
---
# Granite Speech WebGPU
Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with [Transformers.js](https://huggingface.co/docs/transformers.js) and WebGPU acceleration.
**Your audio and transcription never leave your device.**
## Features
- **Speech-to-Text**: Transcribe audio in multiple languages
- **Translation**: Translate speech to English, French, German, Spanish, Portuguese, or Japanese
- **Voice Activity Detection**: Silero VAD for automatic speech segmentation
- **Punctuation & Capitalization**: Automatic post-processing, with the transcript language auto-detected via tinyld
- **Audio Input**: Record from microphone or upload/drag-and-drop audio files
- **Real-time Sync**: Transcript appears synchronized with audio playback
- **Streaming Output**: Partial results displayed as tokens are generated
- **Fully Client-Side**: All processing happens in your browser using WebGPU
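One way to keep a transcript in step with playback, as the Real-time Sync feature describes, is to look up the segment that covers the current playback time. The sketch below is an illustration only, not the code in `app.js`: the segment shape (`start`, `end`, `text`) and the function name `findActiveSegment` are hypothetical, and it assumes segments are sorted by start time and do not overlap.

```javascript
// Hypothetical segment shape: { start: seconds, end: seconds, text: string }.
// Assumes `segments` is sorted by `start` and segments do not overlap.
function findActiveSegment(segments, currentTime) {
  let lo = 0;
  let hi = segments.length - 1;
  // Binary search for the segment whose [start, end) interval contains currentTime.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const seg = segments[mid];
    if (currentTime < seg.start) {
      hi = mid - 1;
    } else if (currentTime >= seg.end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1; // playback position falls in a gap between segments
}
```

In a page, this could be called from the audio element's `timeupdate` handler with `audio.currentTime`, highlighting the segment at the returned index.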
## Browser Requirements
- **Chrome 113+** or **Edge 113+** (required for WebGPU)
- Firefox and Safari do not yet have stable WebGPU support
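Because WebGPU is required, it can help to check for it before starting the large model download. The sketch below uses the standard `navigator.gpu.requestAdapter()` entry point; the function name `detectWebGPU` and the shape of its return value are hypothetical, and the navigator object is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Sketch: report whether WebGPU looks usable.
// `nav` is a navigator-like object (pass `navigator` in a browser).
async function detectWebGPU(nav) {
  if (!nav || !nav.gpu) {
    return { supported: false, reason: 'navigator.gpu is not available in this browser' };
  }
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) {
      return { supported: false, reason: 'no suitable GPU adapter was found' };
    }
    return { supported: true, reason: '' };
  } catch (err) {
    return { supported: false, reason: String(err) };
  }
}
```

A page could call `detectWebGPU(navigator)` on load and show the `reason` string instead of the recorder UI when `supported` is false.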
## Quick Start
```bash
git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git
cd granite-speech-webgpu
python3 -m http.server 8080
```
Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser.
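For non-localhost access, use the HTTPS server. Browsers expose WebGPU and microphone capture only in secure contexts, and plain HTTP counts as secure only on localhost: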
```bash
python3 serve.py
```
## Architecture
The app uses [Transformers.js v4](https://huggingface.co/docs/transformers.js) to run the full inference pipeline in ~30 lines:
1. `AutoProcessor` handles audio preprocessing (mel spectrogram, frame stacking, normalization)
2. `GraniteSpeechForConditionalGeneration` manages encoder, embeddings, and autoregressive decoding with KV-cache
3. `TextStreamer` provides streaming token output
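The three steps above can be sketched roughly as follows. This is not the code in `app.js`. The class names and model id come from this README, but the prompt wording, the `processor(text, audio)` call order, the generation options, and the 16 kHz mono input are assumptions based on the general Transformers.js pattern for audio-language models and should be checked against the model card. The Transformers.js module is passed in as `tjs` so the sketch does not hard-code a CDN URL; `loadGranite` and `transcribe` are hypothetical names.

```javascript
// Model id from the Models table; the prompt wording is an assumption.
const MODEL_ID = 'onnx-community/granite-4.0-1b-speech-ONNX';
const PROMPT = '<|audio|>can you transcribe the speech into a written format?';

// Load the processor and model once; `tjs` is the imported Transformers.js module.
async function loadGranite(tjs) {
  const processor = await tjs.AutoProcessor.from_pretrained(MODEL_ID);
  const model = await tjs.GraniteSpeechForConditionalGeneration.from_pretrained(MODEL_ID, {
    dtype: 'q4f16',   // quantization listed in the Models table
    device: 'webgpu',
  });
  return { processor, model };
}

// `audio` is assumed to be a Float32Array of mono 16 kHz samples.
// `onText` receives each decoded chunk as generation proceeds.
async function transcribe(tjs, { processor, model }, audio, onText) {
  // Build the chat-formatted prompt (assumed call shape).
  const messages = [{ role: 'user', content: PROMPT }];
  const text = processor.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
  });

  // 1. Audio preprocessing and tokenization.
  const inputs = await processor(text, audio);

  // 3. Streaming output of partial results.
  const streamer = new tjs.TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function: onText,
  });

  // 2. Autoregressive decoding; returns the generated token ids.
  return model.generate({ ...inputs, max_new_tokens: 256, streamer });
}
```

Splitting loading from transcription means the model is fetched and compiled once, while each audio segment only pays for preprocessing and decoding.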
### Models
| Component | Source | Size | Purpose |
|-----------|--------|------|---------|
| Granite Speech (q4f16) | [onnx-community/granite-4.0-1b-speech-ONNX](https://huggingface.co/onnx-community/granite-4.0-1b-speech-ONNX) | ~1.4 GB | Speech recognition & translation |
| Silero VAD | Local | 2.1 MB | Voice activity detection |
| Punctuation (EN) | [1-800-BAD-CODE](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english) | ~200 MB | English punctuation & capitalization |
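A VAD model such as Silero VAD produces a speech probability for each short audio frame, and those probabilities still have to be grouped into segments before recognition. The sketch below shows one common way to do that, with a threshold and a minimum run of silence that closes a segment. It is a generic illustration rather than the logic in `vad.js`; the function name `probsToSegments`, the option names, and the default values are hypothetical.

```javascript
// Group per-frame speech probabilities into segments.
// Returns [{ startFrame, endFrame }] with endFrame exclusive.
// `threshold` and `minSilenceFrames` are hypothetical tuning parameters.
function probsToSegments(probs, { threshold = 0.5, minSilenceFrames = 3 } = {}) {
  const segments = [];
  let start = -1;      // first frame of the open segment, or -1 when none is open
  let silenceRun = 0;  // consecutive below-threshold frames inside the open segment

  for (let i = 0; i < probs.length; i++) {
    if (probs[i] >= threshold) {
      if (start === -1) start = i;
      silenceRun = 0; // short pauses are absorbed into the segment
    } else if (start !== -1) {
      silenceRun++;
      if (silenceRun >= minSilenceFrames) {
        // Close the segment just after its last speech frame.
        segments.push({ startFrame: start, endFrame: i - silenceRun + 1 });
        start = -1;
        silenceRun = 0;
      }
    }
  }
  // Close a segment that is still open when the audio ends.
  if (start !== -1) {
    segments.push({ startFrame: start, endFrame: probs.length - silenceRun });
  }
  return segments;
}
```

Frame indices convert to seconds by multiplying by the frame length in samples and dividing by the sample rate, both of which depend on how the VAD model is configured.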
### Dependencies (loaded from CDN)
- **Transformers.js 4.0.0-next.7**: Model loading, processing, and inference
- **ONNX Runtime Web 1.24.3**: VAD and punctuation models (WASM)
- **tinyld**: Language detection for automatic punctuation
## Project Structure
```
granite-speech-webgpu/
├── index.html             # Main HTML page
├── app.js                 # Main app (Transformers.js v4 inference + UI)
├── vad.js                 # Silero VAD integration (ONNX/WASM)
├── punctuator.js          # Punctuation models (ONNX/WASM)
├── style.css              # Styling
├── pcs_vocab.json         # Punctuator vocabulary
├── silero_vad.onnx        # VAD model
├── punct_cap_seg_en.onnx  # English punctuator model
└── serve.py               # HTTPS development server
```
## Acknowledgments
- [IBM Granite Speech](https://huggingface.co/ibm-granite/granite-4.0-1b-speech)
- [Transformers.js](https://huggingface.co/docs/transformers.js)
- [ONNX Community](https://huggingface.co/onnx-community)
- [Silero VAD](https://github.com/snakers4/silero-vad)
- [Punctuation Model](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english)
- [tinyld](https://github.com/komodojp/tinyld)