| title: Granite Speech WebGPU | |
| emoji: 🎙️ | |
| colorFrom: green | |
| colorTo: gray | |
| sdk: static | |
| app_file: index.html | |
| pinned: false | |
| # Granite Speech WebGPU | |
| Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with [Transformers.js](https://huggingface.co/docs/transformers.js) and WebGPU acceleration. | |
| **Your audio and transcription never leave your device.** | |
| ## Features | |
| - **Speech-to-Text**: Transcribe audio in multiple languages | |
| - **Translation**: Translate speech to English, French, German, Spanish, Portuguese, or Japanese | |
| - **Voice Activity Detection**: Silero VAD for automatic speech segmentation | |
| - **Punctuation & Capitalization**: Automatic post-processing of the transcript, with the language auto-detected via tinyld | |
| - **Audio Input**: Record from microphone or upload/drag-and-drop audio files | |
| - **Real-time Sync**: Transcript appears synchronized with audio playback | |
| - **Streaming Output**: Partial results displayed as tokens are generated | |
| - **Fully Client-Side**: All processing happens in your browser using WebGPU | |
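To illustrate the voice activity detection feature above, here is a minimal sketch of how per-frame speech probabilities could be turned into speech segments. It is not the logic in `vad.js`: the function name `probsToSegments`, the threshold, the 32 ms frame length (512 samples at 16 kHz), and the silence hangover are all assumptions chosen for illustration.

```javascript
// Hypothetical helper: convert per-frame speech probabilities (0..1) into
// [startSec, endSec] segments. All option names and defaults are assumptions.
function probsToSegments(
  probs,
  { threshold = 0.5, frameSec = 0.032, minSilenceFrames = 3 } = {}
) {
  const segments = [];
  let start = -1; // index of the first speech frame of the open segment, or -1
  let silence = 0; // consecutive non-speech frames seen inside the open segment

  for (let i = 0; i < probs.length; i++) {
    if (probs[i] >= threshold) {
      if (start < 0) start = i; // open a new segment
      silence = 0;
    } else if (start >= 0 && ++silence >= minSilenceFrames) {
      // Enough silence: close the segment just after the last speech frame.
      const end = i - silence + 1;
      segments.push([start * frameSec, end * frameSec]);
      start = -1;
      silence = 0;
    }
  }

  // Close a segment that is still open at the end of the audio,
  // excluding any trailing silence frames.
  if (start >= 0) {
    segments.push([start * frameSec, (probs.length - silence) * frameSec]);
  }
  return segments;
}
```

Each segment could then be passed to the recognizer on its own, which keeps individual inference calls short. The silence hangover (`minSilenceFrames`) prevents brief pauses inside a sentence from splitting it into several segments.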
| ## Browser Requirements | |
| - **Chrome 113+** or **Edge 113+** (required for WebGPU) | |
| - Firefox and Safari do not yet have stable WebGPU support | |
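A page can check for WebGPU before downloading any models. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; the function name `checkWebGPU` and the returned status strings are hypothetical and not taken from `app.js`.

```javascript
// Hypothetical helper: report WebGPU availability as a short status string.
// `nav` defaults to the browser's global `navigator`; passing it in keeps the
// helper testable outside a browser.
async function checkWebGPU(nav = globalThis.navigator) {
  // `navigator.gpu` is undefined in browsers without WebGPU support
  // and in insecure (non-HTTPS, non-localhost) contexts.
  if (!nav || !nav.gpu) return "unsupported";
  try {
    // requestAdapter() resolves to null when no suitable GPU adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "ok" : "no-adapter";
  } catch {
    return "no-adapter";
  }
}
```

A caller might show a "please use a recent Chrome or Edge" message for `"unsupported"` and a driver or hardware hint for `"no-adapter"`.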
| ## Quick Start | |
| ```bash | |
| git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git | |
| cd granite-speech-webgpu | |
| python3 -m http.server 8080 | |
| ``` | |
| Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser. | |
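| For access from another host, use the HTTPS server instead, because browsers expose WebGPU and microphone capture only in secure contexts (HTTPS or localhost): | |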
| ```bash | |
| python3 serve.py | |
| ``` | |
| ## Architecture | |
| The app uses [Transformers.js v4](https://huggingface.co/docs/transformers.js) to run the full inference pipeline in ~30 lines: | |
| 1. `AutoProcessor` handles audio preprocessing (mel spectrogram, frame stacking, normalization) | |
| 2. `GraniteSpeechForConditionalGeneration` manages encoder, embeddings, and autoregressive decoding with KV-cache | |
| 3. `TextStreamer` provides streaming token output | |
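The three components above could fit together roughly as follows. This is a sketch, not the code in `app.js`. It assumes the usual Transformers.js calling conventions (`from_pretrained`, `apply_chat_template`, `generate` with a `streamer`), and the `<|audio|>` prompt string, the `max_new_tokens` value, and the function names `loadGranite` and `transcribe` are assumptions. The library module is passed in as `tf` (for example, the result of a dynamic `import()` of the CDN build) so the snippet carries no import of its own.

```javascript
const MODEL_ID = "onnx-community/granite-4.0-1b-speech-ONNX";

// Load the processor and model once; `tf` is the Transformers.js module.
async function loadGranite(tf) {
  const processor = await tf.AutoProcessor.from_pretrained(MODEL_ID);
  const model = await tf.GraniteSpeechForConditionalGeneration.from_pretrained(
    MODEL_ID,
    { dtype: "q4f16", device: "webgpu" } // quantization as listed under Models
  );
  return { processor, model };
}

// Transcribe mono audio samples (Float32Array, 16 kHz assumed), calling
// `onText` with the accumulated transcript each time new text is streamed.
async function transcribe(tf, processor, model, audio, onText) {
  // Assumed prompt format: an audio placeholder followed by the instruction.
  const messages = [{
    role: "user",
    content: "<|audio|>can you transcribe the speech into a written format?",
  }];
  const text = processor.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
  });

  // 1. Audio preprocessing (mel spectrogram, frame stacking, normalization).
  const inputs = await processor(text, audio);

  // 3. Streaming output: the callback receives decoded text chunks.
  let transcript = "";
  const streamer = new tf.TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function: (chunk) => {
      transcript += chunk;
      onText(transcript);
    },
  });

  // 2. Encoder, embeddings, and autoregressive decoding.
  await model.generate({ ...inputs, max_new_tokens: 256, streamer });
  return transcript;
}
```

Separating `loadGranite` from `transcribe` means the ~1.4 GB model is loaded once and then reused for every VAD segment.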
| ### Models | |
| | Component | Source | Size | Purpose | | |
| |-----------|--------|------|---------| | |
| | Granite Speech (q4f16) | [onnx-community/granite-4.0-1b-speech-ONNX](https://huggingface.co/onnx-community/granite-4.0-1b-speech-ONNX) | ~1.4 GB | Speech recognition & translation | | |
| | Silero VAD | Local | 2.1 MB | Voice activity detection | | |
| | Punctuation (EN) | [1-800-BAD-CODE](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english) | ~200 MB | English punctuation & capitalization | | |
| ### Dependencies (loaded from CDN) | |
| - **Transformers.js 4.0.0-next.7**: Model loading, processing, and inference | |
| - **ONNX Runtime Web 1.24.3**: VAD and punctuation models (WASM) | |
| - **tinyld**: Language detection for automatic punctuation | |
| ## Project Structure | |
| ``` | |
| granite-speech-webgpu/ | |
| ├── index.html # Main HTML page | |
| ├── app.js # Main app (Transformers.js v4 inference + UI) | |
| ├── vad.js # Silero VAD integration (ONNX/WASM) | |
| ├── punctuator.js # Punctuation models (ONNX/WASM) | |
| ├── style.css # Styling | |
| ├── pcs_vocab.json # Punctuator vocabulary | |
| ├── silero_vad.onnx # VAD model | |
| ├── punct_cap_seg_en.onnx # English punctuator model | |
| └── serve.py # HTTPS development server | |
| ``` | |
| ## Acknowledgments | |
| - [IBM Granite Speech](https://huggingface.co/ibm-granite/granite-4.0-1b-speech) | |
| - [Transformers.js](https://huggingface.co/docs/transformers.js) | |
| - [ONNX Community](https://huggingface.co/onnx-community) | |
| - [Silero VAD](https://github.com/snakers4/silero-vad) | |
| - [Punctuation Model](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english) | |
| - [tinyld](https://github.com/komodojp/tinyld) | |