	---
	title: Granite Speech WebGPU
	emoji: 🎙️
	colorFrom: green
	colorTo: gray
	sdk: static
	app_file: index.html
	pinned: false
	---

	# Granite Speech WebGPU

	Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with [Transformers.js](https://huggingface.co/docs/transformers.js) and WebGPU acceleration.

	Your audio and transcripts never leave your device.

	## Features

	- Speech-to-Text: Transcribe audio in multiple languages
	- Translation: Translate speech to English, French, German, Spanish, Portuguese, or Japanese
	- Voice Activity Detection: Silero VAD for automatic speech segmentation
	- Punctuation & Capitalization: Automatic post-processing, with the language auto-detected by tinyld
	- Audio Input: Record from microphone or upload/drag-and-drop audio files
	- Real-time Sync: Transcript appears synchronized with audio playback
	- Streaming Output: Partial results displayed as tokens are generated
	- Fully Client-Side: All processing happens in your browser using WebGPU
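
To illustrate the voice activity detection feature, here is a minimal sketch of how per-frame speech probabilities (the kind of output a VAD model such as Silero VAD produces) could be grouped into speech segments. It is not taken from `vad.js`; the function name `segmentSpeech` and the `threshold` and `minSilenceFrames` parameters and their defaults are assumptions chosen for the example.

```javascript
// Hypothetical helper: turn per-frame speech probabilities into
// [startFrame, endFrame) segments. Not the app's actual implementation.
function segmentSpeech(probs, { threshold = 0.5, minSilenceFrames = 3 } = {}) {
  const segments = [];
  let start = -1;  // start index of the open segment, or -1 if none is open
  let silence = 0; // consecutive below-threshold frames inside the open segment

  for (let i = 0; i < probs.length; i++) {
    if (probs[i] >= threshold) {
      if (start === -1) start = i; // speech begins: open a segment
      silence = 0;                 // any speech frame resets the silence run
    } else if (start !== -1) {
      silence++;
      if (silence >= minSilenceFrames) {
        // Enough silence: close the segment at the first silent frame of the run.
        segments.push([start, i - silence + 1]);
        start = -1;
        silence = 0;
      }
    }
  }

  // Close a segment that is still open at the end, trimming trailing silence.
  if (start !== -1) segments.push([start, probs.length - silence]);
  return segments;
}

const probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7];
console.log(segmentSpeech(probs, { threshold: 0.5, minSilenceFrames: 2 }));
// → [ [ 1, 5 ], [ 8, 9 ] ]
```

The single low frame at index 3 is shorter than `minSilenceFrames`, so it does not split the first segment. Multiplying the frame indices by the frame duration would give segment boundaries in seconds.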

	## Browser Requirements

	- Chrome 113+ or Edge 113+ (required for WebGPU)
	- Firefox and Safari do not yet have stable WebGPU support
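
A page can check for WebGPU before downloading any models. The sketch below uses the standard `navigator.gpu.requestAdapter()` call; the `detectWebGPU` function name and the shape of its return value are assumptions for illustration, not part of the app.

```javascript
// Hypothetical helper: report whether WebGPU looks usable.
// `nav` defaults to the browser's `navigator`; passing it in explicitly
// makes the function testable outside a browser.
async function detectWebGPU(nav = globalThis.navigator) {
  // navigator.gpu is only defined in browsers that expose WebGPU,
  // and only in secure contexts (HTTPS or localhost).
  if (!nav || !nav.gpu) {
    return { supported: false, reason: "navigator.gpu is not available" };
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await nav.gpu.requestAdapter();
    if (!adapter) {
      return { supported: false, reason: "no GPU adapter found" };
    }
    return { supported: true, reason: "" };
  } catch (err) {
    return { supported: false, reason: String(err) };
  }
}

// Example use: log the result, or show a message instead of loading the models.
detectWebGPU().then((result) => {
  console.log(result.supported ? "WebGPU available" : `WebGPU unavailable: ${result.reason}`);
});
```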

	## Quick Start

	```bash
	git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git
	cd granite-speech-webgpu
	python3 -m http.server 8080
	```

	Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser.

	Browsers expose microphone access and WebGPU only in secure contexts (HTTPS or localhost), so for non-localhost access use the HTTPS server:

	```bash
	python3 serve.py
	```

	## Architecture

	The app uses [Transformers.js v4](https://huggingface.co/docs/transformers.js) to run the full inference pipeline in about 30 lines of code:

	1. `AutoProcessor` handles audio preprocessing (mel spectrogram, frame stacking, normalization)
	2. `GraniteSpeechForConditionalGeneration` manages encoder, embeddings, and autoregressive decoding with KV-cache
	3. `TextStreamer` provides streaming token output
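
The processor does the preprocessing internally, so the app never implements it by hand. To make the "frame stacking" step concrete, here is a minimal sketch of the general idea: consecutive feature frames are concatenated so the encoder sees fewer, wider frames. The `stackFrames` name and the stacking factor `k = 2` are assumptions for illustration and are not read from the model's configuration.

```javascript
// Hypothetical illustration of frame stacking: concatenate every `k`
// consecutive feature frames (e.g. mel-spectrogram rows) into one frame.
function stackFrames(frames, k = 2) {
  const stacked = [];
  // Step through the input in groups of k; a trailing partial group is
  // dropped so that every output frame has the same length.
  for (let i = 0; i + k <= frames.length; i += k) {
    stacked.push(frames.slice(i, i + k).flat());
  }
  return stacked;
}

// Five frames with two features each become two frames with four features.
const mel = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]];
console.log(stackFrames(mel, 2));
// → [ [ 1, 2, 3, 4 ], [ 5, 6, 7, 8 ] ]
```

Stacking by a factor of `k` divides the sequence length by `k`, which reduces the number of encoder steps at the cost of a wider input vector.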

	### Models

	| Component | Source | Size | Purpose |
	|-----------|--------|------|---------|
	| Granite Speech (q4f16) | [onnx-community/granite-4.0-1b-speech-ONNX](https://huggingface.co/onnx-community/granite-4.0-1b-speech-ONNX) | ~1.4 GB | Speech recognition & translation |
	| Silero VAD | Local | 2.1 MB | Voice activity detection |
	| Punctuation (EN) | [1-800-BAD-CODE](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english) | ~200 MB | English punctuation & capitalization |

	### Dependencies (loaded from CDN)

	- Transformers.js 4.0.0-next.7: Model loading, processing, and inference
	- ONNX Runtime Web 1.24.3: VAD and punctuation models (WASM)
	- tinyld: Language detection for automatic punctuation

	## Project Structure

	```
	granite-speech-webgpu/
	├── index.html # Main HTML page
	├── app.js # Main app (Transformers.js v4 inference + UI)
	├── vad.js # Silero VAD integration (ONNX/WASM)
	├── punctuator.js # Punctuation models (ONNX/WASM)
	├── style.css # Styling
	├── pcs_vocab.json # Punctuator vocabulary
	├── silero_vad.onnx # VAD model
	├── punct_cap_seg_en.onnx # English punctuator model
	└── serve.py # HTTPS development server
	```

	## Acknowledgments

	- [IBM Granite Speech](https://huggingface.co/ibm-granite/granite-4.0-1b-speech)
	- [Transformers.js](https://huggingface.co/docs/transformers.js)
	- [ONNX Community](https://huggingface.co/onnx-community)
	- [Silero VAD](https://github.com/snakers4/silero-vad)
	- [Punctuation Model](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english)
	- [tinyld](https://github.com/komodojp/tinyld)