gsaon committed on
Commit 9e600a5 · verified · 1 Parent(s): dbc1132

Upload 3 files

Files changed (3)
  1. README.md +88 -3
  2. app.js +90 -500
  3. index.html +3 -2
README.md CHANGED
@@ -1,8 +1,93 @@
  ---
  title: Granite Speech WebGPU
- emoji: 🗣
- colorFrom: blue
- colorTo: indigo
+ emoji: 🎙
+ colorFrom: green
+ colorTo: gray
  sdk: static
+ app_file: index.html
  pinned: false
  ---
+
+ # Granite Speech WebGPU
+
+ Browser-based speech recognition and translation using IBM Granite Speech 4.0 1B with [Transformers.js](https://huggingface.co/docs/transformers.js) and WebGPU acceleration.
+
+ **Your audio and transcription never leave your device.**
+
+ ## Features
+
+ - **Speech-to-Text**: Transcribe audio in multiple languages
+ - **Translation**: Translate speech to English, French, German, Spanish, Portuguese, or Japanese
+ - **Voice Activity Detection**: Silero VAD for automatic speech segmentation
+ - **Punctuation & Capitalization**: Automatic post-processing (auto-detected language via tinyld)
+ - **Audio Input**: Record from microphone or upload/drag-and-drop audio files
+ - **Real-time Sync**: Transcript appears synchronized with audio playback
+ - **Streaming Output**: Partial results displayed as tokens are generated
+ - **Fully Client-Side**: All processing happens in your browser using WebGPU
+
+ ## Browser Requirements
+
+ - **Chrome 113+** or **Edge 113+** (required for WebGPU)
+ - Firefox and Safari do not yet have stable WebGPU support
+
+ ## Quick Start
+
+ ```bash
+ git clone git@github.ibm.com:gsaon/granite-speech-webgpu.git
+ cd granite-speech-webgpu
+ python3 -m http.server 8080
+ ```
+
+ Open http://localhost:8080. Models (~1.4 GB) are downloaded automatically from Hugging Face on first load and cached by the browser.
+
+ For non-localhost access, use the HTTPS server:
+
+ ```bash
+ python3 serve.py
+ ```
+
+ ## Architecture
+
+ The app uses [Transformers.js v4](https://huggingface.co/docs/transformers.js) to run the full inference pipeline in ~30 lines:
+
+ 1. `AutoProcessor` handles audio preprocessing (mel spectrogram, frame stacking, normalization)
+ 2. `GraniteSpeechForConditionalGeneration` manages encoder, embeddings, and autoregressive decoding with KV-cache
+ 3. `TextStreamer` provides streaming token output
+
+ ### Models
+
+ | Component | Source | Size | Purpose |
+ |-----------|--------|------|---------|
+ | Granite Speech (q4f16) | [onnx-community/granite-4.0-1b-speech-ONNX](https://huggingface.co/onnx-community/granite-4.0-1b-speech-ONNX) | ~1.4 GB | Speech recognition & translation |
+ | Silero VAD | Local | 2.1 MB | Voice activity detection |
+ | Punctuation (EN) | [1-800-BAD-CODE](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english) | ~200 MB | English punctuation & capitalization |
+
+ ### Dependencies (loaded from CDN)
+
+ - **Transformers.js 4.0.0-next.7**: Model loading, processing, and inference
+ - **ONNX Runtime Web 1.24.3**: VAD and punctuation models (WASM)
+ - **tinyld**: Language detection for automatic punctuation
+
+ ## Project Structure
+
+ ```
+ granite-speech-webgpu/
+ ├── index.html            # Main HTML page
+ ├── app.js                # Main app (Transformers.js v4 inference + UI)
+ ├── vad.js                # Silero VAD integration (ONNX/WASM)
+ ├── punctuator.js         # Punctuation models (ONNX/WASM)
+ ├── style.css             # Styling
+ ├── pcs_vocab.json        # Punctuator vocabulary
+ ├── silero_vad.onnx       # VAD model
+ ├── punct_cap_seg_en.onnx # English punctuator model
+ └── serve.py              # HTTPS development server
+ ```
+
+ ## Acknowledgments
+
+ - [IBM Granite Speech](https://huggingface.co/ibm-granite/granite-4.0-1b-speech)
+ - [Transformers.js](https://huggingface.co/docs/transformers.js)
+ - [ONNX Community](https://huggingface.co/onnx-community)
+ - [Silero VAD](https://github.com/snakers4/silero-vad)
+ - [Punctuation Model](https://huggingface.co/1-800-BAD-CODE/punctuation_fullstop_truecase_english)
+ - [tinyld](https://github.com/komodojp/tinyld)
app.js CHANGED
@@ -1,76 +1,39 @@
  /**
  * Granite Speech WebGPU Demo
- * Uses ONNX Runtime Web for in-browser speech recognition
  */

- import { PreTrainedTokenizer } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.4.2';
  import { detect } from 'https://cdn.jsdelivr.net/npm/tinyld/+esm';

- // Check if ONNX Runtime is loaded
- if (typeof ort === 'undefined') {
- console.error('ONNX Runtime Web not loaded! Check if the script tag is correct.');
- alert('Failed to load ONNX Runtime. Please refresh the page.');
- } else {
- // Configure WASM paths to CDN
- ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.24.3/dist/';
-
- // WASM settings - enable multi-threading for encoder performance
- ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;
- ort.env.wasm.simd = true;
-
- // WebGPU settings
- ort.env.webgpu = ort.env.webgpu || {};
- }
-
- // Model paths
- // Granite Speech ONNX models hosted on HF Hub
- const HF_MODEL_BASE = 'https://huggingface.co/ibm-granite/granite-4.0-1b-speech/resolve/main/onnx';
- const ENCODER_PATH = `${HF_MODEL_BASE}/audio_encoder_q4f32.onnx`;
- const EMBED_PATH = `${HF_MODEL_BASE}/embed_tokens_q4f16.onnx`;
- const DECODER_PATH = `${HF_MODEL_BASE}/decoder_model_merged_q4f16.onnx`;
-
- // Audio config from preprocessor_config.json
  const SAMPLE_RATE = 16000;
- const N_MELS = 80;
- const N_FFT = 512;
- const HOP_LENGTH = 160;
- const WIN_LENGTH = 400;
-
- // Model config
- const HIDDEN_SIZE = 2048;
- const VOCAB_SIZE = 100353;
- const BOS_TOKEN = 100257;
- const EOS_TOKEN = 100257;
- const PAD_TOKEN = 100256;
  const MAX_NEW_TOKENS = 256;
- // Note: embedding_multiplier (12) is likely already applied in the model weights
-
- // Prompt templates
- const PROMPT_PREFIX = 'USER: ';
- const PROMPTS = {
- 'transcribe': 'Transcribe the speech to text\n ASSISTANT:',
- 'translate_en': 'Translate the speech to English\n ASSISTANT:',
- 'translate_fr': 'Translate the speech to French\n ASSISTANT:',
- 'translate_de': 'Translate the speech to German\n ASSISTANT:',
- 'translate_es': 'Translate the speech to Spanish\n ASSISTANT:',
- 'translate_pt': 'Translate the speech to Portuguese\n ASSISTANT:',
- 'translate_ja': 'Translate the speech to Japanese\n ASSISTANT:',
  };

  // State
- let encoderSession = null;
- let embedSession = null;
- let decoderSession = null;
- let tokenizer = null;
  let isModelLoading = false;
  let currentAudioData = null;

- // Pre-computed prompt embeddings (populated at init)
- const promptEmbeddings = {
- prefix: null, // "USER: "
- // suffix embeddings keyed by prompt name
- };
-
  // DOM Elements
  const statusDot = document.getElementById('statusDot');
  const statusText = document.getElementById('statusText');
@@ -145,130 +108,50 @@ async function checkWebGPU() {
  }
  }

- // Load tokenizer using transformers.js
- async function loadTokenizer() {
- const [tokenizerJson, tokenizerConfig] = await Promise.all([
- fetch('https://huggingface.co/ibm-granite/granite-4.0-1b-speech/resolve/main/tokenizer.json').then(r => r.json()),
- fetch('https://huggingface.co/ibm-granite/granite-4.0-1b-speech/resolve/main/tokenizer_config.json').then(r => r.json())
- ]);
- return new PreTrainedTokenizer(tokenizerJson, tokenizerConfig);
- }
-
- // Get embeddings for token IDs (returns Float32Array)
- async function getEmbeddings(tokenIds) {
- const idsTensor = new ort.Tensor('int64', BigInt64Array.from(tokenIds.map(BigInt)), [1, tokenIds.length]);
- const output = await embedSession.run({ input_ids: idsTensor });
- return {
- data: new Float32Array(output.inputs_embeds.data),
- seqLen: output.inputs_embeds.dims[1]
- };
- }
-
- // Pre-compute embeddings for all prompts
- async function precomputePromptEmbeddings() {
- // Prefix embedding
- const prefixTokens = tokenizer.encode(PROMPT_PREFIX, { add_special_tokens: false });
- promptEmbeddings.prefix = await getEmbeddings(prefixTokens);
-
- // Suffix embeddings for each prompt
- for (const [key, text] of Object.entries(PROMPTS)) {
- const tokens = tokenizer.encode(text, { add_special_tokens: false });
- promptEmbeddings[key] = await getEmbeddings(tokens);
- }
-
- console.log('Pre-computed embeddings for', Object.keys(promptEmbeddings).length, 'prompts');
- }
-
- // Session options - WebGPU only (no WASM fallback)
- const sessionOptions = {
- executionProviders: ['webgpu'],
- enableMemPattern: false,
- enableCpuMemArena: false,
- graphOptimizationLevel: 'basic',
- };
-
- // Force garbage collection pause
- async function gcPause() {
- // Give browser time to garbage collect
- await new Promise(resolve => setTimeout(resolve, 100));
- }
-
- // Load ONNX model with external data support
- async function loadModelWithExternalData(modelPath, options) {
- // Check if external data file exists
- const dataPath = modelPath.replace('.onnx', '.onnx_data');
-
- const modelResponse = await fetch(modelPath);
- const modelBuffer = await modelResponse.arrayBuffer();
-
- const dataResponse = await fetch(dataPath);
- if (!dataResponse.ok) {
- // No external data, load model directly
- return await ort.InferenceSession.create(modelBuffer, options);
- }
-
- const dataBuffer = await dataResponse.arrayBuffer();
-
- // Extract filename from path for external data reference
- const dataFileName = dataPath.split('/').pop();
-
- // Create session with external data
- const sessionOptionsWithData = {
- ...options,
- externalData: [
- {
- path: dataFileName,
- data: dataBuffer,
- }
- ]
- };
-
- return await ort.InferenceSession.create(modelBuffer, sessionOptionsWithData);
- }
-
- // Initialize ONNX Runtime and load models
  async function initModels() {
  if (isModelLoading) return;
  isModelLoading = true;

- setStatus('loading', 'Loading models...');
- showProgress(true);

  try {
- const hasWebGPU = await checkWebGPU();
-
- updateProgress(10, 'Initializing ONNX Runtime...');
-
- // Load tokenizer
- updateProgress(15, 'Loading tokenizer...');
- tokenizer = await loadTokenizer();
-
- // Load models one at a time with GC pauses between
- // Use loadModelWithExternalData to handle .onnx_data files
-
- // Load encoder model (q4f32 with WebGPU)
- updateProgress(20, 'Loading encoder model...');
- encoderSession = await loadModelWithExternalData(ENCODER_PATH, sessionOptions);
-
- await gcPause();
-
- // Load embed tokens model
- updateProgress(40, 'Loading embed tokens model...');
- embedSession = await loadModelWithExternalData(EMBED_PATH, sessionOptions);
-
- // Pre-compute prompt embeddings
- updateProgress(50, 'Pre-computing prompt embeddings...');
- await precomputePromptEmbeddings();
-
- await gcPause();
-
- // Load decoder model
- updateProgress(60, 'Loading decoder model...');
- decoderSession = await loadModelWithExternalData(DECODER_PATH, sessionOptions);
-
- updateProgress(100, 'Models loaded!');
- showProgress(false);
  setStatus('ready', 'Ready - Record or upload audio');
  enableControls(true);
@@ -277,7 +160,7 @@ async function initModels() {
  console.error('Error stack:', error?.stack);
  const errorMsg = error?.message || error?.toString() || 'Unknown error';
  setStatus('error', `Error: ${errorMsg}`);
- showProgress(false);
  isModelLoading = false;
  }
  }
@@ -287,335 +170,42 @@ function enableControls(enabled) {
  audioFile.disabled = !enabled;
  }

- // Mel spectrogram computation
- // Uses custom implementation matching torchaudio
- function computeMelSpectrogram(audioData) {
- // Pad signal with reflection (center=True, pad_mode='reflect')
- const padLength = Math.floor(N_FFT / 2);
- const paddedLength = audioData.length + 2 * padLength;
- const paddedAudio = new Float32Array(paddedLength);
-
- // Reflect padding at start: for position -i, use position i (not i-1)
- // numpy reflect: for index -1, reflects to index 1
- for (let i = 0; i < padLength; i++) {
- // Position -(i+1) reflects to position (i+1)
- const srcIdx = Math.min(i + 1, audioData.length - 1);
- paddedAudio[padLength - 1 - i] = audioData[srcIdx];
- }
- // Copy original audio
- for (let i = 0; i < audioData.length; i++) {
- paddedAudio[padLength + i] = audioData[i];
- }
- // Reflect padding at end
- for (let i = 0; i < padLength; i++) {
- const srcIdx = Math.max(0, audioData.length - 2 - i);
- paddedAudio[padLength + audioData.length + i] = audioData[srcIdx];
- }
-
- // Calculate number of frames
- const numFrames = Math.floor((paddedLength - N_FFT) / HOP_LENGTH) + 1;
-
- // Create mel filterbank (torchaudio HTK style)
- const melFilterbank = createMelFilterbank(N_FFT, N_MELS, SAMPLE_RATE);
-
- // Hann window (periodic=True like torchaudio)
- const window = new Float32Array(WIN_LENGTH);
- for (let i = 0; i < WIN_LENGTH; i++) {
- window[i] = 0.5 * (1 - Math.cos(2 * Math.PI * i / WIN_LENGTH));
- }
-
- const melSpec = new Float32Array(numFrames * N_MELS);
-
- // torch.stft center-pads the window when win_length < n_fft
- // Window is placed at indices padLeft to padLeft+win_length
- const padLeft = Math.floor((N_FFT - WIN_LENGTH) / 2); // = 56
-
- for (let frame = 0; frame < numFrames; frame++) {
- const start = frame * HOP_LENGTH;
-
- // Apply center-padded window (matching torch.stft behavior)
- // Read n_fft samples, apply window centered in the middle
- const windowed = new Float32Array(N_FFT); // initialized to zeros
- for (let i = 0; i < WIN_LENGTH; i++) {
- windowed[padLeft + i] = paddedAudio[start + padLeft + i] * window[i];
- }
-
- // Compute power spectrum
- const powerSpec = computePowerSpectrum(windowed);
-
- // Apply mel filterbank and log10
- for (let m = 0; m < N_MELS; m++) {
- let sum = 0;
- for (let k = 0; k < N_FFT / 2 + 1; k++) {
- sum += powerSpec[k] * melFilterbank[m * (N_FFT / 2 + 1) + k];
- }
- melSpec[frame * N_MELS + m] = Math.log10(Math.max(sum, 1e-10));
- }
- }
-
- return { data: melSpec, numFrames, numMels: N_MELS };
- }
-
- // Create mel filterbank (torchaudio HTK style)
- function createMelFilterbank(nfft, nMels, sampleRate) {
- const numBins = nfft / 2 + 1;
- const filterbank = new Float32Array(nMels * numBins);
-
- // HTK mel scale
- const hzToMel = (hz) => 2595 * Math.log10(1 + hz / 700);
- const melToHz = (mel) => 700 * (Math.pow(10, mel / 2595) - 1);
-
- const fMin = 0;
- const fMax = sampleRate / 2;
- const melMin = hzToMel(fMin);
- const melMax = hzToMel(fMax);
-
- // Create mel-spaced frequency points (n_mels + 2 points)
- const fPts = new Float32Array(nMels + 2);
- for (let i = 0; i < nMels + 2; i++) {
- fPts[i] = melToHz(melMin + (melMax - melMin) * i / (nMels + 1));
- }
-
- // Create frequency array for each FFT bin
- const allFreqs = new Float32Array(numBins);
- for (let i = 0; i < numBins; i++) {
- allFreqs[i] = i * sampleRate / nfft;
- }
-
- // Compute frequency differences
- const fDiff = new Float32Array(nMels + 1);
- for (let i = 0; i < nMels + 1; i++) {
- fDiff[i] = fPts[i + 1] - fPts[i];
- }
-
- // Create triangular filters using slopes (torchaudio style)
- for (let m = 0; m < nMels; m++) {
- for (let k = 0; k < numBins; k++) {
- const freq = allFreqs[k];
- const lowSlope = (freq - fPts[m]) / fDiff[m];
- const upSlope = (fPts[m + 2] - freq) / fDiff[m + 1];
- filterbank[m * numBins + k] = Math.max(0, Math.min(lowSlope, upSlope));
- }
- }
-
- return filterbank;
- }
-
- // Compute power spectrum using radix-2 FFT
- function computePowerSpectrum(signal) {
- const n = signal.length;
-
- // Use radix-2 FFT for power of 2 lengths
- if ((n & (n - 1)) === 0) {
- return computePowerSpectrumFFT(signal);
- }
-
- // Fallback to DFT for non-power-of-2
- const spectrum = new Float32Array(n / 2 + 1);
- for (let k = 0; k <= n / 2; k++) {
- let real = 0, imag = 0;
- for (let t = 0; t < n; t++) {
- const angle = -2 * Math.PI * k * t / n;
- real += signal[t] * Math.cos(angle);
- imag += signal[t] * Math.sin(angle);
- }
- spectrum[k] = real * real + imag * imag;
- }
- return spectrum;
- }
-
- // Radix-2 FFT for power spectrum
- function computePowerSpectrumFFT(signal) {
- const n = signal.length;
-
- // Bit-reversal permutation
- const real = new Float32Array(n);
- const imag = new Float32Array(n);
-
- for (let i = 0; i < n; i++) {
- let j = 0;
- let x = i;
- for (let k = 0; k < Math.log2(n); k++) {
- j = (j << 1) | (x & 1);
- x >>= 1;
- }
- real[j] = signal[i];
- }
-
- // Cooley-Tukey FFT
- for (let size = 2; size <= n; size *= 2) {
- const halfSize = size / 2;
- const step = Math.PI / halfSize;
-
- for (let i = 0; i < n; i += size) {
- for (let j = 0; j < halfSize; j++) {
- const angle = -j * step;
- const cos = Math.cos(angle);
- const sin = Math.sin(angle);
-
- const idx1 = i + j;
- const idx2 = i + j + halfSize;
-
- const tReal = cos * real[idx2] - sin * imag[idx2];
- const tImag = sin * real[idx2] + cos * imag[idx2];
-
- real[idx2] = real[idx1] - tReal;
- imag[idx2] = imag[idx1] - tImag;
- real[idx1] = real[idx1] + tReal;
- imag[idx1] = imag[idx1] + tImag;
- }
- }
- }
-
- // Compute power spectrum (first half + DC and Nyquist)
- const spectrum = new Float32Array(n / 2 + 1);
- for (let k = 0; k <= n / 2; k++) {
- spectrum[k] = real[k] * real[k] + imag[k] * imag[k];
- }
-
- return spectrum;
- }
-
- // Prepare audio features for encoder
- function prepareAudioFeatures(audioData) {
- const melSpec = computeMelSpectrogram(audioData);
-
- // Apply Granite Speech normalization:
- // 1. Already have log10 mel from computeMelSpectrogram
- // 2. Normalize: max(logmel, max - 8) / 4 + 1
- const logmel = melSpec.data;
- let maxVal = -Infinity;
- for (let i = 0; i < logmel.length; i++) {
- if (logmel[i] > maxVal) maxVal = logmel[i];
- }
-
- const normalized = new Float32Array(logmel.length);
- for (let i = 0; i < logmel.length; i++) {
- normalized[i] = Math.max(logmel[i], maxVal - 8) / 4 + 1;
- }
-
- // Remove last frame if odd
- let numFrames = melSpec.numFrames;
- if (numFrames % 2 === 1) {
- numFrames -= 1;
- }
-
- // Stack 2 consecutive frames -> 160 features (80 mels * 2)
- const stackedFrames = numFrames / 2;
- const features = new Float32Array(stackedFrames * 160);
-
- for (let t = 0; t < stackedFrames; t++) {
- // First frame of pair
- for (let m = 0; m < N_MELS; m++) {
- features[t * 160 + m] = normalized[(t * 2) * N_MELS + m];
- }
- // Second frame of pair
- for (let m = 0; m < N_MELS; m++) {
- features[t * 160 + N_MELS + m] = normalized[(t * 2 + 1) * N_MELS + m];
- }
- }
-
- return { data: features, shape: [1, stackedFrames, 160] };
- }
-
  // Transcribe a single audio segment and return the text
  async function transcribeSegment(audioSegment, onPartialResult) {
- // Prepare audio features
- const audioFeatures = prepareAudioFeatures(audioSegment);
-
- // Run encoder
- const encoderInput = new ort.Tensor('float32', audioFeatures.data, audioFeatures.shape);
- const encoderOutput = await encoderSession.run({ input_features: encoderInput });
- const audioEmbeddings = encoderOutput.audio_features;
-
- // Get pre-computed prompt embeddings
- const prefixEmbed = promptEmbeddings.prefix;
- const suffixEmbed = promptEmbeddings[promptSelect.value] || promptEmbeddings['transcribe'];
-
- // Concatenate embeddings using TypedArray.set()
- const prefixSeqLen = prefixEmbed.seqLen;
- const audioSeqLen = audioEmbeddings.dims[1];
- const suffixSeqLen = suffixEmbed.seqLen;
- const totalSeqLen = prefixSeqLen + audioSeqLen + suffixSeqLen;
-
- const combinedEmbeds = new Float32Array(totalSeqLen * HIDDEN_SIZE);
- combinedEmbeds.set(prefixEmbed.data, 0);
- combinedEmbeds.set(new Float32Array(audioEmbeddings.data), prefixSeqLen * HIDDEN_SIZE);
- combinedEmbeds.set(suffixEmbed.data, (prefixSeqLen + audioSeqLen) * HIDDEN_SIZE);
-
- // Autoregressive generation
- let generatedTokens = [];
- let currentEmbeds = combinedEmbeds;
- let currentSeqLen = totalSeqLen;
- let pastKeyValues = null;
- const numLayers = 40;
- let totalSeqLenSoFar = totalSeqLen;
-
- for (let step = 0; step < MAX_NEW_TOKENS; step++) {
- const attentionMask = new BigInt64Array(totalSeqLenSoFar).fill(1n);
-
- const embedsTensor = new ort.Tensor('float32', currentEmbeds, [1, currentSeqLen, HIDDEN_SIZE]);
- const maskTensor = new ort.Tensor('int64', attentionMask, [1, totalSeqLenSoFar]);
-
- const decoderInputs = {
- inputs_embeds: embedsTensor,
- attention_mask: maskTensor,
- };
-
- if (pastKeyValues) {
- for (let i = 0; i < numLayers; i++) {
- decoderInputs[`past_key_values.${i}.key`] = pastKeyValues[`present.${i}.key`];
- decoderInputs[`past_key_values.${i}.value`] = pastKeyValues[`present.${i}.value`];
- }
- } else {
- const emptyPast = new Uint16Array(0);
- for (let i = 0; i < numLayers; i++) {
- decoderInputs[`past_key_values.${i}.key`] = new ort.Tensor('float16', emptyPast, [1, 4, 0, 128]);
- decoderInputs[`past_key_values.${i}.value`] = new ort.Tensor('float16', emptyPast, [1, 4, 0, 128]);
- }
- }
-
- const decoderOutput = await decoderSession.run(decoderInputs);
- pastKeyValues = decoderOutput;
-
- const logitsFloat32 = Float32Array.from(decoderOutput.logits.data);
-
- // Extract logits for last position and find argmax
- const logitOffset = (currentSeqLen - 1) * VOCAB_SIZE;
- const lastLogits = logitsFloat32.subarray(logitOffset, logitOffset + VOCAB_SIZE);
-
- let nextToken = 0, maxVal = lastLogits[0];
- for (let i = 1; i < VOCAB_SIZE; i++) {
- if (lastLogits[i] > maxVal) { maxVal = lastLogits[i]; nextToken = i; }
- }
-
- // Avoid EOS on first token - take second best
- if (step === 0 && nextToken === EOS_TOKEN) {
- nextToken = 0; maxVal = -Infinity;
- for (let i = 0; i < VOCAB_SIZE; i++) {
- if (i !== EOS_TOKEN && lastLogits[i] > maxVal) { maxVal = lastLogits[i]; nextToken = i; }
  }
- }
-
- if (nextToken === EOS_TOKEN) {
- break;
- }
-
- generatedTokens.push(nextToken);
-
- // Callback for streaming updates
- if (onPartialResult) {
- onPartialResult(tokenizer.decode(generatedTokens));
- }

- const nextTokenTensor = new ort.Tensor('int64', BigInt64Array.from([BigInt(nextToken)]), [1, 1]);
- const nextEmbedOutput = await embedSession.run({ input_ids: nextTokenTensor });
- currentEmbeds = new Float32Array(nextEmbedOutput.inputs_embeds.data);
- currentSeqLen = 1;
- totalSeqLenSoFar += 1;
- }

- return tokenizer.decode(generatedTokens);
  }

  // Wait until audio playback reaches a specific time
@@ -634,7 +224,7 @@ function waitForPlaybackTime(targetTime) {

  // Run inference with segmentation and audio sync
  async function transcribe() {
- if (!encoderSession || !embedSession || !decoderSession || !currentAudioData) {
  setStatus('error', 'Model or audio not ready');
  return;
  }
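The deleted preprocessing above hand-rolled the torchaudio-style mel filterbank; in this commit that work moves into `AutoProcessor`. As a standalone reference, the HTK mel scale the removed `createMelFilterbank` relied on can be sketched and sanity-checked on its own (the variable names `edges`, `nMels`, `melMax` are illustrative; `edges` mirrors the removed `fPts` array):

```javascript
// HTK mel scale, as in the removed createMelFilterbank:
// mel = 2595 * log10(1 + hz / 700), and its inverse.
const hzToMel = (hz) => 2595 * Math.log10(1 + hz / 700);
const melToHz = (mel) => 700 * (Math.pow(10, mel / 2595) - 1);

// 80 mel bands over 0..8000 Hz (half the 16 kHz sample rate) need 82
// evenly mel-spaced edge frequencies; triangular filter m spans edges
// m, m+1, m+2, so adjacent filters overlap by half.
const nMels = 80;
const sampleRate = 16000;
const melMax = hzToMel(sampleRate / 2);
const edges = Array.from(
  { length: nMels + 2 },
  (_, i) => melToHz((melMax * i) / (nMels + 1))
);

console.log(edges[0], edges[nMels + 1]); // first edge 0 Hz, last edge near the 8000 Hz Nyquist
```

Because spacing is linear in mel rather than in Hz, the low-frequency triangles end up much narrower in Hz than the high-frequency ones, which is the point of the mel warping.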
 
  /**
  * Granite Speech WebGPU Demo
+ * Uses Transformers.js v4 for in-browser speech recognition
  */

+ import {
+ AutoProcessor,
+ GraniteSpeechForConditionalGeneration,
+ TextStreamer,
+ } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.7';
  import { detect } from 'https://cdn.jsdelivr.net/npm/tinyld/+esm';

+ // Model
+ const MODEL_ID = 'onnx-community/granite-4.0-1b-speech-ONNX';

+ // Audio config
  const SAMPLE_RATE = 16000;
  const MAX_NEW_TOKENS = 256;
+
+ // Task prompts — <|audio|> is expanded by the processor's chat template
+ const TASK_PROMPTS = {
+ 'transcribe': '<|audio|>Transcribe the speech to text',
+ 'translate_en': '<|audio|>Translate the speech to English',
+ 'translate_fr': '<|audio|>Translate the speech to French',
+ 'translate_de': '<|audio|>Translate the speech to German',
+ 'translate_es': '<|audio|>Translate the speech to Spanish',
+ 'translate_pt': '<|audio|>Translate the speech to Portuguese',
+ 'translate_ja': '<|audio|>Translate the speech to Japanese',
  };

  // State
+ let model = null;
+ let processor = null;
  let isModelLoading = false;
  let currentAudioData = null;

  // DOM Elements
  const statusDot = document.getElementById('statusDot');
  const statusText = document.getElementById('statusText');

  }
  }

+ // Initialize models using Transformers.js v4
  async function initModels() {
  if (isModelLoading) return;
  isModelLoading = true;

+ setStatus('loading', 'Loading processor...');

  try {
+ await checkWebGPU();
+
+ processor = await AutoProcessor.from_pretrained(MODEL_ID);
+
+ setStatus('loading', 'Downloading models...');
+ progressFill.style.width = '0%';
+ let lastProgressUpdate = 0;
+ const fileProgress = {};
+ model = await GraniteSpeechForConditionalGeneration.from_pretrained(MODEL_ID, {
+ dtype: {
+ audio_encoder: 'q4f16',
+ embed_tokens: 'q4f16',
+ decoder_model_merged: 'q4f16',
+ },
+ device: 'webgpu',
+ progress_callback: (progress) => {
+ if (progress.status === 'progress' && progress.total) {
+ fileProgress[progress.file] = { loaded: progress.loaded, total: progress.total };
+ const now = performance.now();
+ if (now - lastProgressUpdate < 100) return;
+ lastProgressUpdate = now;
+ let totalLoaded = 0, totalSize = 0;
+ for (const f of Object.values(fileProgress)) {
+ totalLoaded += f.loaded;
+ totalSize += f.total;
+ }
+ const pct = totalSize > 0 ? (totalLoaded / totalSize) * 100 : 0;
+ progressFill.style.width = `${pct}%`;
+ const mb = (totalLoaded / 1e6).toFixed(0);
+ const totalMb = (totalSize / 1e6).toFixed(0);
+ setStatus('loading', `Downloading models... ${mb} / ${totalMb} MB`);
+ }
+ },
+ });

+ progressFill.style.width = '0%';
  setStatus('ready', 'Ready - Record or upload audio');
  enableControls(true);

  console.error('Error stack:', error?.stack);
  const errorMsg = error?.message || error?.toString() || 'Unknown error';
  setStatus('error', `Error: ${errorMsg}`);
+ progressFill.style.width = '0%';
  isModelLoading = false;
  }
  }

  audioFile.disabled = !enabled;
  }

  // Transcribe a single audio segment and return the text
  async function transcribeSegment(audioSegment, onPartialResult) {
+ // Build prompt using chat template
+ const taskKey = promptSelect.value;
+ const content = TASK_PROMPTS[taskKey] || TASK_PROMPTS['transcribe'];
+ const messages = [{ role: 'user', content }];
+
+ const text = processor.tokenizer.apply_chat_template(messages, {
+ add_generation_prompt: true,
+ tokenize: false,
+ });

+ // Process text + audio into model inputs
+ const inputs = await processor(text, audioSegment, { sampling_rate: SAMPLE_RATE });
+
+ // Streaming via TextStreamer
+ let accumulated = '';
+ const streamer = new TextStreamer(processor.tokenizer, {
+ skip_prompt: true,
+ skip_special_tokens: true,
+ callback_function: (chunk) => {
+ accumulated += chunk;
+ if (onPartialResult) {
+ onPartialResult(accumulated);
  }
+ },
+ });

+ // Generate
+ await model.generate({
+ ...inputs,
+ max_new_tokens: MAX_NEW_TOKENS,
+ streamer,
+ });

+ return accumulated;
  }

  // Wait until audio playback reaches a specific time

  // Run inference with segmentation and audio sync
  async function transcribe() {
+ if (!model || !processor || !currentAudioData) {
  setStatus('error', 'Model or audio not ready');
  return;
  }
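The `progress_callback` in the new `initModels` receives one event per downloaded file, so the overall percentage has to be aggregated across files. That bookkeeping, extracted into a standalone sketch (`aggregateProgress` and the file names below are illustrative, not part of the app):

```javascript
// Aggregate per-file download progress into one overall percentage.
// `fileProgress` maps file name -> { loaded, total } in bytes, updated
// from each progress event, mirroring the logic inside progress_callback.
function aggregateProgress(fileProgress) {
  let totalLoaded = 0;
  let totalSize = 0;
  for (const f of Object.values(fileProgress)) {
    totalLoaded += f.loaded;
    totalSize += f.total;
  }
  // Guard against division by zero before any sizes are known.
  const pct = totalSize > 0 ? (totalLoaded / totalSize) * 100 : 0;
  return { totalLoaded, totalSize, pct };
}

// Example with made-up file names: one file half done, one a quarter done.
const state = {
  'decoder.onnx': { loaded: 500e6, total: 1000e6 },
  'encoder.onnx': { loaded: 100e6, total: 400e6 },
};
console.log(aggregateProgress(state).pct); // (600e6 / 1400e6) * 100, roughly 42.9
```

One caveat visible in the real callback: a file only enters `fileProgress` once its first event arrives, so the denominator grows as downloads start; the 100 ms throttle in the page merely limits DOM updates, not the bookkeeping.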
index.html CHANGED
@@ -120,16 +120,17 @@
  Made with
  <a href="https://huggingface.co/ibm-granite/granite-4.0-1b-speech" target="_blank">Granite Speech 4.0 1B</a>
  and
- <a href="https://onnxruntime.ai/docs/tutorials/web/" target="_blank">ONNX Runtime Web</a>
+ <a href="https://huggingface.co/docs/transformers.js" target="_blank">Transformers.js</a>
  <br>
  <span class="privacy-note">Your audio and transcription never leave your device</span>
  </div>
  <div class="gpu-info" id="gpuInfo"></div>
  </div>

+ <!-- ORT global is retained for VAD (vad.js) and punctuation (punctuator.js) which use WASM -->
  <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.24.3/dist/ort.all.min.js"></script>
  <script src="vad.js?v=1"></script>
  <script src="punctuator.js?v=3"></script>
- <script type="module" src="app.js?v=53"></script>
+ <script type="module" src="app.js?v=54"></script>
  </body>
  </html>