STT 1B EN/FR β€” Q4 WebGPU

Q4-quantized weights for kyutai/stt-1b-en_fr, packaged for client-side browser inference via WASM + WebGPU.

Runs entirely in the browser β€” no server required. English + French, streaming, ~1B parameters.

Try the demo β†’

Files

| File | Size | Description |
|------|------|-------------|
| `stt-1b-en_fr-q4_0.gguf` | 531 MB | STT transformer weights (Q4_0 quantized) |
| `mimi-encoder-f16.safetensors` | 107 MB | Mimi audio codec encoder (f16) |
| `tokenizer.model` | 118 KB | SentencePiece tokenizer (32k vocab, EN+FR) |

Usage

These weights are consumed by stt-web, a Rust/WASM + WebGPU speech-to-text engine built with Burn.

```js
import { SttClient } from './stt-client.js';

const stt = new SttClient({
    onTranscript: (text, isFinal) => console.log(text),
    onStatus: (text, ready) => console.log(text),
});

await stt.init();
await stt.startRecording();
```

Model weights are fetched from this repo automatically and cached by the browser.

Requirements

  • Chrome 113+ or Edge 113+ (WebGPU required)
  • HTTPS (required for WebGPU)
  • ~640 MB download on first load (cached afterward)
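Since the full weight download is ~640 MB, it is worth probing for WebGPU before fetching anything. A minimal sketch of such a check; the function name and return shape here are illustrative, not part of the stt-web API:

```javascript
// Sketch: probe for WebGPU support before attempting to load the weights.
// `checkWebGPU` and its return shape are illustrative, not part of stt-web.
async function checkWebGPU(nav = globalThis.navigator) {
  if (!nav?.gpu) {
    // `navigator.gpu` is missing on non-Chromium browsers and on plain-HTTP pages.
    return { ok: false, reason: 'WebGPU unavailable (need Chrome/Edge 113+ over HTTPS)' };
  }
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    // The API exists but no usable GPU adapter was found.
    return { ok: false, reason: 'No suitable GPU adapter' };
  }
  return { ok: true };
}
```

Calling this once at page load lets the app show a clear error message instead of failing mid-download.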

Pipeline

Microphone β†’ AudioWorklet (24kHz mono)
  β†’ Mimi codec [WASM, CPU] β†’ 32 codebook tokens/frame at 12.5Hz
    β†’ STT transformer [WASM, WebGPU] β†’ text tokens
      β†’ SentencePiece detokenizer β†’ transcript
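At 24 kHz with 12.5 frames/s, each Mimi frame covers 24000 / 12.5 = 1920 samples, while AudioWorklet callbacks deliver buffers of arbitrary size. The capture stage therefore has to re-chunk incoming audio into fixed 1920-sample frames; a sketch of that buffering (the class name is illustrative, not stt-web's actual implementation):

```javascript
// 24000 samples/s at 12.5 frames/s => 1920 samples per Mimi frame.
const SAMPLES_PER_FRAME = 24000 / 12.5;

// Illustrative re-chunker: accumulates arbitrary-sized Float32 buffers
// (as delivered by an AudioWorklet) and emits exact fixed-size frames.
class FrameChunker {
  constructor(frameSize = SAMPLES_PER_FRAME) {
    this.frameSize = frameSize;
    this.pending = new Float32Array(0); // leftover samples between calls
  }

  // Append new samples; return an array of complete frames (possibly empty).
  push(samples) {
    const merged = new Float32Array(this.pending.length + samples.length);
    merged.set(this.pending);
    merged.set(samples, this.pending.length);

    const frames = [];
    let offset = 0;
    while (merged.length - offset >= this.frameSize) {
      frames.push(merged.slice(offset, offset + this.frameSize));
      offset += this.frameSize;
    }
    this.pending = merged.slice(offset); // carry the remainder forward
    return frames;
  }
}
```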

Model Details

  • Base model: kyutai/stt-1b-en_fr by Kyutai
  • Architecture: Decoder-only transformer with delayed-streams modeling
  • Parameters: ~1B (STT) + ~25M (Mimi codec encoder)
  • Quantization: Q4_0 (4-bit) for STT transformer, f16 for Mimi codec
  • Languages: English, French
  • Streaming latency: ~500ms text delay (6 frames at 12.5Hz)
  • License: CC-BY 4.0 (same as original)

Quantization

The STT transformer weights were quantized from f32 to Q4_0 using a custom GGUF packer. Dequantization happens on-GPU via WGSL compute shaders at inference time. The Mimi codec encoder is stored at f16 as it runs on CPU via WASM.
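For reference, a standard GGUF Q4_0 block is 18 bytes: a little-endian f16 scale followed by 16 bytes holding 32 packed 4-bit weights, where each weight decodes as (nibble − 8) × scale. A CPU-side JavaScript sketch of the same decode the WGSL compute shaders perform on-GPU (function names are illustrative):

```javascript
// Decode an IEEE 754 half-precision value from its 16-bit representation.
function f16ToF32(h) {
  const sign = (h & 0x8000) ? -1 : 1;
  const exp = (h >> 10) & 0x1f;
  const frac = h & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24;            // subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity;   // inf / NaN
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Dequantize one Q4_0 block (18 bytes => 32 f32 weights).
// Layout: bytes 0-1 = little-endian f16 scale, bytes 2-17 = packed nibbles,
// with low nibbles holding weights 0-15 and high nibbles weights 16-31.
function dequantizeQ4_0(block) {
  const d = f16ToF32(block[0] | (block[1] << 8));
  const out = new Float32Array(32);
  for (let i = 0; i < 16; i++) {
    const b = block[2 + i];
    out[i]      = ((b & 0x0f) - 8) * d; // low nibble
    out[i + 16] = ((b >> 4) - 8) * d;   // high nibble
  }
  return out;
}
```

On the GPU the same arithmetic runs per-block inside the matmul shaders, so the full f32 weight tensor is never materialized in memory.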

Citation

If you use this model, please cite the original authors:

@techreport{kyutai2024stt,
    author = {Kyutai},
    title = {Speech-To-Text models},
    institution = {Kyutai},
    year = {2024},
    url = {https://huggingface.co/kyutai/stt-1b-en_fr},
}

Disclaimer

This is an independent port by idle intelligence, not affiliated with or endorsed by Kyutai Labs. Transcription quality may differ from the original PyTorch implementation due to quantization.
