---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcript with mic + screen/system audio meeting
---

# Transcription Comparison - For your meeting notes!

Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: Parakeet TDT v3 (best accuracy).

## Engines

| Engine | Type | Details |
|---|---|---|
| Parakeet TDT v3 (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| WhisperLiveKit | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| Voxtral-Mini-4B-Realtime-2602 | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| Nemotron Streaming | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |

## Speaker Identification

Two modes, depending on the "Speaker detection" checkbox:

### Speaker detection ON (pyannote diarization)

- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine-similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in the repo (`models/`), no HF_TOKEN needed
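The cosine-similarity post-merge can be sketched roughly as follows. This is a minimal illustration of the idea (not the app's actual code, and `merge_similar_speakers` is a hypothetical name): compute a mean embedding per diarization label, then relabel any pair whose cosine similarity exceeds the threshold.

```python
import numpy as np

def merge_similar_speakers(embeddings, labels, threshold=0.6):
    """Merge diarization labels whose mean embeddings have cosine
    similarity above `threshold` (illustrative sketch only)."""
    speakers = sorted(set(labels))
    # Mean embedding per speaker label
    means = {s: np.mean([e for e, l in zip(embeddings, labels) if l == s], axis=0)
             for s in speakers}

    # Map each label to a canonical label, merging similar pairs
    remap = {s: s for s in speakers}
    for i, a in enumerate(speakers):
        for b in speakers[i + 1:]:
            va, vb = means[a], means[b]
            cos = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if cos > threshold:
                remap[b] = remap[a]
    return [remap[l] for l in labels]
```

A real pipeline would merge on clustered segment embeddings rather than one vector per label, but the threshold test is the same.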

### Speaker detection OFF (dual-track routing)

- Records mic and screen as separate audio tracks
- Transcribes both in parallel, labelling segments as YOU (mic) vs SCREEN (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:

      YOU [00:01]: Bonjour, ceci est un test...
      SCREEN [00:24]: Les amis, arretez tout...
      YOU [00:43]: Fin de la video...

- Supported engines:

  | Engine | Dual-track | Method |
  |---|---|---|
  | Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
  | Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
  | Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
  | WhisperLiveKit | No | Streams live, can't split retroactively |
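The time-interleaving step is simple once each track yields timestamped segments. A minimal sketch, assuming each track produces `(start_seconds, text)` pairs (`interleave_tracks` is a hypothetical name, not the app's function):

```python
def interleave_tracks(mic_segments, screen_segments):
    """Merge two lists of (start_seconds, text) segments into one
    time-ordered, labelled transcript (sketch of the dual-track idea)."""
    def fmt(t):
        # [MM:SS] timestamp, matching the output shown above
        return f"[{int(t) // 60:02d}:{int(t) % 60:02d}]"

    merged = [(t, "YOU", txt) for t, txt in mic_segments]
    merged += [(t, "SCREEN", txt) for t, txt in screen_segments]
    merged.sort(key=lambda seg: seg[0])  # interleave by start time
    return "\n".join(f"{label} {fmt(t)}: {txt}" for t, label, txt in merged)
```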

## Features

- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)

## CLI

```
$ python app.py "meeting_recording.mp3"
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```

Uses Parakeet TDT v3 + pyannote diarization; words are aligned to speakers by greatest temporal overlap.
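Greatest-temporal-overlap alignment can be sketched as below, assuming word timestamps `(start, end, text)` and diarization turns `(start, end, speaker)` (`assign_speakers` is a hypothetical name, not the app's function):

```python
def assign_speakers(words, turns):
    """Assign each timestamped word to the diarization turn that
    overlaps it the most (illustrative sketch)."""
    out = []
    for ws, we, text in words:
        best, best_overlap = None, 0.0  # None if no turn overlaps the word
        for ts, te, speaker in turns:
            overlap = min(we, te) - max(ws, ts)  # negative when disjoint
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        out.append((text, best))
    return out
```

A word straddling a turn boundary goes to whichever speaker covers more of it, which is why the output stays stable even when word and turn boundaries disagree slightly.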

## Architecture

- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via an `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so the FastAPI endpoints take priority
- Voxtral audio capture via an AudioWorklet at 16 kHz with a chunked buffer (no O(n²) copy)
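The chunked-buffer pattern (which the app implements in JS inside an AudioWorklet) is shown here as a Python sketch: collect small chunks in a list and concatenate once at the end, so appending stays O(1) per chunk instead of the O(n²) total cost of regrowing a single array on every callback. `ChunkedAudioBuffer` is a hypothetical name.

```python
import numpy as np

class ChunkedAudioBuffer:
    """Collect audio in a list of chunks; concatenate once on flush.

    Repeatedly concatenating one growing array costs O(n^2) overall;
    this pattern keeps appends O(1) per chunk. (Python sketch of the
    idea, not the app's JS code.)"""

    def __init__(self):
        self._chunks = []

    def append(self, samples):
        # Each capture callback hands over a small fixed-size chunk.
        self._chunks.append(np.asarray(samples, dtype=np.float32))

    def flush(self):
        # Single O(n) concatenation when recording stops.
        if not self._chunks:
            return np.zeros(0, dtype=np.float32)
        out = np.concatenate(self._chunks)
        self._chunks.clear()
        return out
```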

## Limitations

- WhisperLiveKit large-v3-turbo processes slower than real time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing distinguishes YOU vs SCREEN, not individual speakers within the screen audio
- WebM recordings from MediaRecorder lack duration metadata