---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcript with mic + screen/system audio meeting
---
# Transcription Comparison - For your meeting notes!
Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: Parakeet TDT v3 (best accuracy).
## Engines
| Engine | Type | Details |
|---|---|---|
| Parakeet TDT v3 (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| WhisperLiveKit | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| Voxtral-Mini-4B-Realtime-2602 | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| Nemotron Streaming | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |
## Speaker Identification

Two modes, depending on the "Speaker detection" checkbox:

### Speaker detection ON (pyannote diarization)
- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in repo (`models/`), no HF_TOKEN needed
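The cosine-similarity post-merge can be sketched as below — a minimal illustration with hypothetical helper names and toy 3-dimensional embeddings, not the actual `app.py` code. Speakers whose centroid embeddings are more similar than the threshold are folded into one label:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_similar_speakers(centroids, threshold=0.6):
    """Map each diarized label to a canonical label, folding together
    speakers whose centroid cosine similarity exceeds the threshold."""
    labels = sorted(centroids)
    remap = {label: label for label in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if remap[b] != b:
                continue  # already merged into an earlier speaker
            if cosine(centroids[remap[a]], centroids[b]) > threshold:
                remap[b] = remap[a]
    return remap

# SPEAKER_02 is the same voice as SPEAKER_00, split by the pipeline:
centroids = {
    "SPEAKER_00": [1.0, 0.0, 0.1],
    "SPEAKER_01": [0.0, 1.0, 0.0],
    "SPEAKER_02": [0.9, 0.05, 0.1],
}
print(merge_similar_speakers(centroids))
# {'SPEAKER_00': 'SPEAKER_00', 'SPEAKER_01': 'SPEAKER_01', 'SPEAKER_02': 'SPEAKER_00'}
```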
### Speaker detection OFF (dual-track routing)
- Records mic and screen as separate audio tracks
- Transcribes both in parallel, labels as YOU (mic) vs SCREEN (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:

```
YOU [00:01]: Bonjour, ceci est un test...
SCREEN [00:24]: Les amis, arretez tout...
YOU [00:43]: Fin de la video...
```

- Supported engines:
| Engine | Dual-track | Method |
|---|---|---|
| Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
| Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
| Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
| WhisperLiveKit | No | Streams live, can't split retroactively |
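The time-interleaved output shown above can be produced by a merge as simple as this sketch (hypothetical segment format; the real `app.py` works from engine-specific timestamp structures):

```python
def fmt(t: float) -> str:
    # Seconds -> "MM:SS" label, matching the output format above.
    m, s = divmod(int(t), 60)
    return f"{m:02d}:{s:02d}"

def interleave(mic_segments, screen_segments):
    """Each segment is (start_seconds, text). Label by source track,
    then sort the combined list by start time."""
    merged = [(t, "YOU", txt) for t, txt in mic_segments]
    merged += [(t, "SCREEN", txt) for t, txt in screen_segments]
    merged.sort(key=lambda seg: seg[0])
    return [f"{who} [{fmt(t)}]: {txt}" for t, who, txt in merged]

lines = interleave(
    mic_segments=[(1, "Bonjour, ceci est un test..."), (43, "Fin de la video...")],
    screen_segments=[(24, "Les amis, arretez tout...")],
)
print("\n".join(lines))
```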
## Features
- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)
## CLI

```bash
python app.py "meeting_recording.mp3"
```

```
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```
Uses Parakeet TDT v3 + pyannote diarization. Words aligned to speakers by greatest temporal overlap.
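The greatest-temporal-overlap rule can be sketched like this (illustrative helper names, not the actual implementation):

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length of the intersection of two time intervals, 0 if disjoint.
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speaker(word_start, word_end, turns):
    """turns: list of (start, end, speaker) diarization segments.
    Returns the speaker whose turn overlaps the word the most."""
    best = max(turns, key=lambda t: overlap(word_start, word_end, t[0], t[1]))
    return best[2]

turns = [(0.0, 4.5, "SPEAKER 1"), (4.5, 7.8, "SPEAKER 2")]
print(assign_speaker(4.2, 4.9, turns))  # SPEAKER 2 (0.4 s overlap vs 0.3 s)
```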
## Architecture

- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via an `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so FastAPI endpoints take priority
- Voxtral audio capture via an AudioWorklet at 16 kHz with a chunked buffer (no O(n^2) copy)
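The chunked-buffer point is about avoiding quadratic copying: re-concatenating the whole recording on every incoming 128-sample AudioWorklet frame costs O(n^2) overall. The actual code is JavaScript inside the worklet; here is the same pattern rendered in Python with illustrative names:

```python
class ChunkedBuffer:
    """Accumulate fixed-size audio chunks; concatenate once at the end."""

    def __init__(self):
        self.chunks = []  # O(1) append per chunk, no re-copying
        self.length = 0

    def push(self, samples):
        self.chunks.append(samples)
        self.length += len(samples)

    def finalize(self):
        # Single O(n) pass instead of O(n^2) incremental concatenation.
        out = []
        for chunk in self.chunks:
            out.extend(chunk)
        return out

buf = ChunkedBuffer()
for _ in range(4):
    buf.push([0.0] * 128)  # AudioWorklet delivers 128-sample frames
print(buf.length)  # 512
```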
## Limitations

- WhisperLiveKit large-v3-turbo processes slower than real-time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing gives YOU vs SCREEN, not individual speakers within screen audio
- WebM recordings from MediaRecorder lack duration metadata