---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcript with mic + screen/system audio meeting
---

# Transcription Comparison - For your meeting notes!

Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: Parakeet TDT v3 (best accuracy).

## Engines

| Engine | Type | Details |
|---|---|---|
| Parakeet TDT v3 (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| WhisperLiveKit | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| Voxtral-Mini-4B-Realtime-2602 | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| Nemotron Streaming | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |

## Speaker Identification

Two modes, depending on the "Speaker detection" checkbox:

### Speaker detection ON (pyannote diarization)

- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine-similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in the repo (`models/`), no HF_TOKEN needed
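The cosine-similarity post-merge can be sketched roughly as follows. This is a minimal illustration of the idea (not the app's actual code, and `merge_similar_speakers` is a hypothetical name): compute a mean embedding per diarization label, then relabel any pair whose cosine similarity exceeds the threshold.

```python
import numpy as np

def merge_similar_speakers(embeddings, labels, threshold=0.6):
    """Merge diarization labels whose mean embeddings have cosine
    similarity above `threshold` (illustrative sketch only)."""
    speakers = sorted(set(labels))
    # Mean embedding per speaker label
    means = {s: np.mean([e for e, l in zip(embeddings, labels) if l == s], axis=0)
             for s in speakers}

    # Map each label to a canonical label, merging similar pairs
    remap = {s: s for s in speakers}
    for i, a in enumerate(speakers):
        for b in speakers[i + 1:]:
            va, vb = means[a], means[b]
            cos = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if cos > threshold:
                remap[b] = remap[a]
    return [remap[l] for l in labels]
```

A real pipeline would merge on clustered segment embeddings rather than one vector per label, but the threshold test is the same.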

### Speaker detection OFF (dual-track routing)

- Records mic and screen as separate audio tracks
- Transcribes both in parallel, labelling segments as YOU (mic) vs SCREEN (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:

      YOU [00:01]: Bonjour, ceci est un test...
      SCREEN [00:24]: Les amis, arretez tout...
      YOU [00:43]: Fin de la video...

- Supported engines:

  | Engine | Dual-track | Method |
  |---|---|---|
  | Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
  | Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
  | Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
  | WhisperLiveKit | No | Streams live, can't split retroactively |
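The time-interleaving step is simple once each track yields timestamped segments. A minimal sketch, assuming each track produces `(start_seconds, text)` pairs (`interleave_tracks` is a hypothetical name, not the app's function):

```python
def interleave_tracks(mic_segments, screen_segments):
    """Merge two lists of (start_seconds, text) segments into one
    time-ordered, labelled transcript (sketch of the dual-track idea)."""
    def fmt(t):
        # [MM:SS] timestamp, matching the output shown above
        return f"[{int(t) // 60:02d}:{int(t) % 60:02d}]"

    merged = [(t, "YOU", txt) for t, txt in mic_segments]
    merged += [(t, "SCREEN", txt) for t, txt in screen_segments]
    merged.sort(key=lambda seg: seg[0])  # interleave by start time
    return "\n".join(f"{label} {fmt(t)}: {txt}" for t, label, txt in merged)
```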

## Features

- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)

## CLI

```
$ python app.py "meeting_recording.mp3"
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```

Uses Parakeet TDT v3 + pyannote diarization; words are aligned to speakers by greatest temporal overlap.
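Greatest-temporal-overlap alignment can be sketched as below, assuming word timestamps `(start, end, text)` and diarization turns `(start, end, speaker)` (`assign_speakers` is a hypothetical name, not the app's function):

```python
def assign_speakers(words, turns):
    """Assign each timestamped word to the diarization turn that
    overlaps it the most (illustrative sketch)."""
    out = []
    for ws, we, text in words:
        best, best_overlap = None, 0.0  # None if no turn overlaps the word
        for ts, te, speaker in turns:
            overlap = min(we, te) - max(ws, ts)  # negative when disjoint
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        out.append((text, best))
    return out
```

A word straddling a turn boundary goes to whichever speaker covers more of it, which is why the output stays stable even when word and turn boundaries disagree slightly.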

## Architecture

- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via an `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so the FastAPI endpoints take priority
- Voxtral audio capture via an AudioWorklet at 16 kHz with a chunked buffer (no O(n²) copy)
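The chunked-buffer pattern (which the app implements in JS inside an AudioWorklet) is shown here as a Python sketch: collect small chunks in a list and concatenate once at the end, so appending stays O(1) per chunk instead of the O(n²) total cost of regrowing a single array on every callback. `ChunkedAudioBuffer` is a hypothetical name.

```python
import numpy as np

class ChunkedAudioBuffer:
    """Collect audio in a list of chunks; concatenate once on flush.

    Repeatedly concatenating one growing array costs O(n^2) overall;
    this pattern keeps appends O(1) per chunk. (Python sketch of the
    idea, not the app's JS code.)"""

    def __init__(self):
        self._chunks = []

    def append(self, samples):
        # Each capture callback hands over a small fixed-size chunk.
        self._chunks.append(np.asarray(samples, dtype=np.float32))

    def flush(self):
        # Single O(n) concatenation when recording stops.
        if not self._chunks:
            return np.zeros(0, dtype=np.float32)
        out = np.concatenate(self._chunks)
        self._chunks.clear()
        return out
```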

## Limitations

- WhisperLiveKit large-v3-turbo processes slower than real time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing distinguishes YOU vs SCREEN, not individual speakers within the screen audio
- WebM recordings from MediaRecorder lack duration metadata