---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcript with mic + screen/system audio meeting
---

# Transcription Comparison - For your meeting notes!

Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically.

Default engine: **Parakeet TDT v3** (best accuracy).

## Engines

| Engine | Type | Details |
|-|-|-|
| **Parakeet TDT v3** (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| **WhisperLiveKit** | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| **Voxtral-Mini-4B-Realtime-2602** | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| **Nemotron Streaming** | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |

## Speaker Identification

Two modes depending on "Speaker detection" checkbox:

### Speaker detection ON (pyannote diarization)

- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in repo (`models/`), no HF_TOKEN needed

### Speaker detection OFF (dual-track routing)

- Records mic and screen as **separate audio tracks**
- Transcribes both in parallel, labels as **YOU** (mic) vs **SCREEN** (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:

```
YOU [00:01]: Bonjour, ceci est un test...
SCREEN [00:24]: Les amis, arretez tout...
YOU [00:43]: Fin de la video...
```

- Supported engines:

| Engine | Dual-track | Method |
|-|-|-|
| Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
| Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
| Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
| WhisperLiveKit | No | Streams live, can't split retroactively |

## Features

- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)

## CLI

```bash
python app.py "meeting_recording.mp3"
```

```
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```

Uses Parakeet TDT v3 + pyannote diarization. Words aligned to speakers by greatest temporal overlap.

## Architecture

- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via `` trick (Gradio 6 strips `