---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcribe meetings with mic + screen/system audio
---
# Transcription Comparison - For your meeting notes!

Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: **Parakeet TDT v3** (best accuracy).
## Engines

| Engine | Type | Details |
|-|-|-|
| **Parakeet TDT v3** (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| **WhisperLiveKit** | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| **Voxtral-Mini-4B-Realtime-2602** | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| **Nemotron Streaming** | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |
## Speaker Identification

Two modes, depending on the "Speaker detection" checkbox:

### Speaker detection ON (pyannote diarization)

- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine-similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in the repo (`models/`), no HF_TOKEN needed
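The cosine-similarity post-merge can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names, the plain-list embedding format, and the greedy merge strategy are assumptions; only the cosine metric and the 0.6 threshold come from the description above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain float lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_speakers(embeddings, threshold=0.6):
    """Map each diarized speaker label to a canonical label, merging any
    pair whose embeddings exceed the similarity threshold."""
    labels = sorted(embeddings)
    canonical = {lab: lab for lab in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) > threshold:
                # Re-point b (and anything already merged into b) at a's root
                root = canonical[a]
                merged = canonical[b]
                for lab in labels:
                    if canonical[lab] == merged:
                        canonical[lab] = root
    return canonical
```

With two near-identical embeddings and one orthogonal one, the first two collapse into a single speaker while the third stays separate.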
### Speaker detection OFF (dual-track routing)

- Records mic and screen as **separate audio tracks**
- Transcribes both in parallel, labels segments as **YOU** (mic) vs **SCREEN** (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:

```
YOU [00:01]: Bonjour, ceci est un test...
SCREEN [00:24]: Les amis, arretez tout...
YOU [00:43]: Fin de la video...
```
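The time-interleaving step above can be sketched like this, assuming each track's transcript arrives as a list of `(start_seconds, text)` segments (the function names and tuple layout are illustrative assumptions):

```python
def fmt(seconds):
    """Format seconds as the [MM:SS] tag used in the output."""
    m, s = divmod(int(seconds), 60)
    return f"[{m:02d}:{s:02d}]"

def interleave(mic_segments, screen_segments):
    """Merge two per-track transcripts into one chronological transcript.

    Each input is a list of (start_seconds, text) tuples; mic lines are
    labeled YOU and screen lines SCREEN, then all are sorted by start time.
    """
    tagged = [("YOU", t, txt) for t, txt in mic_segments]
    tagged += [("SCREEN", t, txt) for t, txt in screen_segments]
    tagged.sort(key=lambda item: item[1])
    return [f"{label} {fmt(t)}: {txt}" for label, t, txt in tagged]
```

For example, `interleave([(1, "Hello..."), (43, "End...")], [(24, "Stop...")])` produces the YOU/SCREEN/YOU ordering shown above.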
- Supported engines:

| Engine | Dual-track | Method |
|-|-|-|
| Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
| Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
| Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
| WhisperLiveKit | No | Streams live, can't split retroactively |
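The energy-based routing used for Voxtral amounts to: for each transcribed chunk, compare the RMS energy of the mic and screen tracks over that chunk's span and attribute the text to the louder side. A minimal sketch, with function names and sample layout as illustrative assumptions:

```python
import math

def rms(samples):
    """Root-mean-square energy of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def route_chunk(mic_samples, screen_samples):
    """Label a chunk YOU or SCREEN by whichever track is louder.

    Approximate by design: crosstalk or simultaneous speech on both
    tracks can misroute a chunk, hence the table's "approximate".
    """
    return "YOU" if rms(mic_samples) >= rms(screen_samples) else "SCREEN"
```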
## Features

- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)
## CLI

```bash
python app.py "meeting_recording.mp3"
```

```
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```

Uses Parakeet TDT v3 + pyannote diarization. Words are aligned to speakers by greatest temporal overlap.
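Greatest-temporal-overlap alignment means: for each timestamped word, pick the diarization turn whose time interval intersects it the most. A minimal sketch, assuming `(start, end, text)` words and `(start, end, speaker)` turns (these shapes are assumptions, not the app's exact data model):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """words: [(start, end, text)]; turns: [(start, end, speaker_label)].

    Returns [(speaker_label, text)], picking for each word the turn with
    the greatest temporal overlap (None if no turn overlaps the word).
    """
    out = []
    for w_start, w_end, text in words:
        best, best_ov = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        out.append((best, text))
    return out
```

A word straddling a turn boundary goes to whichever turn covers more of it, which is why boundary words occasionally land on the wrong speaker.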
## Architecture

- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via an `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so FastAPI endpoints take priority
- Voxtral audio capture via an AudioWorklet at 16 kHz with a chunked buffer (no O(n^2) copy)
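The chunked-buffer point is a general pattern, shown here in Python rather than the worklet's JavaScript: appending fixed-size chunks to a list and joining once is linear overall, while growing one buffer by repeated concatenation copies everything accumulated so far on every chunk, for O(n^2) total work.

```python
def collect_chunks(chunks):
    """Linear: keep chunks in a list, concatenate once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)   # O(1) amortized per chunk
    return b"".join(parts)    # single O(n) pass at the end

def collect_naive(chunks):
    """Quadratic: each concatenation copies the entire buffer so far."""
    buf = b""
    for chunk in chunks:
        buf = buf + chunk     # O(len(buf)) copy on every iteration
    return buf
```

Both return the same bytes; only the accumulation cost differs, which matters when audio chunks arrive many times per second for a long recording.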
## Limitations

- WhisperLiveKit large-v3-turbo processes slower than real-time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing yields YOU vs SCREEN, not individual speakers within the screen audio
- WebM recordings from MediaRecorder lack duration metadata