---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcribe meetings with mic + screen/system audio
---
# Transcription Comparison - For your meeting notes!
Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: **Parakeet TDT v3** (best accuracy).
## Engines
| Engine | Type | Details |
|-|-|-|
| **Parakeet TDT v3** (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| **WhisperLiveKit** | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| **Voxtral-Mini-4B-Realtime-2602** | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| **Nemotron Streaming** | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |
## Speaker Identification
Two modes, selected by the "Speaker detection" checkbox:
### Speaker detection ON (pyannote diarization)
- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in repo (`models/`), no HF_TOKEN needed
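The cosine-similarity post-merge can be sketched as follows. This is a minimal illustration of the idea, not the actual code in `app.py`: the embedding values and speaker labels below are made up, and the real pipeline computes centroids from wespeaker embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_speakers(embeddings, threshold=0.6):
    """Greedily relabel speakers whose embeddings exceed the
    similarity threshold, collapsing same-speaker splits."""
    labels = sorted(embeddings)
    mapping = {lab: lab for lab in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if mapping[b] == b and cosine(embeddings[a], embeddings[b]) >= threshold:
                mapping[b] = mapping[a]
    return mapping

# Two near-identical voices merge; a distinct one stays separate.
emb = {
    "SPEAKER_00": [1.0, 0.0, 0.1],
    "SPEAKER_01": [0.9, 0.05, 0.12],  # same voice, split by diarization
    "SPEAKER_02": [0.0, 1.0, 0.0],
}
print(merge_speakers(emb))
```

With the 0.6 threshold, `SPEAKER_01` is relabeled to `SPEAKER_00` while `SPEAKER_02` keeps its own label.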
### Speaker detection OFF (dual-track routing)
- Records mic and screen as **separate audio tracks**
- Transcribes both in parallel, labels as **YOU** (mic) vs **SCREEN** (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:
```
YOU [00:01]: Hello, this is a test...
SCREEN [00:24]: Guys, stop everything...
YOU [00:43]: End of the video...
```
- Supported engines:
| Engine | Dual-track | Method |
|-|-|-|
| Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
| Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
| Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
| WhisperLiveKit | No | Streams live, can't split retroactively |
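The time-interleaved output above can be sketched like this. It is an illustrative reimplementation, not the code in `app.py`; the segment dict shape (`start`/`text` keys) is an assumption.

```python
def interleave(mic_segments, screen_segments):
    """Merge two independently transcribed tracks into one timeline,
    labeling each segment with its source track."""
    tagged = [("YOU", s) for s in mic_segments] + \
             [("SCREEN", s) for s in screen_segments]
    tagged.sort(key=lambda t: t[1]["start"])  # order purely by start time
    lines = []
    for label, seg in tagged:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"{label} [{m:02d}:{s:02d}]: {seg['text']}")
    return "\n".join(lines)

mic = [{"start": 1, "text": "Hello, this is a test..."},
       {"start": 43, "text": "End of the video..."}]
screen = [{"start": 24, "text": "Guys, stop everything..."}]
print(interleave(mic, screen))
```

Because the two tracks are physically separate recordings, no diarization model is involved: the label comes from the track, and only ordering uses timestamps.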
## Features
- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)
## CLI
```bash
python app.py "meeting_recording.mp3"
```
```
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```
Uses Parakeet TDT v3 + pyannote diarization. Words aligned to speakers by greatest temporal overlap.
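The greatest-temporal-overlap rule can be sketched as below. This is a hedged illustration of the alignment step, not the actual `app.py` code; the `words`/`turns` dict shapes are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Assign each timestamped word to the diarization turn it
    overlaps most in time."""
    out = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w["start"], w["end"],
                                                t["start"], t["end"]))
        out.append((best["speaker"], w["word"]))
    return out

turns = [{"speaker": "SPEAKER 1", "start": 0.0, "end": 4.0},
         {"speaker": "SPEAKER 2", "start": 4.0, "end": 8.0}]
words = [{"word": "Hello", "start": 0.5, "end": 0.9},
         {"word": "Thanks", "start": 4.2, "end": 4.6},
         {"word": "bridge", "start": 3.8, "end": 4.3}]  # straddles the turn boundary
print(assign_speakers(words, turns))
```

The boundary-straddling word goes to SPEAKER 2 because 0.3 s of it falls in the second turn versus 0.2 s in the first.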
## Architecture
- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so FastAPI endpoints take priority
- Voxtral audio capture via AudioWorklet at 16kHz with chunked buffer (no O(n^2) copy)
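The chunked-buffer idea (the actual capture code is JavaScript in an AudioWorklet) can be shown with a Python sketch: appending to a list of chunks and concatenating once is O(n) overall, whereas re-copying the full buffer on every append is O(n^2). The class name and sizes here are illustrative.

```python
class ChunkedBuffer:
    """Accumulate fixed-size audio frames and concatenate once at
    flush, avoiding a full-buffer copy on every append."""
    def __init__(self):
        self.chunks = []
        self.length = 0

    def append(self, samples):
        self.chunks.append(samples)  # O(1) amortized; no re-copy
        self.length += len(samples)

    def flush(self):
        # Single O(n) pass over all buffered samples.
        out = [s for chunk in self.chunks for s in chunk]
        self.chunks, self.length = [], 0
        return out

buf = ChunkedBuffer()
for _ in range(3):
    buf.append([0.0] * 128)  # 128-sample worklet frames at 16 kHz
audio = buf.flush()
print(len(audio))  # 384
```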
## Limitations
- WhisperLiveKit's large-v3-turbo runs slower than real time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing gives YOU vs SCREEN, not individual speakers within screen audio
- WebM recordings from MediaRecorder lack duration metadata