---
title: WhisperLiveKit
emoji: 🎙️
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
short_description: Record + transcribe meetings with mic + screen/system audio
---
# Transcription Comparison - For your meeting notes!
Record meetings with mic + screen/system audio, compare 4 transcription engines side by side, and identify speakers automatically. Default engine: **Parakeet TDT v3** (best accuracy).
## Engines
| Engine | Type | Details |
|-|-|-|
| **Parakeet TDT v3** (default) | Batch, server CPU | onnx-asr, 25 languages, best accuracy overall |
| **WhisperLiveKit** | Real-time WebSocket, server CPU | Whisper large-v3-turbo, SimulStreaming |
| **Voxtral-Mini-4B-Realtime-2602** | Browser WebGPU | ONNX q4f16 via transformers.js, zero server cost |
| **Nemotron Streaming** | Batch, server CPU | sherpa-onnx int8, English only, fastest processing |
## Speaker Identification
Two modes, selected by the "Speaker detection" checkbox:
### Speaker detection ON (pyannote diarization)
- pyannote speaker-diarization-3.1 pipeline with wespeaker embeddings
- Cosine similarity post-merge (threshold 0.6) to fix same-speaker splitting
- Browser fallback for Voxtral: pyannote-segmentation-3.0 ONNX (Xenova method, max 3 speakers)
- Models bundled in repo (`models/`), no HF_TOKEN needed
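The cosine-similarity post-merge can be sketched as follows. This is a minimal illustration of the idea, not the actual code in `app.py`: the embedding values and speaker labels below are made up, and the real pipeline computes centroids from wespeaker embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_speakers(embeddings, threshold=0.6):
    """Greedily relabel speakers whose embeddings exceed the
    similarity threshold, collapsing same-speaker splits."""
    labels = sorted(embeddings)
    mapping = {lab: lab for lab in labels}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if mapping[b] == b and cosine(embeddings[a], embeddings[b]) >= threshold:
                mapping[b] = mapping[a]
    return mapping

# Two near-identical voices merge; a distinct one stays separate.
emb = {
    "SPEAKER_00": [1.0, 0.0, 0.1],
    "SPEAKER_01": [0.9, 0.05, 0.12],  # same voice, split by diarization
    "SPEAKER_02": [0.0, 1.0, 0.0],
}
print(merge_speakers(emb))
```

With the 0.6 threshold, `SPEAKER_01` is relabeled to `SPEAKER_00` while `SPEAKER_02` keeps its own label.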
### Speaker detection OFF (dual-track routing)
- Records mic and screen as **separate audio tracks**
- Transcribes both in parallel, labels as **YOU** (mic) vs **SCREEN** (other people)
- No AI diarization needed, guaranteed separation
- Output interleaved by time with timestamps:
```
YOU [00:01]: Hello, this is a test...
SCREEN [00:24]: Guys, stop everything...
YOU [00:43]: End of the video...
```
- Supported engines:
| Engine | Dual-track | Method |
|-|-|-|
| Parakeet TDT v3 | Yes | Parallel transcription, word timestamps, precise |
| Nemotron Streaming | Yes | Parallel transcription, token timestamps, precise |
| Voxtral-Mini-4B | Yes | Energy-based routing (mic vs screen RMS), approximate |
| WhisperLiveKit | No | Streams live, can't split retroactively |
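The time-interleaved output above can be sketched like this. It is an illustrative reimplementation, not the code in `app.py`; the segment dict shape (`start`/`text` keys) is an assumption.

```python
def interleave(mic_segments, screen_segments):
    """Merge two independently transcribed tracks into one timeline,
    labeling each segment with its source track."""
    tagged = [("YOU", s) for s in mic_segments] + \
             [("SCREEN", s) for s in screen_segments]
    tagged.sort(key=lambda t: t[1]["start"])  # order purely by start time
    lines = []
    for label, seg in tagged:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"{label} [{m:02d}:{s:02d}]: {seg['text']}")
    return "\n".join(lines)

mic = [{"start": 1, "text": "Hello, this is a test..."},
       {"start": 43, "text": "End of the video..."}]
screen = [{"start": 24, "text": "Guys, stop everything..."}]
print(interleave(mic, screen))
```

Because the two tracks are physically separate recordings, no diarization model is involved: the label comes from the track, and only ordering uses timestamps.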
## Features
- Toggle-based engine selection (activate any combination)
- Screen/system audio capture (Chrome, enabled by default)
- Webcam emotion detection (MobileViT-XXS ONNX, browser-side)
- File upload for testing without recording
- Audio download link after recording
- Word-level timestamp alignment (Parakeet `.with_timestamps()` + overlap-based speaker matching)
## CLI
```bash
python app.py "meeting_recording.mp3"
```
```
SPEAKER 1 [00:01]: Hello everyone, let's start the meeting.
SPEAKER 2 [00:05]: Thanks for organizing this.
SPEAKER 1 [00:08]: First item on the agenda...
```
Uses Parakeet TDT v3 + pyannote diarization. Words aligned to speakers by greatest temporal overlap.
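The greatest-temporal-overlap rule can be sketched as below. This is a hedged illustration of the alignment step, not the actual `app.py` code; the `words`/`turns` dict shapes are assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """Assign each timestamped word to the diarization turn it
    overlaps most in time."""
    out = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w["start"], w["end"],
                                                t["start"], t["end"]))
        out.append((best["speaker"], w["word"]))
    return out

turns = [{"speaker": "SPEAKER 1", "start": 0.0, "end": 4.0},
         {"speaker": "SPEAKER 2", "start": 4.0, "end": 8.0}]
words = [{"word": "Hello", "start": 0.5, "end": 0.9},
         {"word": "Thanks", "start": 4.2, "end": 4.6},
         {"word": "bridge", "start": 3.8, "end": 4.3}]  # straddles the turn boundary
print(assign_speakers(words, turns))
```

The boundary-straddling word goes to SPEAKER 2 because 0.3 s of it falls in the second turn versus 0.2 s in the first.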
## Architecture
- Single `app.py` (~2000 lines) with inline CSS/JS/HTML
- Gradio 6 + FastAPI: iframe injection via `<img onerror>` trick (Gradio 6 strips `<script>` tags)
- Custom routes registered before `gr.mount_gradio_app()` so FastAPI endpoints take priority
- Voxtral audio capture via AudioWorklet at 16kHz with chunked buffer (no O(n^2) copy)
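The chunked-buffer idea (the actual capture code is JavaScript in an AudioWorklet) can be shown with a Python sketch: appending to a list of chunks and concatenating once is O(n) overall, whereas re-copying the full buffer on every append is O(n^2). The class name and sizes here are illustrative.

```python
class ChunkedBuffer:
    """Accumulate fixed-size audio frames and concatenate once at
    flush, avoiding a full-buffer copy on every append."""
    def __init__(self):
        self.chunks = []
        self.length = 0

    def append(self, samples):
        self.chunks.append(samples)  # O(1) amortized; no re-copy
        self.length += len(samples)

    def flush(self):
        # Single O(n) pass over all buffered samples.
        out = [s for chunk in self.chunks for s in chunk]
        self.chunks, self.length = [], 0
        return out

buf = ChunkedBuffer()
for _ in range(3):
    buf.append([0.0] * 128)  # 128-sample worklet frames at 16 kHz
audio = buf.flush()
print(len(audio))  # 384
```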
## Limitations
- WhisperLiveKit's large-v3-turbo runs slower than real time on 2 vCPUs (VAC skips silence to help it catch up; a `ready_to_stop` wait prevents truncation)
- Voxtral browser diarization is noisy (ONNX segmentation only, no embedding clustering)
- Dual-track routing gives YOU vs SCREEN, not individual speakers within screen audio
- WebM recordings from MediaRecorder lack duration metadata