title: PrecisionVoice
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false

PrecisionVoice - STT & Speaker Diarization

A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.

🎙️ Speech-to-Text using erax-ai/EraX-WoW-Turbo-V1.1-CT2 (8x faster, 8 Vietnamese dialects)
👥 Speaker Diarization using pyannote/speaker-diarization-3.1
🧼 Speech Enhancement using SpeechBrain SepFormer DNS4 (noise + reverb removal)
🔇 Voice Activity Detection using Silero VAD v5 (prevents hallucination)
🎤 Vocal Isolation using MDX-Net (UVR-MDX-NET-Voc_FT)
🔄 Automatic speaker-transcript alignment
📥 Download results in TXT or SRT format
🐳 Docker-ready with persistent model caching and GPU support
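
As an illustration of the SRT export, here is a minimal sketch of how speaker-labelled segments could be rendered as SRT cues. The function names and the `(start, end, speaker, text)` tuple layout are assumptions made for this example, not the application's actual code; prefixing each cue with the speaker label is likewise an assumed convention.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, speaker, text) tuples as SRT text.

    Each cue is: index line, time range line, text line, blank separator.
    The tuple layout is hypothetical and chosen only for this sketch.
    """
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # Joining with "\n" leaves one blank line between consecutive cues.
    return "\n".join(blocks)
```

Calling `to_srt([(0.0, 1.25, "SPEAKER_00", "Xin chào")])` yields a single cue numbered 1 that spans `00:00:00,000 --> 00:00:01,250`.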

Clone and configure:

cp .env.example .env
# Edit .env and add your HuggingFace token

The system processes audio through a multi-stage pipeline, with each stage aimed at improving transcription accuracy:

Speech Enhancement: Background noise and reverb are removed using SpeechBrain SepFormer (DNS4 Challenge winner).
Vocal Isolation: Vocals are separated from background music using MDX-Net.
VAD Filtering: Silence is removed using Silero VAD v5 to prevent ASR hallucination.
Refinement: Highpass filtering and EBU R128 loudness normalization.
Transcription: High-precision Vietnamese transcription using PhoWhisper.
Diarization: Segmenting audio by speaker using Pyannote 3.1.
Alignment: Merging transcripts with speaker segments + timestamp reconstruction.
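
The alignment stage can be sketched as follows. This is a minimal illustration that assumes a common heuristic: each transcript segment is given the speaker whose diarization turn overlaps it the longest. The `Segment` class and `assign_speakers` function are hypothetical names for this example; the application's actual alignment and timestamp reconstruction logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    label: str    # transcript text, or speaker ID for a diarization turn


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time interval shared by two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(transcript, turns, unknown="UNKNOWN"):
    """Pair each transcript segment with the speaker of the best-overlapping turn.

    Segments that overlap no turn at all are labelled with `unknown`.
    """
    result = []
    for seg in transcript:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg, turn)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.label, shared
        result.append((best_speaker, seg.label))
    return result


transcript = [Segment(0.0, 2.0, "xin chào"), Segment(2.0, 5.0, "hello")]
turns = [Segment(0.0, 2.5, "SPEAKER_00"), Segment(2.5, 6.0, "SPEAKER_01")]
aligned = assign_speakers(transcript, turns)
```

In this example the second segment overlaps `SPEAKER_00` for 0.5 s and `SPEAKER_01` for 2.5 s, so it is attributed to `SPEAKER_01`.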

| Variable | Default | Description |
| --- | --- | --- |
| `HF_TOKEN` | - | Required for Pyannote models |
| `ENABLE_SPEECH_ENHANCEMENT` | `True` | Toggle SpeechBrain speech enhancement |
| `ENHANCEMENT_MODEL` | `speechbrain/sepformer-dns4-16k-enhancement` | Model for speech enhancement |
| `ENABLE_SILERO_VAD` | `True` | Toggle Silero VAD for hallucination prevention |
| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
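
To show how these variables might be consumed, here is a minimal sketch that reads them from the environment with the defaults listed above. The helper names (`env_bool`, `load_settings`, `resolve_device`) and the set of strings accepted as true are assumptions for illustration; the application's real settings code is not shown in this README and may parse values differently.

```python
import os


def env_bool(name, default, env=None):
    """Read a boolean flag; the accepted truthy strings are an assumption."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Collect the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default; needed for Pyannote models
        "enable_speech_enhancement": env_bool("ENABLE_SPEECH_ENHANCEMENT", True, env),
        "enhancement_model": env.get(
            "ENHANCEMENT_MODEL", "speechbrain/sepformer-dns4-16k-enhancement"
        ),
        "enable_silero_vad": env_bool("ENABLE_SILERO_VAD", True, env),
        "enable_vocal_separation": env_bool("ENABLE_VOCAL_SEPARATION", True, env),
        "mdx_model": env.get("MDX_MODEL", "UVR-MDX-NET-Voc_FT"),
        "device": env.get("DEVICE", "auto"),
    }


def resolve_device(device, cuda_available):
    """Map `auto` to a concrete device; the caller supplies the CUDA check result."""
    if device == "auto":
        return "cuda" if cuda_available else "cpu"
    if device not in {"cuda", "cpu"}:
        raise ValueError(f"Unsupported DEVICE value: {device!r}")
    return device
```

In a PyTorch-based application, `cuda_available` would typically come from `torch.cuda.is_available()`; it is passed in as a plain boolean here so the sketch has no dependencies beyond the standard library.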

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

MIT