title: PrecisionVoice
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false

PrecisionVoice - STT & Speaker Diarization

A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.

🎙️ Speech-to-Text using kiendt/PhoWhisper-large-ct2 (optimized for Vietnamese)
👥 Speaker Diarization using pyannote/speaker-diarization-3.1
🧼 Advanced Denoising using Facebook's Denoiser (dns64)
🎤 Vocal Isolation using MDX-Net (UVR-MDX-NET-Voc_FT)
🔄 Automatic speaker-transcript alignment
📥 Download results in TXT or SRT format
🐳 Docker-ready with persistent model caching and GPU support
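To illustrate the SRT export listed above, here is a minimal sketch of how speaker-labelled segments could be rendered as SRT. The function names, the `(start, end, speaker, text)` tuple shape, and the `[SPEAKER] text` line layout are assumptions made for illustration; the application's own exporter may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, speaker, text) tuples as an SRT document.

    The tuple layout and the "[speaker] text" line are hypothetical choices.
    """
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"[{speaker}] {text}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "SPEAKER_00", "Xin chào")]))
```

SRT numbers cues from 1 and uses a comma before the milliseconds, which is why the timestamp is built by hand rather than with a generic time formatter.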

Clone and configure:

cp .env.example .env
# Edit .env and add your HuggingFace token

The system processes each recording through a six-stage pipeline:

Speech Enhancement: Background noise, hum, and interference are removed using Facebook's Denoiser (the dns64 model, a Demucs-based network that operates on the raw waveform).
Vocal Isolation: Vocals are separated from any remaining background music or non-speech sounds using MDX-Net.
Refinement: Subtle highpass filtering and EBU R128 loudness normalization for consistent volume.
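Transcription: Vietnamese speech is transcribed using PhoWhisper (kiendt/PhoWhisper-large-ct2) through faster-whisper.
Diarization: The audio is segmented by speaker using pyannote/speaker-diarization-3.1.
Alignment: Transcript segments are merged with the speaker segments so that each line carries a speaker label.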
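The final alignment stage can be sketched as follows. This is one common approach, assuming each transcript segment is assigned to the speaker whose diarization turns overlap it the most; the source does not state which rule the application uses, and the function names and data shapes here are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker that overlaps it most.

    transcript: list of dicts with "start", "end", "text" (assumed shape).
    turns: list of (start, end, speaker) tuples from diarization (assumed shape).
    """
    aligned = []
    for seg in transcript:
        totals = {}
        for t_start, t_end, speaker in turns:
            dur = overlap(seg["start"], seg["end"], t_start, t_end)
            if dur > 0:
                # Sum overlap per speaker, since one speaker may have several turns.
                totals[speaker] = totals.get(speaker, 0.0) + dur
        speaker = max(totals, key=totals.get) if totals else unknown
        aligned.append({**seg, "speaker": speaker})
    return aligned


if __name__ == "__main__":
    transcript = [
        {"start": 0.0, "end": 2.0, "text": "a"},
        {"start": 2.0, "end": 5.0, "text": "b"},
    ]
    turns = [(0.0, 2.5, "SPEAKER_00"), (2.5, 6.0, "SPEAKER_01")]
    for row in assign_speakers(transcript, turns):
        print(row["speaker"], row["text"])
```

Segments that overlap no diarization turn fall back to a placeholder label instead of being dropped, so the transcript text is always preserved.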

| Variable | Default | Description |
| --- | --- | --- |
| `HF_TOKEN` | - | Required for Pyannote models |
| `ENABLE_DENOISER` | `True` | Toggle Facebook speech enhancement |
| `DENOISER_MODEL` | `dns64` | Model for denoising |
| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
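As a rough sketch of how these variables and their defaults could be read at startup, the snippet below uses only the standard library. The helper names, the returned dictionary keys, and the set of accepted boolean spellings are assumptions; the application may load its settings differently (for example through a settings library).

```python
import os


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    device = env.get("DEVICE", "auto").lower()
    if device not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"DEVICE must be cuda, cpu, or auto, got {device!r}")
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default; needed for pyannote models
        "enable_denoiser": parse_bool(env.get("ENABLE_DENOISER", "True")),
        "denoiser_model": env.get("DENOISER_MODEL", "dns64"),
        "enable_vocal_separation": parse_bool(env.get("ENABLE_VOCAL_SEPARATION", "True")),
        "mdx_model": env.get("MDX_MODEL", "UVR-MDX-NET-Voc_FT"),
        "device": device,
    }


if __name__ == "__main__":
    settings = load_settings()
    if settings["hf_token"] is None:
        print("HF_TOKEN is not set; diarization models cannot be downloaded.")
    print(settings["device"])
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.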

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

MIT