metadata
title: PrecisionVoice
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
PrecisionVoice - STT & Speaker Diarization
A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.
Features
- Speech-to-Text using `erax-ai/EraX-WoW-Turbo-V1.1-CT2` (8x faster, 8 Vietnamese dialects)
- Speaker Diarization using `pyannote/speaker-diarization-3.1`
- Speech Enhancement using `SpeechBrain SepFormer DNS4` (noise + reverb removal)
- Voice Activity Detection using `Silero VAD v5` (prevents hallucination)
- Vocal Isolation using `MDX-Net` (UVR-MDX-NET-Voc_FT)
- Automatic speaker-transcript alignment
- Download results in TXT or SRT format
- Docker-ready with persistent model caching and GPU support
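To make the SRT download concrete, here is a minimal sketch of how speaker-labelled segments could be rendered as SRT cues. It is an illustration under assumptions, not the application's actual exporter: the `Segment` tuple layout and the `format_timestamp` and `to_srt` helpers are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical segment layout: (start_seconds, end_seconds, speaker_label, text)
Segment = Tuple[float, float, str, str]


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT cues, prefixing each cue with its speaker."""
    cues = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

A TXT export could reuse the same segments and simply drop the cue numbers and timestamp lines.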
Quick Start
Prerequisites
- Docker and Docker Compose
- (Optional) NVIDIA GPU with CUDA support
- HuggingFace account with access to pyannote models
Setup
Clone and configure:
cp .env.example .env # Edit .env and add your HuggingFace token
Build and run:
docker compose up --build
Audio Processing Pipeline
Audio passes through the following stages, in order, before results are returned:
- Speech Enhancement: Background noise and reverb are removed using `SpeechBrain SepFormer` (DNS4 Challenge winner).
- Vocal Isolation: Vocals are separated from background music using `MDX-Net`.
- VAD Filtering: Silence is removed using `Silero VAD v5` to prevent ASR hallucination.
- Refinement: Highpass filtering and EBU R128 loudness normalization.
- Transcription: High-precision Vietnamese transcription using `PhoWhisper`.
- Diarization: Segmenting audio by speaker using `Pyannote 3.1`.
- Alignment: Merging transcripts with speaker segments + timestamp reconstruction.
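The alignment stage can be pictured as assigning each transcript segment to the speaker turn it overlaps most. The sketch below shows that common maximum-overlap strategy. It is an assumption about how such a merge could work, not the project's confirmed algorithm, and it does not cover timestamp reconstruction. `Turn`, `TranscriptSegment`, and `assign_speakers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One diarization turn: a time span attributed to a speaker."""
    start: float
    end: float
    speaker: str


@dataclass
class TranscriptSegment:
    """One ASR segment: a time span with recognized text."""
    start: float
    end: float
    text: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[TranscriptSegment],
    turns: List[Turn],
    unknown: str = "UNKNOWN",
) -> List[Tuple[float, float, str, str]]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((seg.start, seg.end, best_speaker, seg.text))
    return labelled
```

Segments that overlap no turn at all keep the `unknown` label, which makes gaps in the diarization output visible instead of silently attributing them to a neighbouring speaker.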
Configuration
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | - | Required for Pyannote models |
| `ENABLE_SPEECH_ENHANCEMENT` | `True` | Toggle SpeechBrain speech enhancement |
| `ENHANCEMENT_MODEL` | `speechbrain/sepformer-dns4-16k-enhancement` | Model for speech enhancement |
| `ENABLE_SILERO_VAD` | `True` | Toggle Silero VAD for hallucination prevention |
| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
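As a rough illustration of how these variables might be read at startup, the sketch below loads them from the environment with the defaults listed in the table. The `Settings` class, `load_settings` function, and the accepted boolean spellings are assumptions for the example. The application's real settings loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    enable_speech_enhancement: bool
    enhancement_model: str
    enable_silero_vad: bool
    enable_vocal_separation: bool
    mdx_model: str
    device: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    device = env.get("DEVICE", "auto").strip().lower()
    if device not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"DEVICE must be cuda, cpu, or auto, got {device!r}")
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # required for Pyannote models
        enable_speech_enhancement=_as_bool(env.get("ENABLE_SPEECH_ENHANCEMENT", "True")),
        enhancement_model=env.get(
            "ENHANCEMENT_MODEL", "speechbrain/sepformer-dns4-16k-enhancement"
        ),
        enable_silero_vad=_as_bool(env.get("ENABLE_SILERO_VAD", "True")),
        enable_vocal_separation=_as_bool(env.get("ENABLE_VOCAL_SEPARATION", "True")),
        mdx_model=env.get("MDX_MODEL", "UVR-MDX-NET-Voc_FT"),
        device=device,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"DEVICE": "cpu"})`.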
Development
Local Setup (without Docker)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI |
| `/api/transcribe` | POST | Upload and transcribe audio |
| `/api/download/{filename}` | GET | Download result files |
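For scripted use, an upload to `/api/transcribe` could be prepared as a multipart form request along the following lines. This sketch relies on assumptions not stated above: the form field name `file` is a guess, the helper `build_transcribe_request` is hypothetical, and the response format is not documented here, so the example only builds the request.

```python
import uuid
from urllib.request import Request


def build_transcribe_request(
    base_url: str,
    filename: str,
    audio_bytes: bytes,
    field_name: str = "file",  # assumed form field name; check the server code
) -> Request:
    """Build a multipart/form-data POST for the /api/transcribe endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            (
                f'Content-Disposition: form-data; name="{field_name}"; '
                f'filename="{filename}"\r\n'
            ).encode(),
            b"Content-Type: application/octet-stream\r\n\r\n",
            audio_bytes,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    return Request(
        url=f"{base_url.rstrip('/')}/api/transcribe",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With the local setup running, the request could be sent with `urllib.request.urlopen(build_transcribe_request("http://localhost:8000", "call.wav", data))`, where port 8000 is uvicorn's default and may not match a Docker deployment.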
Supported Audio Formats
- MP3
- WAV
- M4A
- OGG
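A client or upload handler could pre-check file names against this list before sending audio. The snippet is a minimal sketch with hypothetical names (`SUPPORTED_EXTENSIONS`, `is_supported_audio`). It only inspects the extension and does not verify the actual file contents.

```python
from pathlib import Path

# Extensions corresponding to the formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file name ends in one of the supported extensions."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```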
License
MIT