metadata
title: PrecisionVoice
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
PrecisionVoice - STT & Speaker Diarization
A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.
Features
- 🎙️ Speech-to-Text using `kiendt/PhoWhisper-large-ct2` (optimized for Vietnamese)
- 👥 Speaker Diarization using `pyannote/speaker-diarization-3.1`
- 🧼 Advanced Denoising using Facebook's `Denoiser` (dns64)
- 🎤 Vocal Isolation using `MDX-Net` (UVR-MDX-NET-Voc_FT)
- Automatic speaker-transcript alignment
- 📥 Download results in TXT or SRT format
- 🐳 Docker-ready with persistent model caching and GPU support
Quick Start
Prerequisites
- Docker and Docker Compose
- (Optional) NVIDIA GPU with CUDA support
- HuggingFace account with access to pyannote models
Setup
Clone and configure:

    cp .env.example .env
    # Edit .env and add your HuggingFace token

Build and run:

    docker compose up --build
Audio Processing Pipeline
The system runs audio through a multi-stage pipeline intended to improve transcription accuracy:
- Speech Enhancement: Background noise, hums, and interference are removed using Facebook's `Denoiser` (a deep-learning, U-Net-style model that operates on the raw waveform).
- Vocal Isolation: Vocals are separated from any remaining background music or non-speech sounds using `MDX-Net`.
- Refinement: A gentle highpass filter and EBU R128 loudness normalization are applied for consistent volume.
- Transcription: Vietnamese speech is transcribed using `PhoWhisper`.
- Diarization: The audio is segmented by speaker.
- Alignment: Transcript segments are merged with the speaker segments.
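
The alignment step and the SRT export can be illustrated with a minimal sketch. It assumes each transcript segment is assigned to the speaker whose turns overlap it the most, which is one common approach and not necessarily what this project does. The `Segment` and `Turn` classes, the function names, and the `SPEAKER: text` subtitle layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Turn:
    start: float  # seconds
    end: float    # seconds
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Pair each transcript segment with the speaker whose turns overlap it most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg))
    return labeled


def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(labeled):
    """Render labeled segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, (speaker, seg) in enumerate(labeled, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{speaker}: {seg.text}\n"
        )
    return "\n".join(blocks)


# Illustrative data only: transcript segments and diarization turns.
segments = [
    Segment(0.0, 2.5, "Xin chào."),
    Segment(2.5, 6.0, "Chào bạn, bạn khỏe không?"),
    Segment(9.0, 10.0, "Vâng."),
]
turns = [
    Turn(0.0, 2.8, "SPEAKER_00"),
    Turn(2.8, 6.2, "SPEAKER_01"),
]
labeled = assign_speakers(segments, turns)
print(to_srt(labeled))
```

Summing overlap per speaker, rather than taking the single longest turn, keeps the result stable when diarization splits one speaker into several short turns. Segments with no overlapping turn fall back to a placeholder label.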
Configuration
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | - | Required for Pyannote models |
| `ENABLE_DENOISER` | `True` | Toggle Facebook speech enhancement |
| `DENOISER_MODEL` | `dns64` | Model for denoising |
| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
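
As a rough sketch of how these variables could be read at startup, the snippet below parses them from a mapping using only the standard library. The helper names, the accepted boolean spellings, and the `cuda_available` flag (standing in for a real GPU check) are assumptions for illustration; the application's actual settings code may differ.

```python
import os

# Spellings treated as "true" here are an assumption, not the project's rule.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name, default, env=os.environ):
    """Read a boolean flag, falling back to the default when the variable is unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def resolve_device(requested, cuda_available):
    """Map DEVICE (cuda, cpu, or auto) to a concrete device name."""
    requested = requested.strip().lower()
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested not in {"cuda", "cpu"}:
        raise ValueError(f"DEVICE must be cuda, cpu, or auto, got {requested!r}")
    return requested


def load_settings(env=os.environ, cuda_available=False):
    """Collect the documented variables, applying the defaults from the table."""
    return {
        "hf_token": env.get("HF_TOKEN"),
        "enable_denoiser": env_bool("ENABLE_DENOISER", True, env),
        "denoiser_model": env.get("DENOISER_MODEL", "dns64"),
        "enable_vocal_separation": env_bool("ENABLE_VOCAL_SEPARATION", True, env),
        "mdx_model": env.get("MDX_MODEL", "UVR-MDX-NET-Voc_FT"),
        "device": resolve_device(env.get("DEVICE", "auto"), cuda_available),
    }


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"HF_TOKEN": "hf_example", "ENABLE_DENOISER": "false"})
print(settings)
```

Passing the environment in as a mapping keeps the parsing logic easy to test without touching the real process environment.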
Development
Local Setup (without Docker)

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    uvicorn app.main:app --reload

API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI |
| `/api/transcribe` | POST | Upload and transcribe audio |
| `/api/download/{filename}` | GET | Download result files |
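
A client for these endpoints might look like the following sketch, which builds (but does not send) a multipart upload using only the standard library. The form field name `file` is an assumption, since the README does not document the request schema, and the base URL assumes uvicorn's default port 8000 from the local setup command. The response format is not documented here, so the sketch does not parse one.

```python
import mimetypes
import urllib.request
import uuid
from urllib.parse import quote


def build_transcribe_request(base_url, filename, audio_bytes, field_name="file"):
    """Build a multipart/form-data POST for /api/transcribe without sending it.

    field_name is hypothetical; check the server code for the real field name.
    """
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/transcribe",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def download_url(base_url, filename):
    """Build the URL for GET /api/download/{filename}, escaping the file name."""
    return f"{base_url.rstrip('/')}/api/download/{quote(filename)}"


# Placeholder bytes stand in for real audio content.
request = build_transcribe_request("http://localhost:8000", "meeting.wav", b"RIFF....")
print(request.get_method(), request.full_url)
print(download_url("http://localhost:8000", "meeting.srt"))
# To send the upload for real: urllib.request.urlopen(request)
```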
Supported Audio Formats
- MP3
- WAV
- M4A
- OGG
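
A simple way to pre-check uploads against this list is to compare file extensions, as in the sketch below. The function name is hypothetical, and checking the extension alone is an assumption: the server may also inspect the file contents.

```python
from pathlib import Path

# Extensions matching the supported formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def is_supported_audio(filename):
    """Return True when the file extension matches a supported format (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["interview.MP3", "call.ogg", "notes.flac"]:
    print(name, is_supported_audio(name))
```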
License
MIT