---
title: PrecisionVoice
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app/main.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# PrecisionVoice - STT & Speaker Diarization

A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.

## Features

- 🎙️ Speech-to-Text using `kiendt/PhoWhisper-large-ct2` (optimized for Vietnamese)
- 👥 Speaker Diarization using `pyannote/speaker-diarization-3.1`
- 🧼 Advanced Denoising using Facebook's `Denoiser` (dns64)
- 🎤 Vocal Isolation using `MDX-Net` (UVR-MDX-NET-Voc_FT)
- 🔄 Automatic speaker-transcript alignment
- 📥 Download results in TXT or SRT format
- 🐳 Docker-ready with persistent model caching and GPU support

## Quick Start

### Prerequisites

1. Docker and Docker Compose
2. (Optional) NVIDIA GPU with CUDA support
3. HuggingFace account with access to pyannote models

### Setup

1. Clone and configure:
   ```bash
   cp .env.example .env
   # Edit .env and add your HuggingFace token
   ```

2. Build and run:
   ```bash
   docker compose up --build
   ```

3. Open http://localhost:8000

## Audio Processing Pipeline

The system processes each recording through a six-stage pipeline:

1. **Speech Enhancement**: Background noise, hum, and interference are removed using Facebook's `Denoiser` (a Demucs-based model that operates directly on the waveform).
2. **Vocal Isolation**: Vocals are separated from any remaining background music or non-speech sounds using `MDX-Net`.
3. **Refinement**: A gentle high-pass filter and EBU R128 loudness normalization are applied for consistent volume.
4. **Transcription**: Vietnamese speech is transcribed using `PhoWhisper`.
5. **Diarization**: The audio is segmented by speaker using `pyannote/speaker-diarization-3.1`.
6. **Alignment**: Transcript segments are merged with speaker segments so that each line of text carries a speaker label.
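
The alignment step can be sketched as follows. This is a minimal illustration that assigns each transcript segment to the speaker whose diarization turns overlap it the most; the `Turn` class, the `assign_speakers` function, and the segment keys are hypothetical names for this example, and the application may use a different strategy.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization turn: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript: list[dict], turns: list[Turn]) -> list[dict]:
    """Label each transcript segment with the speaker that overlaps it most.

    Segments are assumed to be dicts with 'start', 'end' and 'text' keys.
    Segments that overlap no turn are labelled 'UNKNOWN'.
    """
    labelled = []
    for seg in transcript:
        totals: dict[str, float] = {}
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn.start, turn.end)
            if shared > 0:
                # Sum overlap per speaker, since a speaker may have several turns.
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labelled.append({**seg, "speaker": speaker})
    return labelled
```

Summing overlap per speaker, rather than picking the single longest turn, keeps the label stable when a speaker's turn is split into several short pieces.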

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_TOKEN` | - | Required for Pyannote models |
| `ENABLE_DENOISER` | `True` | Toggle Facebook speech enhancement |
| `DENOISER_MODEL` | `dns64` | Model for denoising |
| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
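
As a rough sketch of how these variables might be read at startup, the snippet below loads them from the environment with the defaults from the table. The `Settings` class, the helper names, and the accepted boolean spellings are assumptions for illustration; the application's real configuration code may differ.

```python
import os
from dataclasses import dataclass


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean variable; '1', 'true', 'yes', 'on' count as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    hf_token: str
    enable_denoiser: bool
    denoiser_model: str
    enable_vocal_separation: bool
    mdx_model: str
    device: str


def load_settings(cuda_available: bool = False) -> Settings:
    """Build Settings from environment variables.

    `cuda_available` is passed in so the sketch has no GPU dependency; a real
    app could supply torch.cuda.is_available() here.
    """
    device = os.environ.get("DEVICE", "auto").strip().lower()
    if device == "auto":
        device = "cuda" if cuda_available else "cpu"
    if device not in {"cuda", "cpu"}:
        raise ValueError(f"DEVICE must be cuda, cpu, or auto, got {device!r}")
    return Settings(
        hf_token=os.environ.get("HF_TOKEN", ""),
        enable_denoiser=env_bool("ENABLE_DENOISER", True),
        denoiser_model=os.environ.get("DENOISER_MODEL", "dns64"),
        enable_vocal_separation=env_bool("ENABLE_VOCAL_SEPARATION", True),
        mdx_model=os.environ.get("MDX_MODEL", "UVR-MDX-NET-Voc_FT"),
        device=device,
    )
```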

## Development

### Local Setup (without Docker)

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload
```

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Web UI |
| `/api/transcribe` | POST | Upload and transcribe audio |
| `/api/download/{filename}` | GET | Download result files |
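
A client call to `/api/transcribe` might look like the sketch below, which uses only the standard library. The form field name `file` and the JSON response are assumptions; check the route definition in `app/main.py` for the actual request and response shape.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST an audio file to /api/transcribe and decode the reply as JSON."""
    audio = Path(path)
    # "file" is an assumed form field name, not confirmed by the README.
    body, content_type = build_multipart("file", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running locally, `transcribe("meeting.wav")` would upload the file and return the parsed response.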

## Supported Audio Formats

- MP3
- WAV
- M4A
- OGG
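
A simple upload check based on this list could be sketched as below. It only inspects the file extension, and the names `SUPPORTED_EXTENSIONS` and `is_supported_audio` are hypothetical; the application may validate uploads differently, for example by inspecting the file contents.

```python
from pathlib import Path

# Extensions corresponding to the formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the filename ends in a supported audio extension."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```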

## License

MIT