---
title: Audio Processing Pipeline for TTS
emoji: 🎡
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Audio Processing Pipeline for TTS

An audio processing pipeline that cleans raw recordings for TTS dataset creation.

## Features

- **Demucs Vocal Separation**: Isolate vocals from music and background audio with the htdemucs model
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to a target level in dBFS
- **Resampling**: Convert to the target sample rate (24kHz for TTS)
- **Mono Conversion**: Automatic stereo-to-mono conversion
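
The loudness and mono steps can be illustrated with a minimal pure-Python sketch. It assumes float samples in the range [-1.0, 1.0] and an RMS-based dBFS measurement; this README does not say whether the app measures RMS or peak level, so treat that choice, and the function names `to_mono`, `rms_dbfs`, and `normalize_to_dbfs`, as assumptions for illustration only.

```python
import math


def to_mono(frames):
    """Average the channels of each frame (a frame is a tuple of channel samples)."""
    return [sum(frame) / len(frame) for frame in frames]


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in [-1.0, 1.0]; silence maps to -inf."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs (assumed RMS-based)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # silence has no level to adjust
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clip so that a large gain cannot push samples past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

Under this definition a signal whose samples are all ±1.0 measures 0 dBFS, so a -20 dBFS target leaves headroom for peaks.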

## Usage

1. Upload your audio file (WAV format, 44.1kHz recommended)
2. Configure options:
   - Target sample rate (24kHz for TTS)
   - Target loudness (-20 dBFS recommended)
   - Enable/disable Demucs vocal separation
   - Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result
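
The options in step 2 can be pictured as a small settings object. The sketch below is hypothetical: `ProcessingOptions` and its field names are not taken from `app.py`, and the defaults for the two toggles are assumptions. It only shows how the values relate, including the resampling ratio from a 44.1kHz input to a 24kHz output.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ProcessingOptions:
    """Hypothetical container mirroring the UI options; not the app's actual API."""
    target_sample_rate: int = 24_000   # Hz, the value suggested for TTS
    target_dbfs: float = -20.0         # recommended target loudness
    use_demucs: bool = True            # assumed default
    use_denoise: bool = True           # assumed default

    def __post_init__(self):
        if self.target_sample_rate <= 0:
            raise ValueError("target_sample_rate must be positive")
        if self.target_dbfs > 0:
            raise ValueError("target_dbfs must not exceed 0 dBFS (full scale)")

    def resample_ratio(self, input_sample_rate: int) -> Fraction:
        """Exact output/input rate ratio, reduced to lowest terms."""
        return Fraction(self.target_sample_rate, input_sample_rate)


options = ProcessingOptions()
print(options.resample_ratio(44_100))  # 80/147
```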

## Pipeline

```
Input WAV
    ↓
Demucs Vocal Separation (optional)
    ↓
Noise Reduction (optional)
    ↓
Loudness Normalization
    ↓
Resample to Target SR
    ↓
Convert to Mono
    ↓
Output Clean WAV
```
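
The stage order in the diagram can be sketched as a simple planner. The stage names and the `run_pipeline` helper are illustrative assumptions, not the functions used in `app.py`; the callables passed in stand for whatever implements each step.

```python
def plan_stages(use_demucs=True, use_denoise=True):
    """Return stage names in the order shown in the diagram."""
    plan = []
    if use_demucs:
        plan.append("separate_vocals")   # optional
    if use_denoise:
        plan.append("reduce_noise")      # optional
    # The remaining stages always run, in this order.
    plan += ["normalize_loudness", "resample", "to_mono"]
    return plan


def run_pipeline(audio, stages, use_demucs=True, use_denoise=True):
    """Apply each planned stage in turn; `stages` maps stage names to callables."""
    for name in plan_stages(use_demucs, use_denoise):
        audio = stages[name](audio)
    return audio
```

Disabling both optional stages leaves only the three mandatory steps, so `plan_stages(False, False)` returns `["normalize_loudness", "resample", "to_mono"]`.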

## Technical Details

- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Optional; speeds up processing when available
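
For readers who want to reproduce the two optional stages outside the app, here is a rough sketch. It assumes Demucs is driven through its command-line interface and that `noisereduce` is installed; the app itself may call Demucs through its Python API instead. The helper names are hypothetical, and the output location follows the Demucs CLI convention of `<out_dir>/<model>/<track>/vocals.wav`.

```python
import subprocess
import sys
from pathlib import Path


def build_demucs_command(input_wav, out_dir="separated", device="cpu"):
    """Build a Demucs CLI call that splits the track into vocals / no_vocals."""
    return [
        sys.executable, "-m", "demucs",
        "-n", "htdemucs",           # model named in this README
        "--two-stems", "vocals",    # keep only the vocals vs. everything else
        "-d", device,               # "cpu" or "cuda"
        "-o", str(out_dir),
        str(input_wav),
    ]


def expected_vocals_path(input_wav, out_dir="separated"):
    """Where the Demucs CLI is expected to write the vocal stem."""
    return Path(out_dir) / "htdemucs" / Path(input_wav).stem / "vocals.wav"


def separate_vocals(input_wav, out_dir="separated", device="cpu"):
    """Run Demucs and return the path of the vocal stem."""
    subprocess.run(build_demucs_command(input_wav, out_dir, device), check=True)
    return expected_vocals_path(input_wav, out_dir)


def denoise(samples, sample_rate):
    """Spectral-gating noise reduction on a NumPy array of samples."""
    import noisereduce as nr  # imported lazily so the helpers above work without it
    return nr.reduce_noise(y=samples, sr=sample_rate)
```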

## Next Steps

After processing your audio:

1. **Diarization**: Use pyannote.audio to separate speakers
2. **Transcription**: Use Whisper for text generation
3. **Dataset**: Package for TTS model training

## Tips

- Use 44.1kHz WAV input for best quality
- Enable Demucs for podcasts with music or background audio
- Enable noise reduction for noisy recordings
- 24kHz output is ideal for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio in CPU mode
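
The CPU-mode figure in the last tip can be turned into a rough estimate for longer recordings. This sketch assumes processing time scales linearly with audio length, which the README does not guarantee; `estimate_cpu_seconds` is a hypothetical helper.

```python
def estimate_cpu_seconds(audio_minutes):
    """Return a (low, high) processing-time estimate in seconds for CPU mode."""
    blocks = audio_minutes / 5.0           # number of 5-minute blocks of audio
    return (30.0 * blocks, 60.0 * blocks)  # 30-60 seconds per block


low, high = estimate_cpu_seconds(60)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes")  # 6-12 minutes
```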

## License

MIT

## Credits

- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face