---
title: Audio Processing Pipeline for TTS
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# Audio Processing Pipeline for TTS
Complete audio processing pipeline for TTS dataset creation.
## Features
- **Demucs Vocal Separation**: Isolate vocals from music and background audio with the Demucs source-separation model
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to target dBFS
- **Resampling**: Convert to target sample rate (24kHz for TTS)
- **Mono Conversion**: Automatic stereo to mono
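
As an illustration of the loudness normalization step, here is a minimal sketch in plain Python. It assumes the target is an RMS level in dBFS measured on float samples where full scale is 1.0; the README does not say whether the app measures RMS or peak level, so treat that choice, and the function names `rms_dbfs` and `normalize_to_dbfs`, as assumptions rather than the app's actual code.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (full scale = 1.0) in dBFS."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs (assumed RMS-based)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # silence: there is nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clamp to [-1.0, 1.0] so the gain cannot push peaks past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

Clamping is the simplest way to guard against overshoot; a real pipeline might prefer a limiter, since hard clipping distorts loud peaks.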
## Usage
1. Upload your audio file (WAV format, 44.1kHz recommended)
2. Configure options:
   - Target sample rate (24kHz for TTS)
   - Target loudness (-20 dBFS recommended)
   - Enable/disable Demucs vocal separation
   - Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result
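
Before uploading, it can help to confirm that a file really is WAV at the recommended 44.1kHz. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV only; the helper name `describe_wav` and the example path are hypothetical and not part of this Space.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


# Example (hypothetical path):
# info = describe_wav("episode.wav")
# if info["sample_rate"] != 44100:
#     print("Input is not 44.1kHz; the pipeline will still resample it.")
```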
## Pipeline
```
Input WAV
↓
Demucs Vocal Separation (optional)
↓
Noise Reduction (optional)
↓
Loudness Normalization
↓
Resample to Target SR
↓
Convert to Mono
↓
Output Clean WAV
```
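
The stage order in the diagram can be expressed as a small sketch in which the two optional stages are switched on or off. The stage names and the `build_stages` and `run_pipeline` helpers are hypothetical labels chosen for illustration; the real `app.py` may organize this differently.

```python
def build_stages(use_demucs=True, use_denoise=True):
    """Return stage names in the order shown in the diagram."""
    stages = []
    if use_demucs:
        stages.append("demucs_vocal_separation")  # optional
    if use_denoise:
        stages.append("noise_reduction")  # optional
    # These three stages always run, in this order.
    stages += ["loudness_normalization", "resample", "mono_conversion"]
    return stages


def run_pipeline(audio, stage_functions, stages):
    """Apply each stage function to the audio in sequence."""
    for name in stages:
        audio = stage_functions[name](audio)
    return audio
```

Keeping the order in one list makes it easy to see that disabling an optional stage does not change the sequence of the remaining ones.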
## Technical Details
- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Supported for faster processing
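
To make the mono output format concrete, here is a rough sketch of resampling and channel averaging in plain Python. It is not how this Space does it: linear interpolation is used only to keep the example dependency-free, and it skips the low-pass filtering that a proper resampler applies before downsampling. The function names are assumptions.

```python
def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, source_sr, target_sr):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if not samples or source_sr == target_sr:
        return list(samples)
    n_out = max(1, round(len(samples) * target_sr / source_sr))
    step = source_sr / target_sr  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        if left >= len(samples) - 1:
            out.append(samples[-1])  # past the last interval: hold the final value
            continue
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[left + 1] * frac)
    return out
```

One second of 44.1kHz audio (44,100 samples) becomes 24,000 samples at 24kHz, a ratio of 80/147.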
## Next Steps
After processing your audio:
1. **Diarization**: Use Pyannote to separate speakers
2. **Transcription**: Use Whisper for text generation
3. **Dataset**: Package for TTS model training
## Tips
- Use 44.1kHz WAV input for best quality
- Enable Demucs for podcasts with music or background sound
- Enable denoise for noisy recordings
- 24kHz output is ideal for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio in CPU mode
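
The CPU timing figure above works out to roughly 0.1x to 0.2x real time. As a back-of-the-envelope sketch, assuming processing time scales linearly with audio length (which the README does not state), a helper with the hypothetical name `estimate_cpu_seconds` could look like this:

```python
def estimate_cpu_seconds(audio_seconds, low_per_5min=30.0, high_per_5min=60.0):
    """Return a (low, high) processing-time estimate in seconds for CPU mode."""
    blocks = audio_seconds / 300.0  # number of 5-minute blocks
    return (blocks * low_per_5min, blocks * high_per_5min)


# A one-hour recording: between 6 and 12 minutes under the linear assumption.
print(estimate_cpu_seconds(3600))  # (360.0, 720.0)
```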
## License
MIT
## Credits
- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face