---
title: Audio Processing Pipeline for TTS
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Audio Processing Pipeline for TTS

Complete audio processing pipeline for TTS dataset creation.

## Features

- **Demucs Vocal Separation**: Extract clean vocals with the htdemucs source-separation model
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to a target level in dBFS
- **Resampling**: Convert to the target sample rate (24 kHz for TTS)
- **Mono Conversion**: Automatic stereo-to-mono downmix
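
The loudness normalization and mono conversion features listed above can be illustrated with a minimal, dependency-free sketch. It assumes float samples in the range [-1.0, 1.0] and an RMS-based dBFS measurement; the app may measure loudness differently, and the function names here (`to_mono`, `rms_dbfs`, `normalize_loudness`) are hypothetical, not taken from `app.py`.

```python
import math


def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mid, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def rms_dbfs(samples):
    """Return the RMS level in dBFS for float samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_loudness(samples, target_dbfs=-20.0):
    """Scale samples so that their RMS level matches target_dbfs."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # silence: there is nothing to scale
    # Convert the dB difference into a linear gain factor.
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clip in case the gain pushes peaks past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

For example, `normalize_loudness(to_mono(stereo_frames), target_dbfs=-20.0)` would downmix and then level a clip held in memory as a list of `(left, right)` tuples.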

## Usage

1. Upload your audio file (WAV format, 44.1 kHz recommended)
2. Configure options:
   - Target sample rate (24 kHz for TTS)
   - Target loudness (-20 dBFS recommended)
   - Enable/disable Demucs vocal separation
   - Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result

## Pipeline

```
Input WAV
    ↓
Demucs Vocal Separation (optional)
    ↓
Noise Reduction (optional)
    ↓
Loudness Normalization
    ↓
Resample to Target SR
    ↓
Convert to Mono
    ↓
Output Clean WAV
```
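
The stage order in the diagram above, including the two optional stages, can be sketched as a small planning function. `PipelineOptions` and `plan_steps` are hypothetical names for illustration, and the defaults for the two toggles are assumptions; only the 24 kHz and -20 dBFS values come from this README.

```python
from dataclasses import dataclass


@dataclass
class PipelineOptions:
    target_sr: int = 24000       # target sample rate in Hz (24 kHz for TTS)
    target_dbfs: float = -20.0   # recommended target loudness
    use_demucs: bool = True      # assumed default, not confirmed by the app
    use_denoise: bool = True     # assumed default, not confirmed by the app


def plan_steps(opts):
    """Return the ordered list of stage labels that would run for these options."""
    steps = []
    if opts.use_demucs:
        steps.append("demucs_vocal_separation")
    if opts.use_denoise:
        steps.append("noise_reduction")
    # The remaining three stages always run, in the order shown in the diagram.
    steps.append(f"normalize_loudness({opts.target_dbfs} dBFS)")
    steps.append(f"resample({opts.target_sr} Hz)")
    steps.append("convert_to_mono")
    return steps
```

Calling `plan_steps(PipelineOptions(use_demucs=False))` would skip vocal separation and keep the other four stages in order.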

## Technical Details

- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Supported for faster processing
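
To reproduce the vocal separation step outside the app, one option is the Demucs command-line interface. The sketch below only builds the command and the expected output path; it assumes the `demucs` package is installed and that the CLI is an acceptable stand-in, since the app may call Demucs through its Python API instead. The helper names are hypothetical.

```python
import sys
from pathlib import Path


def demucs_command(input_wav, out_dir="separated", device="cpu"):
    """Build a Demucs CLI invocation that extracts vocals with htdemucs."""
    return [
        sys.executable, "-m", "demucs",
        "-n", "htdemucs",          # model named in Technical Details
        "--two-stems", "vocals",   # produce vocals plus a single residual stem
        "-d", device,              # "cpu" or "cuda"
        "-o", str(out_dir),
        str(input_wav),
    ]


def expected_vocals_path(input_wav, out_dir="separated"):
    """Demucs writes results to <out_dir>/<model>/<track name>/<stem>.wav."""
    return Path(out_dir) / "htdemucs" / Path(input_wav).stem / "vocals.wav"
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing `device="cuda"` selects the GPU when one is available.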

## Next Steps

After processing your audio:

1. **Diarization**: Use Pyannote to separate speakers
2. **Transcription**: Use Whisper to generate transcripts
3. **Dataset**: Package the audio and transcripts for TTS model training

## Tips

- Use 44.1 kHz WAV input for best quality
- Enable Demucs for podcasts with music or other background audio
- Enable noise reduction for noisy recordings
- 24 kHz output works well for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio (CPU mode)
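
The CPU timing figure above can be turned into a rough estimate for longer recordings. This sketch assumes processing time scales linearly with audio duration, which the README does not state; `estimate_cpu_seconds` is a hypothetical helper.

```python
def estimate_cpu_seconds(audio_seconds):
    """Return a (low, high) processing-time estimate in seconds for CPU mode."""
    blocks = audio_seconds / 300.0  # number of 5-minute blocks of audio
    # 30-60 seconds of processing per 5-minute block, per the tip above.
    return 30.0 * blocks, 60.0 * blocks
```

Under that assumption, a 20-minute recording (`estimate_cpu_seconds(1200)`) would take about 2 to 4 minutes on CPU.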

## License

MIT

## Credits

- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face