---
title: Audio Processing Pipeline for TTS
emoji: 🎡
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# Audio Processing Pipeline for TTS
Complete audio processing pipeline for TTS dataset creation.
## Features
- **Demucs Vocal Separation**: Extract clean vocals with the htdemucs hybrid transformer model
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to target dBFS
- **Resampling**: Convert to target sample rate (24kHz for TTS)
- **Mono Conversion**: Automatic stereo-to-mono downmixing
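The loudness-normalization step amounts to measuring the signal's RMS level in dBFS and applying a single gain. A minimal NumPy sketch, assuming float samples in [-1, 1] (the function name and RMS-based approach here are illustrative; the app's actual implementation may differ):

```python
import numpy as np

def normalize_loudness(samples: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale samples so their RMS level matches target_dbfs (0 dBFS = full scale 1.0)."""
    rms = np.sqrt(np.mean(samples ** 2))
    if rms == 0:
        return samples  # all-silence input: nothing to scale
    gain = 10 ** ((target_dbfs - 20 * np.log10(rms)) / 20)
    # clip in case the requested gain would push peaks past full scale
    return np.clip(samples * gain, -1.0, 1.0)
```

With the default -20 dBFS target, quiet recordings are boosted and hot recordings attenuated before the later resampling step.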
## Usage
1. Upload your audio file (WAV format, 44.1kHz recommended)
2. Configure options:
- Target sample rate (24kHz for TTS)
- Target loudness (-20 dBFS recommended)
- Enable/disable Demucs vocal separation
- Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result
## Pipeline
```
Input WAV
↓
Demucs Vocal Separation (optional)
↓
Noise Reduction (optional)
↓
Loudness Normalization
↓
Resample to Target SR
↓
Convert to Mono
↓
Output Clean WAV
```
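Two of the simpler stages above, the final resample and the mono downmix, can be sketched with NumPy. Linear interpolation is a stand-in here; a production pipeline should use a band-limited (windowed-sinc) resampler such as torchaudio's to avoid aliasing when downsampling:

```python
import numpy as np

def to_mono(x: np.ndarray) -> np.ndarray:
    """Downmix a (channels, samples) array to 1-D by averaging; pass mono through."""
    return x.mean(axis=0) if x.ndim == 2 else x

def resample(x: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample via linear interpolation (sketch only; real code should
    use a band-limited resampler when converting 44.1 kHz down to 24 kHz)."""
    n_out = int(round(x.size * sr_out / sr_in))
    t_out = np.arange(n_out) * (sr_in / sr_out)  # output positions in input-sample units
    return np.interp(t_out, np.arange(x.size), x)
```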
## Technical Details
- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Supported for faster processing
## Next Steps
After processing your audio:
1. **Diarization**: Use Pyannote to separate speakers
2. **Transcription**: Use Whisper for text generation
3. **Dataset**: Package for TTS model training
## Tips
- Use 44.1kHz WAV input for best quality
- Enable Demucs for podcasts with music/background
- Enable denoise for noisy recordings
- 24kHz output is ideal for TTS training
- Processing takes ~30-60 seconds per 5 minutes (CPU mode)
## License
MIT
## Credits
- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face