title: Audio Processing Pipeline for TTS
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
Audio Processing Pipeline for TTS
Complete audio processing pipeline for TTS dataset creation.
Features
- Demucs Vocal Separation: Extract clean vocals using state-of-the-art AI
- Noise Reduction: Remove background noise with spectral gating
- Loudness Normalization: Normalize to target dBFS
- Resampling: Convert to target sample rate (24kHz for TTS)
- Mono Conversion: Automatic stereo to mono
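The mono conversion and loudness normalization features can be illustrated with a small dependency-free sketch. It is not the app's actual code: the function names are hypothetical, samples are assumed to be floats in [-1.0, 1.0], and "dBFS" is assumed to mean the RMS level relative to a full scale of 1.0 (a peak-based definition would give different gains).

```python
import math

def to_mono(frames):
    """Downmix by averaging the channels of each frame.

    `frames` is a list of per-frame channel tuples, e.g. [(left, right), ...].
    """
    return [sum(frame) / len(frame) for frame in frames]

def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0); -inf for silence."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")

def normalize_dbfs(samples, target_dbfs=-20.0):
    """Apply one gain so the RMS level reaches target_dbfs (assumed RMS-based)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clamp to [-1, 1]; if clamping kicks in, the result lands below the target.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

A single gain factor keeps the dynamics of the recording intact, which is usually what a TTS dataset wants; the clamp only matters when the target is louder than the signal's headroom allows.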
Usage
- Upload your audio file (WAV format, 44.1kHz recommended)
- Configure options:
  - Target sample rate (24kHz for TTS)
  - Target loudness (-20 dBFS recommended)
  - Enable/disable Demucs vocal separation
  - Enable/disable noise reduction
- Click "Process Audio"
- Download the processed result
Pipeline
Input WAV
↓
Demucs Vocal Separation (optional)
↓
Noise Reduction (optional)
↓
Loudness Normalization
↓
Resample to Target SR
↓
Convert to Mono
↓
Output Clean WAV
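The stage order above, with its two optional steps, could be wired up roughly as follows. This is only a sketch of the control flow: the stage names and the `stage_funcs` mapping are hypothetical placeholders, and the real stages (Demucs, noisereduce, resampling) would be supplied by the caller.

```python
def build_stage_order(use_demucs=True, use_denoise=True):
    """Return the stage names in pipeline order, skipping disabled optional stages."""
    stages = []
    if use_demucs:
        stages.append("demucs_vocal_separation")
    if use_denoise:
        stages.append("noise_reduction")
    # The remaining three stages always run, in this order.
    stages += ["loudness_normalization", "resample", "mono_conversion"]
    return stages

def run_pipeline(audio, stage_funcs, use_demucs=True, use_denoise=True):
    """Pass `audio` through each enabled stage.

    `stage_funcs` maps a stage name to a callable that takes and returns audio.
    """
    for name in build_stage_order(use_demucs, use_denoise):
        audio = stage_funcs[name](audio)
    return audio
```

Keeping the order in one function makes it easy to see which stages a given combination of checkboxes will actually run.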
Technical Details
- Demucs Model: htdemucs (hybrid transformer)
- Denoising: Spectral gating with noisereduce
- Output Format: Mono WAV, normalized loudness
- GPU: Supported for faster processing
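For the output format, a mono WAV file can be written with nothing but the Python standard library. The sketch below is an illustration rather than the app's own writer: `write_mono_wav` is a hypothetical helper, and 16-bit PCM is an assumption, since the bit depth is not stated here.

```python
import io
import struct
import wave

def write_mono_wav(dest, samples, sample_rate=24000):
    """Write float samples in [-1.0, 1.0] as mono 16-bit PCM (bit depth assumed).

    `dest` may be a file path or a writable, seekable binary file object.
    """
    pcm = b"".join(
        struct.pack("<h", int(round(max(-1.0, min(1.0, s)) * 32767)))
        for s in samples
    )
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```

For example, `write_mono_wav("clean.wav", samples)` would produce a 24kHz mono file, where `clean.wav` is a placeholder path.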
Next Steps
After processing your audio:
- Diarization: Use Pyannote to separate speakers
- Transcription: Use Whisper for text generation
- Dataset: Package for TTS model training
Tips
- Use 44.1kHz WAV input for best quality
- Enable Demucs for podcasts with music or other background audio
- Enable denoise for noisy recordings
- 24kHz output is ideal for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio (CPU mode)
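The CPU timing tip can be turned into a rough estimate for longer recordings. This sketch assumes processing time scales linearly with audio length, which is an assumption rather than a measured property of the app; `estimate_cpu_seconds` is a hypothetical helper.

```python
def estimate_cpu_seconds(audio_seconds, low_per_5min=30.0, high_per_5min=60.0):
    """Return a (low, high) processing-time estimate in seconds for CPU mode.

    Assumes linear scaling from the 30-60 seconds per 5 minutes figure.
    """
    scale = audio_seconds / 300.0  # number of 5-minute units
    return low_per_5min * scale, high_per_5min * scale
```

Under that assumption, a one-hour recording (`estimate_cpu_seconds(3600)`) would take somewhere between 6 and 12 minutes on CPU.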
License
MIT
Credits
- Demucs: Meta AI (Facebook Research)
- noisereduce: Tim Sainburg
- PyTorch Audio: PyTorch Team
- Gradio: Hugging Face