---
title: Audio Processing Pipeline for TTS
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Audio Processing Pipeline for TTS

Complete audio processing pipeline for TTS dataset creation.

## Features

- **Demucs Vocal Separation**: Extract clean vocals using state-of-the-art AI
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to a target dBFS
- **Resampling**: Convert to a target sample rate (24 kHz for TTS)
- **Mono Conversion**: Automatic stereo-to-mono downmix

## Usage

1. Upload your audio file (WAV format, 44.1 kHz recommended)
2. Configure options:
   - Target sample rate (24 kHz for TTS)
   - Target loudness (-20 dBFS recommended)
   - Enable/disable Demucs vocal separation
   - Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result

## Pipeline

```
Input WAV
    ↓
Demucs Vocal Separation (optional)
    ↓
Noise Reduction (optional)
    ↓
Loudness Normalization
    ↓
Resample to Target SR
    ↓
Convert to Mono
    ↓
Output Clean WAV
```

## Technical Details

- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Supported for faster processing

## Next Steps

After processing your audio:

1. **Diarization**: Use Pyannote to separate speakers
2. **Transcription**: Use Whisper for text generation
3. **Dataset**: Package the clips for TTS model training

## Tips

- Use 44.1 kHz WAV input for best quality
- Enable Demucs for podcasts with music or background audio
- Enable noise reduction for noisy recordings
- 24 kHz output is ideal for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio in CPU mode

## License

MIT

## Credits

- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face
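## Pipeline Sketch

The normalization, resampling, and mono-conversion stages described above can be sketched in a few lines of numpy/scipy. This is a minimal illustration, not the app's actual code: the function names (`normalize_dbfs`, `resample`, `to_mono`, `process`) are hypothetical, and the Demucs and noisereduce stages are deliberately omitted to keep the sketch dependency-light.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def normalize_dbfs(audio: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale the signal so its RMS level sits at target_dbfs (0 dBFS = full scale)."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio
    return audio * (10 ** (target_dbfs / 20.0) / rms)

def resample(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Polyphase resampling (e.g. 44100 Hz -> 24000 Hz); axis=0 handles stereo."""
    g = gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g, axis=0)

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Downmix [samples, channels] to [samples] by averaging channels."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def process(audio: np.ndarray, orig_sr: int = 44100,
            target_sr: int = 24000, target_dbfs: float = -20.0) -> np.ndarray:
    # Demucs vocal separation and noisereduce denoising would run before
    # these stages in the full pipeline; they are omitted from this sketch.
    audio = normalize_dbfs(audio, target_dbfs)
    audio = resample(audio, orig_sr, target_sr)
    return to_mono(audio)
```

Feeding one second of 44.1 kHz stereo through `process` yields one second of 24 kHz mono at roughly -20 dBFS, matching the recommended defaults above.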
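## How Spectral Gating Works

The noise-reduction stage uses spectral gating: the signal is moved into the frequency domain, a per-frequency noise floor is estimated, and bins that do not rise clearly above that floor are zeroed before resynthesis. The following is a minimal numpy sketch of the idea, not noisereduce's actual implementation; `spectral_gate`, its parameters, and the use of a separate noise-only clip for the floor estimate are illustrative assumptions.

```python
import numpy as np

def spectral_gate(noisy: np.ndarray, noise_clip: np.ndarray,
                  frame: int = 1024, hop: int = 256,
                  threshold_db: float = 10.0) -> np.ndarray:
    """Zero STFT bins that do not rise threshold_db above a noise-floor estimate."""
    window = np.hanning(frame)

    def stft(x):
        n = 1 + (len(x) - frame) // hop
        return np.stack([np.fft.rfft(x[i * hop:i * hop + frame] * window)
                         for i in range(n)])

    # Per-frequency noise floor, estimated from a noise-only clip.
    floor = np.mean(np.abs(stft(noise_clip)), axis=0)

    # Gate: keep only bins well above the floor (boolean mask on magnitudes).
    spec = stft(noisy)
    spec *= np.abs(spec) > floor * 10 ** (threshold_db / 20.0)

    # Overlap-add resynthesis with window-power normalization.
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i, frame_spec in enumerate(spec):
        out[i * hop:i * hop + frame] += np.fft.irfft(frame_spec, n=frame) * window
        norm[i * hop:i * hop + frame] += window ** 2
    norm[norm == 0] = 1.0
    return out / norm
```

A tone buried in broadband noise survives the gate (its bins sit far above the floor) while most noise-only bins are zeroed, which is why this approach works well on steady background hiss but less well on noise that overlaps the voice spectrally.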