---
title: Audio Processing Pipeline for TTS
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Audio Processing Pipeline for TTS

Complete audio processing pipeline for TTS dataset creation.

## Features

- **Demucs Vocal Separation**: Extract clean vocals using state-of-the-art AI
- **Noise Reduction**: Remove background noise with spectral gating
- **Loudness Normalization**: Normalize to a target dBFS
- **Resampling**: Convert to a target sample rate (24 kHz for TTS)
- **Mono Conversion**: Automatic stereo-to-mono downmix

## Usage

1. Upload your audio file (WAV format, 44.1 kHz recommended)
2. Configure options:
   - Target sample rate (24 kHz for TTS)
   - Target loudness (-20 dBFS recommended)
   - Enable/disable Demucs vocal separation
   - Enable/disable noise reduction
3. Click "Process Audio"
4. Download the processed result

## Pipeline

```
Input WAV
    ↓
Demucs Vocal Separation (optional)
    ↓
Noise Reduction (optional)
    ↓
Loudness Normalization
    ↓
Resample to Target SR
    ↓
Convert to Mono
    ↓
Output Clean WAV
```

## Technical Details

- **Demucs Model**: htdemucs (hybrid transformer)
- **Denoising**: Spectral gating with noisereduce
- **Output Format**: Mono WAV, normalized loudness
- **GPU**: Supported for faster processing

## Next Steps

After processing your audio:

1. **Diarization**: Use Pyannote to separate speakers
2. **Transcription**: Use Whisper for text generation
3. **Dataset**: Package the clips for TTS model training

## Tips

- Use 44.1 kHz WAV input for best quality
- Enable Demucs for podcasts with music or background audio
- Enable noise reduction for noisy recordings
- 24 kHz output is ideal for TTS training
- Processing takes roughly 30-60 seconds per 5 minutes of audio in CPU mode

## License

MIT

## Credits

- **Demucs**: Meta AI (Facebook Research)
- **noisereduce**: Tim Sainburg
- **PyTorch Audio**: PyTorch Team
- **Gradio**: Hugging Face
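## Pipeline Sketch

The normalization, resampling, and mono-conversion stages described above can be sketched in a few lines of numpy/scipy. This is a minimal illustration, not the app's actual code: the function names (`normalize_dbfs`, `resample`, `to_mono`, `process`) are hypothetical, and the Demucs and noisereduce stages are deliberately omitted to keep the sketch dependency-light.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def normalize_dbfs(audio: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale the signal so its RMS level sits at target_dbfs (0 dBFS = full scale)."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio
    return audio * (10 ** (target_dbfs / 20.0) / rms)

def resample(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Polyphase resampling (e.g. 44100 Hz -> 24000 Hz); axis=0 handles stereo."""
    g = gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g, axis=0)

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Downmix [samples, channels] to [samples] by averaging channels."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def process(audio: np.ndarray, orig_sr: int = 44100,
            target_sr: int = 24000, target_dbfs: float = -20.0) -> np.ndarray:
    # Demucs vocal separation and noisereduce denoising would run before
    # these stages in the full pipeline; they are omitted from this sketch.
    audio = normalize_dbfs(audio, target_dbfs)
    audio = resample(audio, orig_sr, target_sr)
    return to_mono(audio)
```

Feeding one second of 44.1 kHz stereo through `process` yields one second of 24 kHz mono at roughly -20 dBFS, matching the recommended defaults above.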
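## How Spectral Gating Works

The noise-reduction stage uses spectral gating: the signal is moved into the frequency domain, a per-frequency noise floor is estimated, and bins that do not rise clearly above that floor are zeroed before resynthesis. The following is a minimal numpy sketch of the idea, not noisereduce's actual implementation; `spectral_gate`, its parameters, and the use of a separate noise-only clip for the floor estimate are illustrative assumptions.

```python
import numpy as np

def spectral_gate(noisy: np.ndarray, noise_clip: np.ndarray,
                  frame: int = 1024, hop: int = 256,
                  threshold_db: float = 10.0) -> np.ndarray:
    """Zero STFT bins that do not rise threshold_db above a noise-floor estimate."""
    window = np.hanning(frame)

    def stft(x):
        n = 1 + (len(x) - frame) // hop
        return np.stack([np.fft.rfft(x[i * hop:i * hop + frame] * window)
                         for i in range(n)])

    # Per-frequency noise floor, estimated from a noise-only clip.
    floor = np.mean(np.abs(stft(noise_clip)), axis=0)

    # Gate: keep only bins well above the floor (boolean mask on magnitudes).
    spec = stft(noisy)
    spec *= np.abs(spec) > floor * 10 ** (threshold_db / 20.0)

    # Overlap-add resynthesis with window-power normalization.
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i, frame_spec in enumerate(spec):
        out[i * hop:i * hop + frame] += np.fft.irfft(frame_spec, n=frame) * window
        norm[i * hop:i * hop + frame] += window ** 2
    norm[norm == 0] = 1.0
    return out / norm
```

A tone buried in broadband noise survives the gate (its bins sit far above the floor) while most noise-only bins are zeroed, which is why this approach works well on steady background hiss but less well on noise that overlaps the voice spectrally.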