metadata
title: Audio Processing Pipeline for TTS
emoji: 🎡
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit

Audio Processing Pipeline for TTS

A complete audio processing pipeline for preparing text-to-speech (TTS) training datasets.

Features

  • Demucs Vocal Separation: Extract clean vocals with the htdemucs source separation model
  • Noise Reduction: Remove background noise with spectral gating
  • Loudness Normalization: Normalize to a target level in dBFS
  • Resampling: Convert to the target sample rate (24 kHz for TTS)
  • Mono Conversion: Automatic stereo to mono
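
As an illustration of the loudness normalization step, here is a minimal sketch in plain Python. It assumes the level is measured as RMS relative to full scale (1.0); this README does not say whether the app measures RMS or peak, and the function names `rms_dbfs` and `normalize_to_dbfs` are hypothetical, not taken from `app.py`.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (full scale = 1.0) in dBFS."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(mean_square)


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs (assumed RMS-based)."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # nothing to scale in pure silence
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clamp to [-1.0, 1.0] so a large gain cannot push samples past full scale.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

For example, a sine wave with amplitude 0.5 sits at roughly -9 dBFS RMS, so normalizing it to -20 dBFS applies a gain below 1 and no clamping occurs.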

Usage

  1. Upload your audio file (WAV format, 44.1 kHz recommended)
  2. Configure options:
    • Target sample rate (24 kHz for TTS)
    • Target loudness (-20 dBFS recommended)
    • Enable/disable Demucs vocal separation
    • Enable/disable noise reduction
  3. Click "Process Audio"
  4. Download the processed result
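
The options in step 2 could be represented as a small settings object. The sketch below is only an illustration: the class name `ProcessingOptions`, its field names, and the choice to enable both optional stages by default are assumptions, not the actual structure used in `app.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingOptions:
    """Hypothetical container for the options listed in the Usage steps."""

    target_sample_rate: int = 24_000  # Hz; 24 kHz is the value suggested for TTS
    target_dbfs: float = -20.0        # recommended target loudness
    use_demucs: bool = True           # default is an assumption
    use_denoise: bool = True          # default is an assumption

    def __post_init__(self):
        # Reject values that cannot describe a valid output file.
        if self.target_sample_rate <= 0:
            raise ValueError("target_sample_rate must be positive")
        if self.target_dbfs > 0:
            raise ValueError("target_dbfs must be at or below 0 dBFS")
```

Calling `ProcessingOptions()` yields the recommended settings, and `ProcessingOptions(use_demucs=False)` skips vocal separation while keeping the rest.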

Pipeline

Input WAV
    ↓
Demucs Vocal Separation (optional)
    ↓
Noise Reduction (optional)
    ↓
Loudness Normalization
    ↓
Resample to Target SR
    ↓
Convert to Mono
    ↓
Output Clean WAV
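
The stage order above, with its two optional stages, can be sketched as a simple chain of functions. This is a hedged illustration of the control flow only: `run_pipeline` and the stage tuples are hypothetical, and the real stages would wrap Demucs, noisereduce, and the resampler rather than the placeholder callables a caller passes in here.

```python
from typing import Callable, List, Sequence, Tuple

# Each stage is (name, function, enabled). The function takes and returns audio.
Stage = Tuple[str, Callable[[list], list], bool]


def run_pipeline(audio: list, stages: Sequence[Stage]) -> Tuple[list, List[str]]:
    """Apply the enabled stages in order and report which ones ran."""
    applied: List[str] = []
    for name, stage_fn, enabled in stages:
        if not enabled:
            continue  # optional stages (Demucs, noise reduction) may be skipped
        audio = stage_fn(audio)
        applied.append(name)
    return audio, applied
```

Listing the stages in the order shown in the diagram, with the `enabled` flag set from the user's options, keeps the ordering in one place and makes it easy to log which steps actually ran.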

Technical Details

  • Demucs Model: htdemucs (hybrid transformer)
  • Denoising: Spectral gating with noisereduce
  • Output Format: Mono WAV, normalized loudness
  • GPU: Supported for faster processing
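
For reference, vocal separation with the htdemucs model can also be run outside the app through the Demucs command-line tool. The helper below only builds the argument list; `build_demucs_command` is a hypothetical name, and whether this Space calls the CLI or the Python API internally is not stated here.

```python
def build_demucs_command(input_wav, output_dir, use_gpu=False):
    """Build a Demucs CLI invocation that splits a file into vocals and the rest."""
    return [
        "demucs",
        "--two-stems", "vocals",             # produce vocals plus one accompaniment stem
        "-n", "htdemucs",                    # model named in Technical Details
        "-d", "cuda" if use_gpu else "cpu",  # device selection
        "-o", str(output_dir),               # folder that receives the separated stems
        str(input_wav),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `demucs` package is installed.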

Next Steps

After processing your audio:

  1. Diarization: Use pyannote.audio to separate speakers
  2. Transcription: Use Whisper for text generation
  3. Dataset: Package for TTS model training

Tips

  • Use 44.1 kHz WAV input for best quality
  • Enable Demucs for podcasts with music or other background sound
  • Enable denoise for noisy recordings
  • 24 kHz output is ideal for TTS training
  • Processing takes roughly 30-60 seconds per 5 minutes of audio (CPU mode)
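
To plan batch jobs, the CPU timing tip can be turned into a rough estimate. This sketch simply scales the 30-60 second figure linearly with audio length, which is an assumption; actual times depend on hardware and on which stages are enabled. The name `estimate_cpu_seconds` is hypothetical.

```python
def estimate_cpu_seconds(audio_seconds):
    """Return a (low, high) processing-time estimate in seconds for CPU mode."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    blocks = audio_seconds / 300.0  # number of 5-minute blocks of audio
    return (30.0 * blocks, 60.0 * blocks)
```

For a one-hour recording, `estimate_cpu_seconds(3600)` gives a range of 6 to 12 minutes.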

License

MIT

Credits

  • Demucs: Meta AI (Facebook Research)
  • noisereduce: Tim Sainburg
  • PyTorch Audio: PyTorch Team
  • Gradio: Hugging Face