redsky17 committed on
Commit 78e6f13 · verified · 1 Parent(s): 7a6b0ca

Upload 3 files

Files changed (3)
  1. README.md +81 -8
  2. app.py +267 -0
  3. requirements.txt +9 -0
README.md CHANGED
@@ -1,14 +1,87 @@
  ---
- title: Extractor
- emoji: 📊
- colorFrom: red
- colorTo: yellow
  sdk: gradio
- sdk_version: 5.49.1
  app_file: app.py
  pinned: false
- license: apache-2.0
- short_description: remove noise from audio
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Audio Processing Pipeline for TTS
+ emoji: 🎵
+ colorFrom: purple
+ colorTo: blue
  sdk: gradio
+ sdk_version: 4.44.0
  app_file: app.py
  pinned: false
+ license: mit
  ---

+ # Audio Processing Pipeline for TTS
+
+ Complete audio processing pipeline for TTS dataset creation.
+
+ ## Features
+
+ - **Demucs Vocal Separation**: Extract clean vocals using state-of-the-art AI
+ - **Noise Reduction**: Remove background noise with spectral gating
+ - **Loudness Normalization**: Normalize to target dBFS
+ - **Resampling**: Convert to target sample rate (24kHz for TTS)
+ - **Mono Conversion**: Automatic stereo to mono
+
+ ## Usage
+
+ 1. Upload your audio file (WAV format, 44.1kHz recommended)
+ 2. Configure options:
+    - Target sample rate (24kHz for TTS)
+    - Target loudness (-20 dBFS recommended)
+    - Enable/disable Demucs vocal separation
+    - Enable/disable noise reduction
+ 3. Click "Process Audio"
+ 4. Download the processed result
+
+ ## Pipeline
+
+ ```
+ Input WAV
+ ↓
+ Demucs Vocal Separation (optional)
+ ↓
+ Noise Reduction (optional)
+ ↓
+ Loudness Normalization
+ ↓
+ Resample to Target SR
+ ↓
+ Convert to Mono
+ ↓
+ Output Clean WAV
+ ```
+
+ ## Technical Details
+
+ - **Demucs Model**: htdemucs (hybrid transformer)
+ - **Denoising**: Spectral gating with noisereduce
+ - **Output Format**: Mono WAV, normalized loudness
+ - **GPU**: Supported for faster processing
+
+ ## Next Steps
+
+ After processing your audio:
+
+ 1. **Diarization**: Use Pyannote to separate speakers
+ 2. **Transcription**: Use Whisper for text generation
+ 3. **Dataset**: Package for TTS model training
+
+ ## Tips
+
+ - Use 44.1kHz WAV input for best quality
+ - Enable Demucs for podcasts with music/background
+ - Enable denoise for noisy recordings
+ - 24kHz output is ideal for TTS training
+ - Processing takes ~30-60 seconds per 5 minutes of audio (CPU mode)
+
+ ## License
+
+ MIT
+
+ ## Credits
+
+ - **Demucs**: Meta AI (Facebook Research)
+ - **noisereduce**: Tim Sainburg
+ - **PyTorch Audio**: PyTorch Team
+ - **Gradio**: Hugging Face
+
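The loudness normalization listed under Features is RMS-based in `app.py`: it measures the signal's RMS level in dBFS, applies the gain that moves that level to the target, and clamps the result to [-1, 1]. Below is a minimal pure-Python sketch of that arithmetic, assuming float samples with full scale at 1.0. The names `rms_dbfs` and `normalize_to_dbfs` are illustrative and do not appear in the commit.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples (full scale = 1.0) in dBFS."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs, then clamp to [-1, 1]."""
    current = rms_dbfs(samples)
    if math.isinf(current):
        return list(samples)  # silence: there is no level to scale
    gain = 10 ** ((target_dbfs - current) / 20)  # dB difference -> linear gain
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# A full-scale square wave sits at 0 dBFS RMS; normalizing to -20 dBFS scales it by 0.1.
square = [1.0, -1.0] * 100
quiet = normalize_to_dbfs(square, -20.0)
print(round(rms_dbfs(quiet), 3))
```

Because the final clamp hard-limits peaks, a signal with a high peak-to-RMS ratio may end up somewhat below the requested level after normalization.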
app.py ADDED
@@ -0,0 +1,267 @@
+ #!/usr/bin/env python3
+ """
+ Gradio App for Hugging Face Spaces
+ Audio Processing Pipeline: Demucs + Denoise + Normalize + Resample
+ """
+
+ import gradio as gr
+ import torch
+ import torchaudio
+ import os
+ import tempfile
+ from pathlib import Path
+
+ print("Loading dependencies...")
+
+ # Check device
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ print(f"Device: {DEVICE}")
+
+ def separate_vocals_demucs(audio_path, device="cpu"):
+     """Extract vocals using Demucs"""
+     from demucs.pretrained import get_model
+     from demucs.apply import apply_model
+
+     # Load model
+     model = get_model('htdemucs')
+     model.to(device)
+     model.eval()
+
+     # Load audio
+     wav, sr = torchaudio.load(audio_path)
+
+     # Resample to 44.1kHz if needed
+     if sr != 44100:
+         wav = torchaudio.transforms.Resample(sr, 44100)(wav)
+         sr = 44100
+
+     # htdemucs expects stereo input, so duplicate a mono channel
+     if wav.shape[0] == 1:
+         wav = wav.repeat(2, 1)
+
+     # Process: add a batch dimension -> (batch, channels, samples)
+     wav = wav.to(device)
+     if wav.dim() == 2:
+         wav = wav.unsqueeze(0)
+
+     with torch.no_grad():
+         sources = apply_model(model, wav, device=device)
+
+     # Extract vocals
+     vocals_idx = model.sources.index('vocals')
+     vocals = sources[0, vocals_idx].cpu()
+
+     return vocals, sr
+
+
+ def denoise_audio(audio, sr):
+     """Apply noise reduction"""
+     try:
+         import noisereduce as nr
+         audio_np = audio.squeeze().numpy()
+         reduced = nr.reduce_noise(
+             y=audio_np,
+             sr=sr,
+             stationary=True,
+             prop_decrease=1.0,
+             freq_mask_smooth_hz=500,
+             time_mask_smooth_ms=50
+         )
+         audio = torch.from_numpy(reduced).unsqueeze(0).float()
+     except Exception as e:
+         print(f"Denoising skipped: {e}")
+     return audio
+
+
+ def normalize_loudness(audio, target_dbfs=-20.0):
+     """Normalize to target loudness"""
+     rms = torch.sqrt(torch.mean(audio ** 2))
+     if rms > 0:
+         current_dbfs = 20 * torch.log10(rms)
+         gain_db = target_dbfs - current_dbfs
+         gain_linear = 10 ** (gain_db / 20)
+         audio = audio * gain_linear
+         audio = torch.clamp(audio, -1.0, 1.0)
+     return audio
+
+
+ def convert_to_mono(audio):
+     """Convert to mono"""
+     if audio.shape[0] > 1:
+         audio = torch.mean(audio, dim=0, keepdim=True)
+     return audio
+
+
+ def process_audio(
+     input_file,
+     target_sr,
+     target_dbfs,
+     use_demucs,
+     use_denoise,
+     progress=gr.Progress()
+ ):
+     """Complete audio processing pipeline"""
+
+     if input_file is None:
+         return None, "❌ Please upload an audio file"
+
+     try:
+         progress(0.1, desc="Loading audio...")
+
+         # Step 1: Vocal separation (optional)
+         if use_demucs:
+             progress(0.2, desc="Separating vocals with Demucs...")
+             audio, sr = separate_vocals_demucs(input_file, DEVICE)
+         else:
+             audio, sr = torchaudio.load(input_file)
+
+         # Step 2: Convert to mono
+         progress(0.5, desc="Converting to mono...")
+         audio = convert_to_mono(audio)
+
+         # Step 3: Denoise (optional)
+         if use_denoise:
+             progress(0.6, desc="Removing noise...")
+             audio = denoise_audio(audio, sr)
+
+         # Step 4: Normalize
+         progress(0.7, desc="Normalizing loudness...")
+         audio = normalize_loudness(audio, target_dbfs)
+
+         # Step 5: Resample
+         if sr != target_sr:
+             progress(0.8, desc=f"Resampling to {target_sr} Hz...")
+             resampler = torchaudio.transforms.Resample(sr, target_sr)
+             audio = resampler(audio)
+             sr = target_sr
+
+         # Save output
+         progress(0.9, desc="Saving output...")
+         output_path = tempfile.NamedTemporaryFile(delete=False, suffix=".wav").name
+         torchaudio.save(output_path, audio, sr)
+
+         # Get info
+         duration = audio.shape[1] / sr
+         size_mb = os.path.getsize(output_path) / (1024 * 1024)
+
+         info = f"""
+ ✅ **Processing Complete!**
+
+ 📊 **Output Info:**
+ - Duration: {duration:.1f} seconds
+ - Sample Rate: {sr} Hz
+ - Channels: {audio.shape[0]} (mono)
+ - Size: {size_mb:.2f} MB
+ - Loudness: {target_dbfs} dBFS
+
+ 🎵 **Pipeline Steps:**
+ {"✓ Demucs vocal separation" if use_demucs else "⊗ Skipped vocal separation"}
+ {"✓ Noise reduction" if use_denoise else "⊗ Skipped noise reduction"}
+ ✓ Loudness normalization
+ ✓ Resampled to {target_sr} Hz
+ ✓ Converted to mono
+ """
+
+         progress(1.0, desc="Done!")
+         return output_path, info
+
+     except Exception as e:
+         import traceback
+         error_msg = f"❌ **Error:** {str(e)}\n\n```\n{traceback.format_exc()}\n```"
+         return None, error_msg
+
+
+ # Create Gradio interface
+ with gr.Blocks(theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+ # 🎵 Audio Processing Pipeline for TTS
+
+ Extract clean vocals from podcasts/audio for TTS training
+
+ **Pipeline:** Demucs Vocal Separation → Denoise → Normalize → Resample → Mono
+ """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 📁 Input")
+             input_audio = gr.Audio(
+                 label="Upload Audio (WAV format, 44.1kHz recommended)",
+                 type="filepath"
+             )
+
+             gr.Markdown("### ⚙️ Options")
+
+             target_sr = gr.Radio(
+                 choices=[16000, 22050, 24000, 44100, 48000],
+                 value=24000,
+                 label="Target Sample Rate",
+                 info="24kHz recommended for TTS"
+             )
+
+             target_dbfs = gr.Slider(
+                 minimum=-40,
+                 maximum=0,
+                 value=-20,
+                 step=1,
+                 label="Target Loudness (dBFS)",
+                 info="Normalization level (-20 recommended)"
+             )
+
+             use_demucs = gr.Checkbox(
+                 value=True,
+                 label="Use Demucs Vocal Separation",
+                 info="Extracts clean vocals (slower but better)"
+             )
+
+             use_denoise = gr.Checkbox(
+                 value=True,
+                 label="Apply Noise Reduction",
+                 info="Remove background noise"
+             )
+
+             process_btn = gr.Button("🚀 Process Audio", variant="primary", size="lg")
+
+         with gr.Column(scale=1):
+             gr.Markdown("### 📥 Output")
+             output_audio = gr.Audio(
+                 label="Processed Audio",
+                 type="filepath"
+             )
+             output_info = gr.Markdown("Upload audio and click 'Process Audio' to start")
+
+     gr.Markdown("""
+ ---
+ ### 📖 Usage Tips
+
+ - **Input:** Upload WAV files (44.1kHz recommended for best quality)
+ - **Demucs:** Enable for podcasts with music/background sounds
+ - **Denoise:** Enable for noisy recordings
+ - **Sample Rate:** Use 24kHz for TTS training, 16kHz for ASR
+ - **Processing Time:** ~30-60 seconds for 5-minute audio (CPU mode)
+
+ ### 🔧 Technical Details
+
+ - **Device:** {} {}
+ - **Demucs Model:** htdemucs (hybrid transformer)
+ - **Denoise:** Spectral gating with noisereduce
+ - **Output:** Mono WAV, normalized loudness
+
+ ### 💡 Next Steps
+
+ After processing:
+ 1. Download the clean audio
+ 2. Use Pyannote for speaker diarization
+ 3. Use Whisper for transcription
+ 4. Package as TTS training dataset
+
+ ---
+ Made with ❤️ for TTS dataset creation
+ """.format(DEVICE, torch.cuda.get_device_name(0) if DEVICE == "cuda" else ""))
+
+     # Connect button
+     process_btn.click(
+         fn=process_audio,
+         inputs=[input_audio, target_sr, target_dbfs, use_demucs, use_denoise],
+         outputs=[output_audio, output_info]
+     )
+
+ if __name__ == "__main__":
+     print("Starting Gradio app...")
+     demo.launch()
+
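The order in which `process_audio` applies its stages can be summarized with a small planning function. This is only a sketch for reasoning about the control flow: `planned_steps` and the stage labels are hypothetical names, and the function performs no audio work. It assumes, as in `separate_vocals_demucs`, that the Demucs path always hands back 44.1 kHz audio.

```python
def planned_steps(use_demucs, use_denoise, source_sr, target_sr):
    """List the stages process_audio would run, in execution order."""
    steps = []
    sr = source_sr
    if use_demucs:
        steps.append("demucs")
        sr = 44100  # the Demucs path resamples its input to 44.1 kHz
    steps.append("mono")  # mono conversion happens before denoising
    if use_denoise:
        steps.append("denoise")
    steps.append("normalize")
    if sr != target_sr:
        steps.append("resample")  # skipped when the rates already match
    steps.append("save")
    return steps


print(planned_steps(True, True, 48000, 24000))
print(planned_steps(False, False, 24000, 24000))
```

With both options enabled and a 24 kHz target, every stage runs; with both disabled and matching sample rates, only mono conversion, normalization, and saving remain.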
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ gradio>=4.0.0
+ demucs>=4.0.0
+ noisereduce>=3.0.0
+ numpy>=1.24.0
+ scipy>=1.10.0
+ soundfile>=0.12.0
+