---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

---

# Audio-Classification-Raw-Audio-to-Mel-Spectrogram-CNNs
Complete end-to-end audio classification pipeline using deep learning. From raw recordings to Mel-spectrogram CNNs, it includes preprocessing, augmentation, dataset validation, model training, and evaluation — a reproducible blueprint for speech, environmental, or general sound classification tasks.

---

# Audio Classification Pipeline — From Raw Audio to Mel-Spectrogram CNNs

> *“In machine learning, the model is rarely the problem — the data almost always is.”*
> — A reminder I kept repeating to myself while building this project.

This repository contains a complete, professional, end-to-end pipeline for **audio classification using deep learning**, starting from **raw, messy audio recordings** and ending with a fully trained **CNN model** built on **Mel spectrograms**.

The workflow includes:

* Raw audio loading
* Cleaning & normalization
* Silence trimming
* Noise reduction
* Chunking
* Data augmentation
* Mel spectrogram generation
* Dataset validation
* CNN training
* Evaluation & metrics

It is a fully reproducible blueprint for real-world audio classification tasks.

---

# Project Structure

Here is a quick table summarizing the core stages of the pipeline:

| Stage | Description | Output |
| ----------------------- | -------------------------------------- | ---------------- |
| **1. Raw Audio** | Unprocessed WAV/MP3 files | Audio dataset |
| **2. Preprocessing** | Trimming, cleaning, resampling | Cleaned signals |
| **3. Augmentation** | Pitch shift, time stretch, noise | Expanded dataset |
| **4. Mel Spectrograms** | Converts audio → images | PNG/IMG files |
| **5. CNN Training** | Deep model learns spectrogram patterns | `.h5` model |
| **6. Evaluation** | Accuracy, F1, confusion matrix | Metrics + plots |

---

# 1. Loading & Inspecting Raw Audio

The dataset index is built by walking the extracted directory tree; each file's parent folder name becomes its class label:

```python
import pandas as pd
from pathlib import Path

# extract_to: root folder of the extracted dataset
# audio_extensions: allowed suffixes, e.g. {'.wav', '.mp3'}
paths = [(path.parts[-2], path.name, str(path))
         for path in Path(extract_to).rglob('*.*')
         if path.suffix.lower() in audio_extensions]

df = pd.DataFrame(paths, columns=['class', 'filename', 'full_path'])
df = df.sort_values('class').reset_index(drop=True)
```

During EDA, I computed the following for every recording:

* Duration
* Sample rate
* Peak amplitude
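
How these statistics were gathered isn't shown in the snippet above; a minimal sketch using `librosa` (reusing the `df` built earlier) could look like this:

```python
import librosa
import numpy as np
import pandas as pd

def audio_stats(path):
    # Load at the file's native sample rate so the reported rate is the original one
    y, sr = librosa.load(path, sr=None, mono=True)
    return pd.Series({
        'duration': len(y) / sr,                   # seconds
        'sample_rate': sr,                         # Hz
        'peak_amplitude': float(np.abs(y).max()),  # linear amplitude
    })

stats = df['full_path'].apply(audio_stats)
df = pd.concat([df, stats], axis=1)
```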

I then visualized the duration distribution:

```python
import matplotlib.pyplot as plt

plt.hist(df['duration'], bins=30, edgecolor='black')
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of recordings")
plt.title("Audio Duration Distribution")
plt.show()
```

---

# 2. Audio Cleaning & Normalization

Bad samples were removed, silent files were filtered out, and amplitudes were peak-normalized:

```python
import numpy as np

# y: waveform of a single recording
peak = np.abs(y).max()
if peak > 0:
    y = y / peak * 0.99
```

This ensures consistency and prevents the model from learning from corrupted audio.
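
The silent-file filter itself isn't included above; a simple sketch based on overall RMS energy (the threshold value here is illustrative, not the exact one used):

```python
import librosa
import numpy as np

SILENCE_RMS_THRESHOLD = 1e-3  # assumed cutoff; tune per dataset

def is_mostly_silent(path):
    # A recording whose overall RMS energy is near zero carries no usable signal
    y, _ = librosa.load(path, sr=None, mono=True)
    return float(np.sqrt(np.mean(y ** 2))) < SILENCE_RMS_THRESHOLD

df = df[~df['full_path'].apply(is_mostly_silent)].reset_index(drop=True)
```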

---

# 3. Advanced Preprocessing

Preprocessing included:

* Silence trimming
* Noise reduction
* Resampling → **16 kHz**
* Mono conversion
* 5-second chunking

```python
TARGET_DURATION = 5.0                             # seconds per chunk
TARGET_SR = 16000                                 # target sample rate (Hz)
TARGET_LENGTH = int(TARGET_DURATION * TARGET_SR)  # 80,000 samples per chunk
```

Every audio file becomes a clean, consistent chunk ready for feature extraction.
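
The chunking step isn't reproduced here; a minimal sketch that resamples, splits into 5-second windows, and zero-pads the final window (illustrative, not the exact notebook code):

```python
import librosa
import numpy as np

def to_fixed_chunks(y, sr):
    # Bring the signal to the target sample rate first
    if sr != TARGET_SR:
        y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)
    chunks = []
    for start in range(0, len(y), TARGET_LENGTH):
        chunk = y[start:start + TARGET_LENGTH]
        # Pad the last chunk so every example has exactly TARGET_LENGTH samples
        if len(chunk) < TARGET_LENGTH:
            chunk = np.pad(chunk, (0, TARGET_LENGTH - len(chunk)))
        chunks.append(chunk)
    return chunks
```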

---

# 4. Audio Augmentation

To improve generalization, I applied waveform-level augmentations:

```python
from audiomentations import AddGaussianNoise, Compose, PitchShift, Shift, TimeStretch

augment = Compose([
    Shift(min_shift=-0.3, max_shift=0.3, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
])
```

Every augmented file receives a unique name to avoid collisions.
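
Applying the chain and writing the results isn't shown above; one augmentation pass might look like this (`src_path` and `out_dir` are placeholders):

```python
import uuid
from pathlib import Path

import librosa
import soundfile as sf

y, sr = librosa.load(src_path, sr=TARGET_SR, mono=True)
y_aug = augment(samples=y, sample_rate=sr)

# A short uuid suffix keeps augmented filenames unique
out_name = f"{Path(src_path).stem}_aug_{uuid.uuid4().hex[:8]}.wav"
sf.write(Path(out_dir) / out_name, y_aug, sr)
```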

---

# 5. Mel Spectrogram Generation

Each cleaned audio chunk is transformed into a **Mel spectrogram**:

```python
import librosa
import numpy as np

S = librosa.feature.melspectrogram(
    y=y, sr=SR,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS
)
S_dB = librosa.power_to_db(S, ref=np.max)
```

* Output: **128×128 PNG images**
* Separate directories per class
* Supports both original & augmented samples

These images become the CNN input.
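
How each spectrogram is written to disk isn't shown above; a minimal, assumed approach is to normalize `S_dB` to `[0, 1]` and save it with `matplotlib` (`out_path` is a placeholder such as `spectrograms/<class>/<clip_id>.png`, and resizing to exactly 128×128 can happen when the images are loaded):

```python
import matplotlib.pyplot as plt
import numpy as np

# Scale the dB-valued spectrogram into [0, 1] before writing it as an image
img = (S_dB - S_dB.min()) / (S_dB.max() - S_dB.min() + 1e-8)

# Flip vertically so low frequencies sit at the bottom, as in a conventional plot
plt.imsave(out_path, np.flipud(img), cmap='magma')
```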

### ***Example of Mel Spectrogram Images***

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F27304693%2Ffdf7046a261734cd8f503c8f448ca6ad%2Fdownload.png?generation=1763570826533634&alt=media)

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F27304693%2Fea53570ce051601192c90770091f7ceb%2Fdownload%20(1).png?generation=1763570855911665&alt=media)

---

# 6. Dataset Validation

After spectrogram creation:

* Corrupted images removed
* Duplicate hashes filtered
* Filename integrity checked
* Class folders validated

```python
df['file_hash'] = df['full_path'].apply(get_hash)
duplicate_hashes = df[df.duplicated(subset=['file_hash'], keep=False)]
```
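
The `get_hash` helper isn't shown in the excerpt; a minimal MD5-based version might look like this:

```python
import hashlib

def get_hash(path, chunk_size=8192):
    # Stream the file in blocks so large files don't need to fit in memory
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            md5.update(block)
    return md5.hexdigest()
```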

This step ensures **clean, reliable** training data.

---

# 7. Building TensorFlow Datasets

The dataset is built with batching, caching, and prefetching:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
train_ds = train_ds.map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.shuffle(1024).batch(batch_size).prefetch(AUTOTUNE)
```
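
The `load_and_preprocess` function isn't shown above; a plausible implementation, assuming 4-channel PNGs resized to match the `(231, 232, 4)` input shape used below, would be:

```python
import tensorflow as tf

IMG_SIZE = (231, 232)  # assumed target height/width

def load_and_preprocess(path, label):
    # Read a spectrogram PNG, resize it, and scale pixel values to [0, 1]
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=4)
    img = tf.image.resize(img, IMG_SIZE) / 255.0  # resize returns float32
    return img, label
```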

I used a simple image-level augmentation pipeline:

```python
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(231, 232, 4)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```

---

# 8. CNN Architecture

The CNN learns frequency-time patterns from the Mel-spectrogram images.

Key features:

* Multiple Conv2D + BatchNorm blocks
* Dropout
* L2 regularization
* Softmax output

```python
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

model = Sequential([
    data_augmentation,
    Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer=l2(weight_decay)),
    BatchNormalization(),
    MaxPooling2D((2,2)),
    Dropout(0.2),
    # ... more layers ...
    Flatten(),
    Dense(num_classes, activation='softmax')
])
```
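
The compile step isn't included in the excerpt; a typical setup, assuming integer-encoded labels (hence the sparse loss), would be:

```python
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```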

---

# 9. Training Strategy

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=40, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[reduce_lr, early_stopping]
)
```

The model converges smoothly while avoiding overfitting.

---

# 10. Evaluation

Performance is evaluated using:

* Accuracy
* Precision, recall, F1-score
* Confusion matrix
* ROC/AUC curves

```python
from sklearn.metrics import classification_report

y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
```

Confusion matrix:

```python
import seaborn as sns
from sklearn.metrics import confusion_matrix

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
```
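
The ROC/AUC code isn't shown above; a one-vs-rest sketch for the multi-class case, assuming `y_true` holds integer labels and `model.predict` returns softmax probabilities:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_score = model.predict(test_ds)  # class probabilities, one column per class
y_true_bin = label_binarize(y_true, classes=range(len(le.classes_)))

for i, class_name in enumerate(le.classes_):
    fpr, tpr, _ = roc_curve(y_true_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"{class_name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```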

---

# 11. Saving the Model & Dataset

```python
import shutil

model.save("Audio_Model_Classification.h5")
shutil.make_archive("/content/spectrograms", 'zip', "/content/spectrograms")
```

The entire spectrogram dataset is also zipped for sharing or deployment.
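
For later inference, the saved model can be reloaded with the standard Keras API (`sample_batch` below is a placeholder for a batch of preprocessed spectrogram images):

```python
import tensorflow as tf

model = tf.keras.models.load_model("Audio_Model_Classification.h5")
predictions = model.predict(sample_batch)
```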

---

# Final Notes

This project demonstrates:

* How to clean & prepare raw audio at a professional level
* Audio augmentation best practices
* How Mel spectrograms unlock CNN performance
* A full TensorFlow training pipeline
* Proper evaluation, reporting, and dataset integrity

If you're working on sound recognition, speech tasks, or environmental audio detection, this pipeline gives you a **complete production-grade foundation**.

---

# **Results**

> **Note:** Click the image below to view the video showcasing the project’s results.

<a href="https://files.catbox.moe/suzziy.mp4">
  <img src="https://images.unsplash.com/photo-1611162616475-46b635cb6868?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width="400">
</a>

<hr style="border-bottom: 5px solid gray; margin-top: 10px;">

> **Note:** If the video above is not working, you can access it directly via the link below.

[Watch Demo Video](Results/Spectrogram_CNN_Audio_Classification.mp4)