AIOmarRehan committed 6a70fed (verified, parent 83fc726): Upload README.md
[If you would like a detailed explanation of this project, please refer to this Medium article.](https://medium.com/@ai.omar.rehan/building-a-complete-audio-classification-pipeline-using-deep-learning-from-raw-audio-to-mel-9894bd438d85)

---

[The project is also available for testing on Hugging Face.](https://huggingface.co/spaces/AIOmarRehan/Deep_Audio_Classifier_using_CNN)

---

# Audio-Classification-Raw-Audio-to-Mel-Spectrogram-CNNs
Complete end-to-end audio classification pipeline using deep learning. From raw recordings to Mel spectrogram CNNs, it includes preprocessing, augmentation, dataset validation, model training, and evaluation — a reproducible blueprint for speech, environmental, or general sound classification tasks.

---

# Audio Classification Pipeline — From Raw Audio to Mel-Spectrogram CNNs

> *“In machine learning, the model is rarely the problem — the data almost always is.”*
> — A reminder I kept repeating to myself while building this project.

This repository contains a complete, professional, end-to-end pipeline for **audio classification using deep learning**, starting from **raw, messy audio recordings** and ending with a fully trained **CNN model** using **Mel spectrograms**.

The workflow includes:

* Raw audio loading
* Cleaning & normalization
* Silence trimming
* Noise reduction
* Chunking
* Data augmentation
* Mel spectrogram generation
* Dataset validation
* CNN training
* Evaluation & metrics

It is a fully reproducible blueprint for real-world audio classification tasks.

---

# Project Structure

Here is a quick table summarizing the core stages of the pipeline:

| Stage | Description | Output |
| ----------------------- | -------------------------------------- | ---------------- |
| **1. Raw Audio** | Unprocessed WAV/MP3 files | Audio dataset |
| **2. Preprocessing** | Trimming, cleaning, resampling | Cleaned signals |
| **3. Augmentation** | Pitch shift, time stretch, noise | Expanded dataset |
| **4. Mel Spectrograms** | Converts audio → images | PNG/IMG files |
| **5. CNN Training** | Deep model learns spectrogram patterns | `.h5` model |
| **6. Evaluation** | Accuracy, F1, Confusion Matrix | Metrics + plots |

---


# 1. Loading & Inspecting Raw Audio

The dataset is loaded from the directory structure:

```python
from pathlib import Path

import pandas as pd

# Extensions treated as audio files
audio_extensions = {'.wav', '.mp3'}

paths = [(path.parts[-2], path.name, str(path))
         for path in Path(extract_to).rglob('*.*')
         if path.suffix.lower() in audio_extensions]

df = pd.DataFrame(paths, columns=['class', 'filename', 'full_path'])
df = df.sort_values('class').reset_index(drop=True)
```

During EDA, I computed:

* Duration
* Sample rate
* Peak amplitude

And visualized the duration distribution:

```python
import matplotlib.pyplot as plt

plt.hist(df['duration'], bins=30, edgecolor='black')
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of recordings")
plt.title("Audio Duration Distribution")
plt.show()
```

---

# 2. Audio Cleaning & Normalization

Bad samples were removed, silent files were filtered out, and amplitudes were peak-normalized:

```python
import numpy as np

# y: audio signal loaded with librosa
peak = np.abs(y).max()
if peak > 0:
    y = y / peak * 0.99
```
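
The silent-file filter is not shown above; a minimal version, assuming a small peak threshold (the `1e-4` cutoff is an illustrative choice to tune per dataset), looks like:

```python
import numpy as np

SILENCE_PEAK = 1e-4  # illustrative threshold, not the project's exact value

def is_silent(y):
    # Treat a clip as silent when its peak amplitude is negligible
    return float(np.abs(y).max()) < SILENCE_PEAK
```

Files for which `is_silent` returns `True` can then simply be dropped from the DataFrame before further processing.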

This ensures consistency and prevents the model from learning from corrupted audio.

---

# 3. Advanced Preprocessing

Preprocessing included:

* Silence trimming
* Noise reduction
* Resampling → **16 kHz**
* Mono conversion
* 5-second chunking

```python
TARGET_DURATION = 5.0
TARGET_SR = 16000
TARGET_LENGTH = int(TARGET_DURATION * TARGET_SR)
```
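
The 5-second chunking step can be sketched as splitting each resampled signal into fixed-length windows and zero-padding the final one (a sketch; the function name is mine, the constants are from above):

```python
import numpy as np

TARGET_SR = 16000
TARGET_LENGTH = int(5.0 * TARGET_SR)

def chunk_signal(y, length=TARGET_LENGTH):
    # Split y into fixed-length chunks, zero-padding the last partial chunk
    chunks = []
    for start in range(0, len(y), length):
        chunk = y[start:start + length]
        if len(chunk) < length:
            chunk = np.pad(chunk, (0, length - len(chunk)))
        chunks.append(chunk)
    return chunks
```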

Every audio file becomes a clean, consistent chunk ready for feature extraction.

---

# 4. Audio Augmentation

To improve generalization, I applied waveform-level augmentations with `audiomentations`:

```python
from audiomentations import (AddGaussianNoise, Compose, PitchShift,
                             Shift, TimeStretch)

augment = Compose([
    Shift(min_shift=-0.3, max_shift=0.3, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
])
```

Every augmented file receives a unique name to avoid collisions.

---

# 5. Mel Spectrogram Generation

Each cleaned audio chunk is transformed into a **Mel spectrogram**:

```python
import librosa
import numpy as np

S = librosa.feature.melspectrogram(
    y=y, sr=SR,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS
)
S_dB = librosa.power_to_db(S, ref=np.max)
```

* Output: **128×128 PNG images**
* Separate directories per class
* Supports both original & augmented samples
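
One way to write `S_dB` out as a PNG is to normalize the dB values and dump the array with a colormap (a sketch, not necessarily the project's exact export code; the helper name and the `magma` colormap are my choices):

```python
import matplotlib.pyplot as plt
import numpy as np

def save_spectrogram_png(S_dB, out_path):
    # Scale dB values to [0, 1] so the colormap spans the full range,
    # then save with low frequencies at the bottom of the image
    S_norm = (S_dB - S_dB.min()) / (S_dB.max() - S_dB.min() + 1e-8)
    plt.imsave(out_path, S_norm, cmap='magma', origin='lower')
```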

These images become the CNN input.

### ***Example of Mel Spectrogram Images***

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F27304693%2Ffdf7046a261734cd8f503c8f448ca6ad%2Fdownload.png?generation=1763570826533634&alt=media)

![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F27304693%2Fea53570ce051601192c90770091f7ceb%2Fdownload%20(1).png?generation=1763570855911665&alt=media)

---

# 6. Dataset Validation

After spectrogram creation:

* Corrupted images removed
* Duplicate hashes filtered
* Filename integrity checked
* Class folders validated

```python
df['file_hash'] = df['full_path'].apply(get_hash)
duplicate_hashes = df[df.duplicated(subset=['file_hash'], keep=False)]
```
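
`get_hash` is not defined in the snippet above; a typical implementation hashes the raw file bytes (a sketch; MD5 is an assumption here, any stable digest works for duplicate detection):

```python
import hashlib

def get_hash(path, chunk_size=8192):
    # Hash the file's bytes in chunks so large files never load fully into memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()
```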

This step ensures **clean, reliable** training data.

---

# 7. Building TensorFlow Datasets

The dataset is built with shuffling, batching, and prefetching:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
train_ds = train_ds.map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.shuffle(1024).batch(batch_size).prefetch(AUTOTUNE)
```
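
`load_and_preprocess` maps a PNG path to a normalized image tensor; a minimal version could look like this (a sketch: the 128×128 size and 3-channel decode are assumptions based on the spectrogram images, not confirmed project settings):

```python
import tensorflow as tf

IMG_SIZE = (128, 128)  # assumed spectrogram image size

def load_and_preprocess(path, label):
    # Read the PNG from disk, decode, resize, and scale pixels to [0, 1]
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    img = tf.cast(img, tf.float32) / 255.0
    return img, label
```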

I used a simple image-level augmentation pipeline:

```python
import tensorflow as tf

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(231, 232, 4)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```

---

# 8. CNN Architecture

The CNN captures deep frequency-time patterns across Mel images.

Key features:

* Multiple Conv2D + BatchNorm blocks
* Dropout
* L2 regularization
* Softmax output

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.regularizers import l2

model = Sequential([
    data_augmentation,
    Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer=l2(weight_decay)),
    BatchNormalization(),
    MaxPooling2D((2,2)),
    Dropout(0.2),
    # ... more layers ...
    Flatten(),
    Dense(num_classes, activation='softmax')
])
```

---

# 9. Training Strategy

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=40, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[reduce_lr, early_stopping]
)
```

The model converges smoothly while avoiding overfitting.

---

# 10. Evaluation

Performance is evaluated using:

* Accuracy
* Precision, recall, F1-score
* Confusion matrix
* ROC/AUC curves

```python
import numpy as np
from sklearn.metrics import classification_report

y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
```
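
`y_true` holds the test labels in batch order; with a `tf.data` pipeline it can be collected like this (a sketch, assuming `test_ds` yields `(image, label)` batches and is not shuffled):

```python
import numpy as np

def collect_labels(ds):
    # Concatenate the label component of every (image, label) batch
    return np.concatenate([labels.numpy() for _, labels in ds])
```

Then `y_true = collect_labels(test_ds)` lines up element-for-element with `y_pred` above.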

Confusion matrix:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
```

---

# 11. Saving the Model & Dataset

```python
import shutil

model.save("Audio_Model_Classification.h5")
shutil.make_archive("/content/spectrograms", 'zip', "/content/spectrograms")
```

The entire spectrogram dataset is also zipped for sharing or deployment.

---

# Final Notes

This project demonstrates:

* How to clean & prepare raw audio at a professional level
* Audio augmentation best practices
* How Mel spectrograms unlock CNN performance
* A full TensorFlow training pipeline
* Proper evaluation, reporting, and dataset integrity

If you're working on sound recognition, speech tasks, or environmental audio detection, this pipeline gives you a **complete production-grade foundation**.

---

# **Results**
> **Note:** Click the image below to view the video showcasing the project’s results.
<a href="https://files.catbox.moe/suzziy.mp4">
  <img src="https://images.unsplash.com/photo-1611162616475-46b635cb6868?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width="400">
</a>

<hr style="border-bottom: 5px solid gray; margin-top: 10px;">

> **Note:** If the video above is not working, you can access it directly via the link below.

[Watch Demo Video](Results/Spectrogram_CNN_Audio_Classification.mp4)