---
language: en
license: mit
tags:
- audio-classification
- tensorflow
- mel-spectrogram-images
- audio-processing
inference: true
datasets:
- AIOmarRehan/Mel_Spectrogram_Images_for_Audio_Classification
---
|
|
|
|
|
[If you would like a detailed explanation of this project, please refer to this Medium article.](https://medium.com/@ai.omar.rehan/building-a-complete-audio-classification-pipeline-using-deep-learning-from-raw-audio-to-mel-9894bd438d85)
|
|
|
|
|
--- |
|
|
|
|
|
[The project is also available for testing on Hugging Face.](https://huggingface.co/spaces/AIOmarRehan/Deep_Audio_Classifier_using_CNN) |
|
|
|
|
|
--- |
|
|
|
|
|
# Audio-Classification-Raw-Audio-to-Mel-Spectrogram-CNNs |
|
|
A complete end-to-end audio classification pipeline built with deep learning. From raw recordings to Mel-spectrogram CNNs, it covers preprocessing, augmentation, dataset validation, model training, and evaluation — a reproducible blueprint for speech, environmental, or general sound classification tasks.
|
|
|
|
|
--- |
|
|
|
|
|
# Audio Classification Pipeline — From Raw Audio to Mel-Spectrogram CNNs |
|
|
|
|
|
> *“In machine learning, the model is rarely the problem — the data almost always is.”*
> — A reminder I kept repeating to myself while building this project.
|
|
|
|
|
This repository contains a complete, professional, end-to-end pipeline for **audio classification using deep learning**, starting from **raw, messy audio recordings** and ending with a fully trained **CNN model** using **Mel spectrograms**. |
|
|
|
|
|
The workflow includes: |
|
|
|
|
|
* Raw audio loading |
|
|
* Cleaning & normalization |
|
|
* Silence trimming |
|
|
* Noise reduction |
|
|
* Chunking |
|
|
* Data augmentation |
|
|
* Mel spectrogram generation |
|
|
* Dataset validation |
|
|
* CNN training |
|
|
* Evaluation & metrics |
|
|
|
|
|
It is a fully reproducible blueprint for real-world audio classification tasks. |
|
|
|
|
|
--- |
|
|
|
|
|
# Project Structure |
|
|
|
|
|
Here is a quick table summarizing the core stages of the pipeline: |
|
|
|
|
|
| Stage | Description | Output |
| ----------------------- | -------------------------------------- | ---------------- |
| **1. Raw Audio** | Unprocessed WAV/MP3 files | Audio dataset |
| **2. Preprocessing** | Trimming, cleaning, resampling | Cleaned signals |
| **3. Augmentation** | Pitch shift, time stretch, noise | Expanded dataset |
| **4. Mel Spectrograms** | Converts audio → images | PNG/IMG files |
| **5. CNN Training** | Deep model learns spectrogram patterns | `.h5` model |
| **6. Evaluation** | Accuracy, F1, Confusion Matrix | Metrics + plots |
|
|
|
|
|
--- |
|
|
|
|
|
# 1. Loading & Inspecting Raw Audio |
|
|
|
|
|
The dataset is loaded from its directory structure, where each file's parent folder name serves as its class label:
|
|
|
|
|
```python
from pathlib import Path

import pandas as pd

# extract_to is the directory the raw dataset was extracted into
audio_extensions = {'.wav', '.mp3'}

# (class, filename, full path) for every audio file under extract_to
paths = [(path.parts[-2], path.name, str(path))
         for path in Path(extract_to).rglob('*.*')
         if path.suffix.lower() in audio_extensions]

df = pd.DataFrame(paths, columns=['class', 'filename', 'full_path'])
df = df.sort_values('class').reset_index(drop=True)
```
|
|
|
|
|
During EDA, I computed: |
|
|
|
|
|
* Duration |
|
|
* Sample rate |
|
|
* Peak amplitude |
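
A minimal sketch of how these per-file statistics can be gathered with `librosa` (the helper and column names are illustrative):

```python
import librosa
import numpy as np
import pandas as pd

def audio_stats(path):
    # Load at the file's native sample rate (sr=None avoids resampling)
    y, sr = librosa.load(path, sr=None)
    return pd.Series({'duration': len(y) / sr,
                      'sample_rate': sr,
                      'peak': np.abs(y).max()})

df[['duration', 'sample_rate', 'peak']] = df['full_path'].apply(audio_stats)
```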
|
|
|
|
|
And visualized duration distribution: |
|
|
|
|
|
```python
import matplotlib.pyplot as plt

plt.hist(df['duration'], bins=30, edgecolor='black')
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of recordings")
plt.title("Audio Duration Distribution")
plt.show()
```
|
|
|
|
|
--- |
|
|
|
|
|
# 2. Audio Cleaning & Normalization |
|
|
|
|
|
Bad samples were removed, silent files filtered, and amplitudes normalized: |
|
|
|
|
|
```python
import numpy as np

# Peak-normalize to 0.99 so every signal uses the same amplitude range
# without clipping; silent files (peak == 0) are left untouched.
peak = np.abs(y).max()
if peak > 0:
    y = y / peak * 0.99
```
|
|
|
|
|
This ensures consistency and prevents the model from learning from corrupted audio. |
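
The silent-file filter itself can be as simple as a peak-amplitude threshold; a sketch, with an illustrative cutoff:

```python
import numpy as np

SILENCE_PEAK_THRESHOLD = 1e-4  # illustrative cutoff, not the notebook's exact value

def is_silent(y):
    # Treat a file as silent if its loudest sample is below the threshold
    return np.abs(y).max() < SILENCE_PEAK_THRESHOLD
```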
|
|
|
|
|
--- |
|
|
|
|
|
# 3. Advanced Preprocessing |
|
|
|
|
|
Preprocessing included: |
|
|
|
|
|
* Silence trimming |
|
|
* Noise reduction |
|
|
* Resampling → **16 kHz** |
|
|
* Mono conversion |
|
|
* 5-second chunking |
|
|
|
|
|
```python
TARGET_DURATION = 5.0   # seconds per chunk
TARGET_SR = 16000       # resampling rate (16 kHz)
TARGET_LENGTH = int(TARGET_DURATION * TARGET_SR)  # samples per chunk
```
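
A condensed sketch of the chain (load as mono at 16 kHz, trim silence, then pad or split into 5-second chunks); the `top_db` value is illustrative, and the noise-reduction step is omitted for brevity:

```python
import librosa
import numpy as np

def preprocess(path):
    # Load as mono and resample to the target rate in one step
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    # Trim leading/trailing silence (top_db is an illustrative setting)
    y, _ = librosa.effects.trim(y, top_db=30)
    # Split into fixed 5-second chunks, zero-padding the last one
    chunks = []
    for start in range(0, len(y), TARGET_LENGTH):
        chunk = y[start:start + TARGET_LENGTH]
        if len(chunk) < TARGET_LENGTH:
            chunk = np.pad(chunk, (0, TARGET_LENGTH - len(chunk)))
        chunks.append(chunk)
    return chunks
```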
|
|
|
|
|
Every audio file becomes a clean, consistent chunk ready for feature extraction. |
|
|
|
|
|
--- |
|
|
|
|
|
# 4. Audio Augmentation |
|
|
|
|
|
To improve generalization, I applied augmentations: |
|
|
|
|
|
```python
from audiomentations import (AddGaussianNoise, Compose, PitchShift, Shift,
                             TimeStretch)

# Each transform fires independently with probability p
augment = Compose([
    Shift(min_shift=-0.3, max_shift=0.3, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
])
```
|
|
|
|
|
Every augmented file receives a unique name to avoid collisions. |
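
A sketch of how each chunk can be augmented and written out under a collision-free name; the `soundfile` dependency and uuid-based naming are illustrative choices:

```python
import uuid

import soundfile as sf

def save_augmented(y, sr, out_dir, stem):
    # Apply the Compose pipeline defined above
    y_aug = augment(samples=y, sample_rate=sr)
    # Unique suffix prevents two augmented copies from overwriting each other
    out_path = f"{out_dir}/{stem}_aug_{uuid.uuid4().hex[:8]}.wav"
    sf.write(out_path, y_aug, sr)
    return out_path
```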
|
|
|
|
|
--- |
|
|
|
|
|
# 5. Mel Spectrogram Generation |
|
|
|
|
|
Each cleaned audio chunk is transformed into a **Mel spectrogram**: |
|
|
|
|
|
```python
import librosa
import numpy as np

# SR, N_FFT, HOP_LENGTH and N_MELS come from the project config;
# SR matches the 16 kHz target rate and N_MELS the 128-band output.
S = librosa.feature.melspectrogram(
    y=y, sr=SR,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS
)
# Convert power to decibels relative to the loudest frame
S_dB = librosa.power_to_db(S, ref=np.max)
```
|
|
|
|
|
* Output: **128×128 PNG images** |
|
|
* Separate directories per class |
|
|
* Supports both original & augmented samples |
|
|
|
|
|
These images become the CNN input. |
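
One way to render `S_dB` to a PNG with matplotlib (exact figure sizing in the notebook may differ):

```python
import librosa.display
import matplotlib.pyplot as plt

def save_spectrogram_png(S_dB, out_path):
    # Render the dB-scaled Mel spectrogram as a borderless image
    fig, ax = plt.subplots()
    ax.axis('off')
    librosa.display.specshow(S_dB, sr=SR, hop_length=HOP_LENGTH, ax=ax)
    fig.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)
```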
|
|
|
|
|
### ***Example of Mel Spectrogram Images*** |
|
|
|
|
|
 |
|
|
|
|
|
.png?generation=1763570855911665&alt=media) |
|
|
|
|
|
--- |
|
|
|
|
|
# 6. Dataset Validation |
|
|
|
|
|
After spectrogram creation: |
|
|
|
|
|
* Corrupted images removed |
|
|
* Duplicate hashes filtered |
|
|
* Filename integrity checked |
|
|
* Class folders validated |
|
|
|
|
|
```python
# get_hash returns a content digest of each image file (sketched below)
df['file_hash'] = df['full_path'].apply(get_hash)
duplicate_hashes = df[df.duplicated(subset=['file_hash'], keep=False)]
```
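
`get_hash` is a small helper; a minimal MD5-based version might look like this:

```python
import hashlib

def get_hash(path):
    # Hash the raw file bytes so byte-identical images collide
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
```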
|
|
|
|
|
This step ensures **clean, reliable** training data. |
|
|
|
|
|
--- |
|
|
|
|
|
# 7. Building TensorFlow Datasets |
|
|
|
|
|
The dataset is built with shuffling, batching, and prefetching:
|
|
|
|
|
```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
train_ds = train_ds.map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.shuffle(1024).batch(batch_size).prefetch(AUTOTUNE)
```
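
`load_and_preprocess` turns each PNG path into a scaled image tensor; a sketch consistent with the 231×232×4 input shape used below:

```python
import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH = 231, 232  # matches the InputLayer shape below

def load_and_preprocess(path, label):
    image = tf.io.read_file(path)
    # The spectrogram PNGs decode with 4 channels (RGBA)
    image = tf.image.decode_png(image, channels=4)
    image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
    image = tf.cast(image, tf.float32) / 255.0  # scale to [0, 1]
    return image, label
```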
|
|
|
|
|
I used a simple image-level augmentation pipeline: |
|
|
|
|
|
```python
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(231, 232, 4)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```
|
|
|
|
|
--- |
|
|
|
|
|
# 8. CNN Architecture |
|
|
|
|
|
The CNN captures deep frequency-time patterns across Mel images. |
|
|
|
|
|
Key features: |
|
|
|
|
|
* Multiple Conv2D + BatchNorm blocks |
|
|
* Dropout |
|
|
* L2 regularization |
|
|
* Softmax output |
|
|
|
|
|
```python
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

# weight_decay is the L2 coefficient set earlier in the notebook
model = Sequential([
    data_augmentation,
    Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer=l2(weight_decay)),
    BatchNormalization(),
    MaxPooling2D((2,2)),
    Dropout(0.2),
    # ... more layers ...
    Flatten(),
    Dense(num_classes, activation='softmax')
])
```
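
Before training, the model needs to be compiled; assuming integer class labels, a typical setup is:

```python
import tensorflow as tf

# Assumes integer class labels; swap in 'categorical_crossentropy'
# if the labels are one-hot encoded.
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```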
|
|
|
|
|
--- |
|
|
|
|
|
# 9. Training Strategy |
|
|
|
|
|
```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Halve the LR when validation loss stalls; stop (and restore the best
# weights) if it fails to improve for 40 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=40, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[reduce_lr, early_stopping]
)
```
|
|
|
|
|
With the learning-rate schedule and early stopping in place, the model converges smoothly while avoiding overfitting.
|
|
|
|
|
--- |
|
|
|
|
|
# 10. Evaluation |
|
|
|
|
|
Performance is evaluated using: |
|
|
|
|
|
* Accuracy |
|
|
* Precision, recall, F1-score |
|
|
* Confusion matrix |
|
|
* ROC/AUC curves |
|
|
|
|
|
```python
import numpy as np
from sklearn.metrics import classification_report

# le is the LabelEncoder fitted on the class names
y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
```
|
|
|
|
|
Confusion matrix: |
|
|
|
|
|
```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
```
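
For the ROC/AUC curves, one option is a one-vs-rest plot built from the predicted class probabilities; a sketch reusing `y_true` and `le` from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_score = model.predict(test_ds)  # per-class probabilities from softmax
y_true_bin = label_binarize(y_true, classes=list(range(len(le.classes_))))

# One ROC curve per class, one-vs-rest
for i, name in enumerate(le.classes_):
    fpr, tpr, _ = roc_curve(y_true_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC Curves")
plt.legend()
plt.show()
```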
|
|
|
|
|
--- |
|
|
|
|
|
# 11. Saving the Model & Dataset |
|
|
|
|
|
```python
import shutil

model.save("Audio_Model_Classification.h5")
shutil.make_archive("/content/spectrograms", 'zip', "/content/spectrograms")
```
|
|
|
|
|
The entire spectrogram dataset is also zipped for sharing or deployment. |
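
Reloading the saved model for inference is then a one-liner:

```python
import tensorflow as tf

model = tf.keras.models.load_model("Audio_Model_Classification.h5")
```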
|
|
|
|
|
--- |
|
|
|
|
|
# Final Notes |
|
|
|
|
|
This project demonstrates: |
|
|
|
|
|
* How to clean & prepare raw audio at a professional level |
|
|
* Audio augmentation best practices |
|
|
* How Mel spectrograms unlock CNN performance |
|
|
* A full TensorFlow training pipeline |
|
|
* Proper evaluation, reporting, and dataset integrity |
|
|
|
|
|
If you're working on sound recognition, speech tasks, or environmental audio detection, this pipeline gives you a **complete production-grade foundation**. |
|
|
|
|
|
--- |
|
|
|
|
|
# **Results** |
|
|
> **Note:** Click the image below to view the video showcasing the project’s results. |
|
|
<a href="https://files.catbox.moe/suzziy.mp4">
  <img src="https://images.unsplash.com/photo-1611162616475-46b635cb6868?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width="400">
</a>
|
|
|
|
|
<hr style="border-bottom: 5px solid gray; margin-top: 10px;"> |
|
|
|
|
|
> **Note:** If the video above is not working, you can access it directly via the link below. |
|
|
|
|
|
[Watch Demo Video](Results/Spectrogram_CNN_Audio_Classification.mp4) |
|
|
|