---
language: en
license: mit
tags:
- audio-classification
- tensorflow
- mel-spectrogram-images
- audio-processing
inference: true
datasets:
- AIOmarRehan/Mel_Spectrogram_Images_for_Audio_Classification
---
|
|
|
|
|
[If you would like a detailed explanation of this project, please refer to this Medium article.](https://medium.com/@ai.omar.rehan/building-a-complete-audio-classification-pipeline-using-deep-learning-from-raw-audio-to-mel-9894bd438d85)
|
|
|
|
|
--- |
|
|
|
|
|
[The project is also available for testing on Hugging Face.](https://huggingface.co/spaces/AIOmarRehan/Deep_Audio_Classifier_using_CNN) |
|
|
|
|
|
--- |
|
|
|
|
|
# Audio-Classification-Raw-Audio-to-Mel-Spectrogram-CNNs |
|
|
A complete end-to-end audio classification pipeline built with deep learning. From raw recordings to Mel-spectrogram CNNs, it covers preprocessing, augmentation, dataset validation, model training, and evaluation — a reproducible blueprint for speech, environmental, or general sound classification tasks.
|
|
|
|
|
--- |
|
|
|
|
|
# Audio Classification Pipeline — From Raw Audio to Mel-Spectrogram CNNs |
|
|
|
|
|
> *“In machine learning, the model is rarely the problem — the data almost always is.”*
> — A reminder I kept repeating to myself while building this project.
|
|
|
|
|
This repository contains a complete, professional, end-to-end pipeline for **audio classification using deep learning**, starting from **raw, messy audio recordings** and ending with a fully trained **CNN model** using **Mel spectrograms**. |
|
|
|
|
|
The workflow includes: |
|
|
|
|
|
* Raw audio loading |
|
|
* Cleaning & normalization |
|
|
* Silence trimming |
|
|
* Noise reduction |
|
|
* Chunking |
|
|
* Data augmentation |
|
|
* Mel spectrogram generation |
|
|
* Dataset validation |
|
|
* CNN training |
|
|
* Evaluation & metrics |
|
|
|
|
|
It is a fully reproducible blueprint for real-world audio classification tasks. |
|
|
|
|
|
--- |
|
|
|
|
|
# Project Structure |
|
|
|
|
|
Here is a quick table summarizing the core stages of the pipeline: |
|
|
|
|
|
| Stage | Description | Output |
| ----------------------- | -------------------------------------- | ---------------- |
| **1. Raw Audio** | Unprocessed WAV/MP3 files | Audio dataset |
| **2. Preprocessing** | Trimming, cleaning, resampling | Cleaned signals |
| **3. Augmentation** | Pitch shift, time stretch, noise | Expanded dataset |
| **4. Mel Spectrograms** | Converts audio → images | PNG/IMG files |
| **5. CNN Training** | Deep model learns spectrogram patterns | `.h5` model |
| **6. Evaluation** | Accuracy, F1, Confusion Matrix | Metrics + plots |
|
|
|
|
|
--- |
|
|
|
|
|
# 1. Loading & Inspecting Raw Audio |
|
|
|
|
|
The dataset is loaded from its directory structure, where each file's parent folder name serves as its class label:
|
|
|
|
|
```python
from pathlib import Path

import pandas as pd

# extract_to is the directory the raw dataset was extracted into
audio_extensions = {'.wav', '.mp3'}

# (class, filename, full path) for every audio file under extract_to
paths = [(path.parts[-2], path.name, str(path))
         for path in Path(extract_to).rglob('*.*')
         if path.suffix.lower() in audio_extensions]

df = pd.DataFrame(paths, columns=['class', 'filename', 'full_path'])
df = df.sort_values('class').reset_index(drop=True)
```
|
|
|
|
|
During EDA, I computed: |
|
|
|
|
|
* Duration |
|
|
* Sample rate |
|
|
* Peak amplitude |
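
A minimal sketch of how these per-file statistics can be gathered with `librosa` (the helper and column names are illustrative):

```python
import librosa
import numpy as np
import pandas as pd

def audio_stats(path):
    # Load at the file's native sample rate (sr=None avoids resampling)
    y, sr = librosa.load(path, sr=None)
    return pd.Series({'duration': len(y) / sr,
                      'sample_rate': sr,
                      'peak': np.abs(y).max()})

df[['duration', 'sample_rate', 'peak']] = df['full_path'].apply(audio_stats)
```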
|
|
|
|
|
And visualized duration distribution: |
|
|
|
|
|
```python
import matplotlib.pyplot as plt

plt.hist(df['duration'], bins=30, edgecolor='black')
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of recordings")
plt.title("Audio Duration Distribution")
plt.show()
```
|
|
|
|
|
--- |
|
|
|
|
|
# 2. Audio Cleaning & Normalization |
|
|
|
|
|
Bad samples were removed, silent files filtered, and amplitudes normalized: |
|
|
|
|
|
```python
import numpy as np

# Peak-normalize to 0.99 so every signal uses the same amplitude range
# without clipping; silent files (peak == 0) are left untouched.
peak = np.abs(y).max()
if peak > 0:
    y = y / peak * 0.99
```
|
|
|
|
|
This ensures consistency and prevents the model from learning from corrupted audio. |
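
The silent-file filter itself can be as simple as a peak-amplitude threshold; a sketch, with an illustrative cutoff:

```python
import numpy as np

SILENCE_PEAK_THRESHOLD = 1e-4  # illustrative cutoff, not the notebook's exact value

def is_silent(y):
    # Treat a file as silent if its loudest sample is below the threshold
    return np.abs(y).max() < SILENCE_PEAK_THRESHOLD
```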
|
|
|
|
|
--- |
|
|
|
|
|
# 3. Advanced Preprocessing |
|
|
|
|
|
Preprocessing included: |
|
|
|
|
|
* Silence trimming |
|
|
* Noise reduction |
|
|
* Resampling → **16 kHz** |
|
|
* Mono conversion |
|
|
* 5-second chunking |
|
|
|
|
|
```python
TARGET_DURATION = 5.0   # seconds per chunk
TARGET_SR = 16000       # resampling rate (16 kHz)
TARGET_LENGTH = int(TARGET_DURATION * TARGET_SR)  # samples per chunk
```
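
A condensed sketch of the chain (load as mono at 16 kHz, trim silence, then pad or split into 5-second chunks); the `top_db` value is illustrative, and the noise-reduction step is omitted for brevity:

```python
import librosa
import numpy as np

def preprocess(path):
    # Load as mono and resample to the target rate in one step
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    # Trim leading/trailing silence (top_db is an illustrative setting)
    y, _ = librosa.effects.trim(y, top_db=30)
    # Split into fixed 5-second chunks, zero-padding the last one
    chunks = []
    for start in range(0, len(y), TARGET_LENGTH):
        chunk = y[start:start + TARGET_LENGTH]
        if len(chunk) < TARGET_LENGTH:
            chunk = np.pad(chunk, (0, TARGET_LENGTH - len(chunk)))
        chunks.append(chunk)
    return chunks
```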
|
|
|
|
|
Every audio file becomes a clean, consistent chunk ready for feature extraction. |
|
|
|
|
|
--- |
|
|
|
|
|
# 4. Audio Augmentation |
|
|
|
|
|
To improve generalization, I applied augmentations: |
|
|
|
|
|
```python
from audiomentations import (AddGaussianNoise, Compose, PitchShift, Shift,
                             TimeStretch)

# Each transform fires independently with probability p
augment = Compose([
    Shift(min_shift=-0.3, max_shift=0.3, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
])
```
|
|
|
|
|
Every augmented file receives a unique name to avoid collisions. |
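
A sketch of how each chunk can be augmented and written out under a collision-free name; the `soundfile` dependency and uuid-based naming are illustrative choices:

```python
import uuid

import soundfile as sf

def save_augmented(y, sr, out_dir, stem):
    # Apply the Compose pipeline defined above
    y_aug = augment(samples=y, sample_rate=sr)
    # Unique suffix prevents two augmented copies from overwriting each other
    out_path = f"{out_dir}/{stem}_aug_{uuid.uuid4().hex[:8]}.wav"
    sf.write(out_path, y_aug, sr)
    return out_path
```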
|
|
|
|
|
--- |
|
|
|
|
|
# 5. Mel Spectrogram Generation |
|
|
|
|
|
Each cleaned audio chunk is transformed into a **Mel spectrogram**: |
|
|
|
|
|
```python
import librosa
import numpy as np

# SR, N_FFT, HOP_LENGTH and N_MELS come from the project config;
# SR matches the 16 kHz target rate and N_MELS the 128-band output.
S = librosa.feature.melspectrogram(
    y=y, sr=SR,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS
)
# Convert power to decibels relative to the loudest frame
S_dB = librosa.power_to_db(S, ref=np.max)
```
|
|
|
|
|
* Output: **128×128 PNG images** |
|
|
* Separate directories per class |
|
|
* Supports both original & augmented samples |
|
|
|
|
|
These images become the CNN input. |
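
One way to render `S_dB` to a PNG with matplotlib (exact figure sizing in the notebook may differ):

```python
import librosa.display
import matplotlib.pyplot as plt

def save_spectrogram_png(S_dB, out_path):
    # Render the dB-scaled Mel spectrogram as a borderless image
    fig, ax = plt.subplots()
    ax.axis('off')
    librosa.display.specshow(S_dB, sr=SR, hop_length=HOP_LENGTH, ax=ax)
    fig.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)
```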
|
|
|
|
|
### ***Example of Mel Spectrogram Images*** |
|
|
|
|
|
 |
|
|
|
|
|
.png?generation=1763570855911665&alt=media) |
|
|
|
|
|
--- |
|
|
|
|
|
# 6. Dataset Validation |
|
|
|
|
|
After spectrogram creation: |
|
|
|
|
|
* Corrupted images removed |
|
|
* Duplicate hashes filtered |
|
|
* Filename integrity checked |
|
|
* Class folders validated |
|
|
|
|
|
```python
# get_hash returns a content digest of each image file (sketched below)
df['file_hash'] = df['full_path'].apply(get_hash)
duplicate_hashes = df[df.duplicated(subset=['file_hash'], keep=False)]
```
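
`get_hash` is a small helper; a minimal MD5-based version might look like this:

```python
import hashlib

def get_hash(path):
    # Hash the raw file bytes so byte-identical images collide
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
```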
|
|
|
|
|
This step ensures **clean, reliable** training data. |
|
|
|
|
|
--- |
|
|
|
|
|
# 7. Building TensorFlow Datasets |
|
|
|
|
|
The dataset is built with shuffling, batching, and prefetching:
|
|
|
|
|
```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
train_ds = train_ds.map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.shuffle(1024).batch(batch_size).prefetch(AUTOTUNE)
```
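
`load_and_preprocess` turns each PNG path into a scaled image tensor; a sketch consistent with the 231×232×4 input shape used below:

```python
import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH = 231, 232  # matches the InputLayer shape below

def load_and_preprocess(path, label):
    image = tf.io.read_file(path)
    # The spectrogram PNGs decode with 4 channels (RGBA)
    image = tf.image.decode_png(image, channels=4)
    image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
    image = tf.cast(image, tf.float32) / 255.0  # scale to [0, 1]
    return image, label
```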
|
|
|
|
|
I used a simple image-level augmentation pipeline: |
|
|
|
|
|
```python
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(231, 232, 4)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])
```
|
|
|
|
|
--- |
|
|
|
|
|
# 8. CNN Architecture |
|
|
|
|
|
The CNN captures deep frequency-time patterns across Mel images. |
|
|
|
|
|
Key features: |
|
|
|
|
|
* Multiple Conv2D + BatchNorm blocks |
|
|
* Dropout |
|
|
* L2 regularization |
|
|
* Softmax output |
|
|
|
|
|
```python
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2

# weight_decay is the L2 coefficient set earlier in the notebook
model = Sequential([
    data_augmentation,
    Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer=l2(weight_decay)),
    BatchNormalization(),
    MaxPooling2D((2,2)),
    Dropout(0.2),
    # ... more layers ...
    Flatten(),
    Dense(num_classes, activation='softmax')
])
```
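
Before training, the model needs to be compiled; assuming integer class labels, a typical setup is:

```python
import tensorflow as tf

# Assumes integer class labels; swap in 'categorical_crossentropy'
# if the labels are one-hot encoded.
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
```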
|
|
|
|
|
--- |
|
|
|
|
|
# 9. Training Strategy |
|
|
|
|
|
```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Halve the LR when validation loss stalls; stop (and restore the best
# weights) if it fails to improve for 40 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=40, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[reduce_lr, early_stopping]
)
```
|
|
|
|
|
With the learning-rate schedule and early stopping in place, the model converges smoothly while avoiding overfitting.
|
|
|
|
|
--- |
|
|
|
|
|
# 10. Evaluation |
|
|
|
|
|
Performance is evaluated using: |
|
|
|
|
|
* Accuracy |
|
|
* Precision, recall, F1-score |
|
|
* Confusion matrix |
|
|
* ROC/AUC curves |
|
|
|
|
|
```python
import numpy as np
from sklearn.metrics import classification_report

# le is the LabelEncoder fitted on the class names
y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
```
|
|
|
|
|
Confusion matrix: |
|
|
|
|
|
```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
```
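
For the ROC/AUC curves, one option is a one-vs-rest plot built from the predicted class probabilities; a sketch reusing `y_true` and `le` from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_score = model.predict(test_ds)  # per-class probabilities from softmax
y_true_bin = label_binarize(y_true, classes=list(range(len(le.classes_))))

# One ROC curve per class, one-vs-rest
for i, name in enumerate(le.classes_):
    fpr, tpr, _ = roc_curve(y_true_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC Curves")
plt.legend()
plt.show()
```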
|
|
|
|
|
--- |
|
|
|
|
|
# 11. Saving the Model & Dataset |
|
|
|
|
|
```python
import shutil

model.save("Audio_Model_Classification.h5")
shutil.make_archive("/content/spectrograms", 'zip', "/content/spectrograms")
```
|
|
|
|
|
The entire spectrogram dataset is also zipped for sharing or deployment. |
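
Reloading the saved model for inference is then a one-liner:

```python
import tensorflow as tf

model = tf.keras.models.load_model("Audio_Model_Classification.h5")
```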
|
|
|
|
|
--- |
|
|
|
|
|
# Final Notes |
|
|
|
|
|
This project demonstrates: |
|
|
|
|
|
* How to clean & prepare raw audio at a professional level |
|
|
* Audio augmentation best practices |
|
|
* How Mel spectrograms unlock CNN performance |
|
|
* A full TensorFlow training pipeline |
|
|
* Proper evaluation, reporting, and dataset integrity |
|
|
|
|
|
If you're working on sound recognition, speech tasks, or environmental audio detection, this pipeline gives you a **complete production-grade foundation**. |
|
|
|
|
|
--- |
|
|
|
|
|
# **Results** |
|
|
> **Note:** Click the image below to view the video showcasing the project’s results. |
|
|
<a href="https://files.catbox.moe/suzziy.mp4">
  <img src="https://images.unsplash.com/photo-1611162616475-46b635cb6868?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" width="400">
</a>
|
|
|
|
|
<hr style="border-bottom: 5px solid gray; margin-top: 10px;"> |
|
|
|
|
|
> **Note:** If the video above is not working, you can access it directly via the link below. |
|
|
|
|
|
[Watch Demo Video](Results/Spectrogram_CNN_Audio_Classification.mp4) |
|
|
|