---
license: apache-2.0
language:
- en
tags:
- audio-classification
- pytorch
- se-resnet
- machine-fault-detection
- predictive-maintenance
- mel-spectrogram
pipeline_tag: audio-classification
---

# Filter-Tank — Machine Fault Recognition

A deep learning system that listens to factory machine audio recordings and classifies them into 6 categories: 3 machine types, each in either a normal or abnormal state. Built from scratch using an SE-ResNet on log-mel spectrograms.

---

## Overview

Filter-Tank is a complete machine learning pipeline for predictive maintenance. Given a raw `.wav` recording of a factory machine, the system detects whether the machine is operating normally or has developed a fault, and identifies which machine type it belongs to.

The model is a custom SE-ResNet (Squeeze-and-Excitation ResNet) trained entirely from scratch with no pretrained weights, designed specifically for 1-channel log-mel spectrogram input.

---

## Classes

| Label | Description          |
|-------|----------------------|
| 0     | Machine 1 — Normal   |
| 1     | Machine 1 — Abnormal |
| 2     | Machine 2 — Normal   |
| 3     | Machine 2 — Abnormal |
| 4     | Machine 3 — Normal   |
| 5     | Machine 3 — Abnormal |

---

## Preprocessing Pipeline

Every audio file passes through a multi-stage preprocessing pipeline before reaching the model. All steps run on CPU. File I/O is excluded from the inference timer; only preprocessing + prediction time is measured.

### 1. Resampling
All audio is resampled to a fixed sample rate of 16,000 Hz to ensure consistency across recordings made with different microphones or recording equipment.

### 2. Noise Reduction
Non-stationary background noise is removed using the `noisereduce` library at full strength (`prop_decrease=1.0`). This handles real-world factory environments where background noise varies significantly between recordings.

### 3. Silence Trimming
Leading and trailing silence is removed with librosa's trim function (`top_db=20`), so the model focuses on the actual machine sound rather than quiet gaps at the start or end of a recording.

### 4. Fixed-Length Normalization
All recordings are normalized to exactly 11 seconds: longer files are truncated at the end, shorter files are zero-padded at the end. This gives the model a consistent input size regardless of the original recording length.
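Step 4 reduces to a few lines of NumPy; `fix_length` is an illustrative name, not necessarily the project's actual helper (steps 1 and 3 would typically go through `librosa.resample` and `librosa.effects.trim`):

```python
import numpy as np

SAMPLE_RATE = 16_000                       # Hz, fixed in step 1
TARGET_LEN = SAMPLE_RATE * 11              # 11 s -> 176,000 samples

def fix_length(wave: np.ndarray) -> np.ndarray:
    """Truncate or zero-pad a mono waveform to exactly 11 seconds."""
    if len(wave) >= TARGET_LEN:
        return wave[:TARGET_LEN]                       # truncate at the end
    return np.pad(wave, (0, TARGET_LEN - len(wave)))   # zero-pad at the end
```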
### 5. Log-Mel Spectrogram
The waveform is converted into a 2D log-mel spectrogram with the following settings:
- Mel bands: 128
- FFT window size: 1024
- Hop length: 512
- Power: 2.0 (power spectrogram)
- Amplitude converted to dB scale (`top_db=80`)

This transforms the raw audio signal into a time-frequency representation that the convolutional model can process effectively.

### 6. CMVN Normalization
Cepstral Mean and Variance Normalization is applied per sample: each spectrogram is normalized to zero mean and unit variance along the time axis. This compensates for volume variations and differences in microphone sensitivity across recordings.
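The CMVN step is a one-liner per spectrogram; the sketch below assumes a `(n_mels, n_frames)` NumPy array (with librosa, the spectrogram itself would come from `librosa.feature.melspectrogram` followed by `librosa.power_to_db`):

```python
import numpy as np

def cmvn(spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize a (n_mels, n_frames) log-mel spectrogram to zero mean
    and unit variance along the time axis (per mel band, per sample)."""
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)   # eps guards silent (constant) bands
```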
---

## Model Architecture

### SE-ResNet (Squeeze-and-Excitation ResNet)

The model follows a standard ResNet structure enhanced with Squeeze-and-Excitation (SE) attention blocks at every residual stage.

**Stem:** A 7x7 convolution (stride 2) followed by batch normalization, ReLU, and max pooling reduces the input resolution before the residual stages.

**4 Residual Stages:**
- Stage 1: 3 SE-residual blocks, 64 channels
- Stage 2: 4 SE-residual blocks, 128 channels (stride 2)
- Stage 3: 6 SE-residual blocks, 256 channels (stride 2)
- Stage 4: 3 SE-residual blocks, 512 channels (stride 2)

**SE Attention Block:** Each residual block includes a Squeeze-and-Excitation module that performs global average pooling, passes the result through two fully connected layers with a bottleneck (`reduction=16`), and produces per-channel attention weights via a sigmoid. This lets the model focus on the most informative frequency channels for each input.
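A minimal PyTorch sketch of the SE module described above (names are illustrative; an actual implementation might use 1x1 `Conv2d` layers instead of `Linear`):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weights channels by learned attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # bottleneck
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average pool over the (freq, time) plane -> (B, C)
        w = x.mean(dim=(2, 3))
        # Excitation: bottleneck MLP + sigmoid -> per-channel weights in (0, 1)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))
        # Scale: broadcast the weights back over freq and time
        return x * w[:, :, None, None]
```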
**Head:** Global Average Pooling → Dropout (0.3) → Fully Connected layer → 6-class output.

**Weight Initialization:**
- Conv layers: Kaiming normal (`fan_out`, ReLU)
- BatchNorm: weight = 1, bias = 0
- Linear layers: Xavier uniform
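The initialization table maps directly onto a `model.apply(...)` hook; a sketch, assuming 2D convolutions and batch norm:

```python
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    """Weight init scheme from the list above; use model.apply(init_weights)."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
```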
**Total Parameters:** ~11 million

---

## Training Details

### Dataset Split
The dataset is divided with stratified splitting to ensure balanced class representation across all splits:
- Training set: 80%
- Validation set: 10%
- Test set: 10%

Stratification is done on machine type and condition combined, so each split has proportional representation of all 6 classes.
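An 80/10/10 stratified split is typically done in two passes with scikit-learn's `train_test_split`; a sketch (`random_state=42` is an illustrative choice, not necessarily the project's seed):

```python
from sklearn.model_selection import train_test_split

def split_indices(labels):
    """Return stratified (train, val, test) index lists: 80% / 10% / 10%."""
    idx = list(range(len(labels)))
    # First pass: 80% train vs. 20% holdout, stratified on the 6-way label.
    train, hold = train_test_split(
        idx, test_size=0.2, stratify=labels, random_state=42)
    # Second pass: split the holdout in half -> 10% val, 10% test.
    val, test = train_test_split(
        hold, test_size=0.5, stratify=[labels[i] for i in hold], random_state=42)
    return train, val, test
```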
### Class Imbalance Handling
A `WeightedRandomSampler` oversamples underrepresented classes during training, so the model sees a roughly balanced distribution of all 6 classes per epoch regardless of the original dataset distribution.
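Inverse-frequency weights are the standard way to build such a sampler; a sketch:

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels) -> WeightedRandomSampler:
    """One weight per sample: 1 / (count of its class), so rare classes
    are drawn about as often as common ones (sampling with replacement)."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```

The sampler is then passed to the `DataLoader` via its `sampler` argument.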
### Data Augmentation
Two augmentation strategies are applied during training:

**SpecAugment (online, per batch):**
Applied directly to the spectrogram tensors during training. Two frequency masks (`freq_mask_param=20`) and two time masks (`time_mask_param=40`) are applied at random positions, forcing the model to be robust to missing frequency bands and time segments.
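In plain PyTorch the masking reduces to zeroing random bands; torchaudio's `FrequencyMasking`/`TimeMasking` transforms do the same thing. A sketch for a batch of `(B, 1, n_mels, n_frames)` tensors:

```python
import torch

def spec_augment(spec: torch.Tensor,
                 n_freq_masks: int = 2, freq_param: int = 20,
                 n_time_masks: int = 2, time_param: int = 40) -> torch.Tensor:
    """Zero out random frequency bands and time segments (on a copy)."""
    out = spec.clone()
    n_mels, n_frames = out.shape[-2], out.shape[-1]
    for _ in range(n_freq_masks):
        width = int(torch.randint(0, freq_param + 1, (1,)))
        start = int(torch.randint(0, n_mels - width + 1, (1,)))
        out[..., start:start + width, :] = 0.0       # mask a frequency band
    for _ in range(n_time_masks):
        width = int(torch.randint(0, time_param + 1, (1,)))
        start = int(torch.randint(0, n_frames - width + 1, (1,)))
        out[..., :, start:start + width] = 0.0       # mask a time segment
    return out
```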
**Mixup (online, per batch):**
Pairs of training samples are blended with a random interpolation weight drawn from a Beta distribution (`alpha=0.4`). Both the input spectrograms and their labels are mixed, which acts as a strong regularizer and improves generalization.
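A minimal mixup step; the caller combines the loss as `lam * ce(pred, y_a) + (1 - lam) * ce(pred, y_b)`:

```python
import numpy as np
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    """Blend the batch with a shuffled copy of itself; returns the mixed
    inputs, both label sets, and the interpolation weight lam."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam
```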
### Loss Function
Cross-entropy loss with label smoothing (0.1). Label smoothing discourages overconfident predictions and improves calibration.
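In PyTorch this is a single built-in option:

```python
import torch.nn as nn

# label_smoothing=0.1 mixes the one-hot target with a uniform distribution
# (weight 0.1), so the model is never pushed toward probability exactly 1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```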
### Optimizer & Scheduler
- Optimizer: AdamW (weight decay = 1e-4)
- Scheduler: OneCycleLR with cosine annealing
- Max LR: 3e-3
- Warmup: 10% of total steps
- Gradient clipping: max norm = 1.0
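These settings map directly onto PyTorch; the stand-in model and `steps_per_epoch` below are illustrative values, not the project's:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 6)                  # stand-in for the SE-ResNet
steps_per_epoch, max_epochs = 100, 60      # illustrative values

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=steps_per_epoch * max_epochs,
    pct_start=0.1,             # 10% of steps spent warming up
    anneal_strategy="cos",     # cosine annealing after the peak
)
# Per step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step()
```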
### Mixed Precision Training
All forward and backward passes use `torch.amp` autocast with float16 precision, reducing memory usage and speeding up training on GPU.
### Multi-GPU Support
The model supports `DataParallel` training across multiple GPUs. The best model state is always saved from the unwrapped module (`model.module`) so the checkpoint stays loadable for single-GPU inference.

### Early Stopping
Training stops automatically if validation accuracy does not improve for 12 consecutive epochs (patience = 12). The best checkpoint is selected by validation accuracy.
| Setting         | Value                         |
|-----------------|-------------------------------|
| Optimizer       | AdamW                         |
| Max LR          | 3e-3                          |
| LR Schedule     | OneCycleLR (cosine annealing) |
| Weight Decay    | 1e-4                          |
| Max Epochs      | 60                            |
| Early Stopping  | Patience = 12                 |
| Batch Size      | 64                            |
| Label Smoothing | 0.1                           |
| Mixup Alpha     | 0.4                           |
| Mixed Precision | float16 (AMP)                 |
| Dropout         | 0.3                           |

---

## Inference

During inference, audio files are processed strictly one-by-one in naturally sorted order (1.wav, 2.wav, ...). The preprocessing pipeline runs on each file individually, and only the preprocessing + prediction time is measured; file reading (I/O) is excluded from the timer.

Two output files are produced:
- `results.txt` — one predicted class label (0–5) per line
- `time.txt` — processing time per file in seconds, rounded to 3 decimal places
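Natural ordering ("2.wav" before "10.wav") needs a numeric-aware sort key rather than plain string sorting; a sketch:

```python
import re

def natural_key(name: str):
    """Compare digit runs as integers, so '2.wav' sorts before '10.wav'."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]

files = ["10.wav", "2.wav", "1.wav"]
ordered = sorted(files, key=natural_key)   # ['1.wav', '2.wav', '10.wav']
```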
---

## Requirements

- Python 3.8+
- PyTorch
- torchaudio
- librosa
- noisereduce
- numpy
- soundfile
- scikit-learn

---

## Limitations

- Trained on 3 specific machine types only; may not generalize to unseen machine types out of the box
- Performance may degrade in extremely noisy environments beyond the training distribution
- Fixed 11-second input window; very short recordings are zero-padded, which may affect accuracy

---

## Team

Cairo University — Faculty of Engineering
Computer Engineering Department
Pattern Recognition and Neural Networks — Spring 2026