Yousef07/Filters-Tank

+---
+license: apache-2.0
+language:
+  - en
+tags:
+  - audio-classification
+  - pytorch
+  - se-resnet
+  - machine-fault-detection
+  - predictive-maintenance
+  - mel-spectrogram
+pipeline_tag: audio-classification
+---
+# Filter-Tank — Machine Fault Recognition
+A deep learning system that classifies factory machine audio
+recordings into 6 categories: 3 machine types, each in
+either a normal or an abnormal state.
+Built from scratch using SE-ResNet on log-mel spectrograms.
+---
+## Overview
+Filter-Tank is a complete machine learning pipeline for
+predictive maintenance. Given a raw `.wav` audio recording
+of a factory machine, the system automatically detects
+whether the machine is operating normally or has developed
+a fault — and identifies which machine type it belongs to.
+The model is a custom SE-ResNet (Squeeze-and-Excitation
+ResNet) trained entirely from scratch with no pretrained
+weights, designed specifically for 1-channel log-mel
+spectrogram input.
+---
+## Classes
+| Label | Description          |
+|-------|----------------------|
+| 0     | Machine 1 — Normal   |
+| 1     | Machine 1 — Abnormal |
+| 2     | Machine 2 — Normal   |
+| 3     | Machine 2 — Abnormal |
+| 4     | Machine 3 — Normal   |
+| 5     | Machine 3 — Abnormal |
+---
+## Preprocessing Pipeline
+Every audio file passes through a multi-stage preprocessing
+pipeline before reaching the model. All steps run on CPU.
+File reading is excluded from the inference timer; only
+preprocessing and prediction time is measured.
+### 1. Resampling
+All audio is resampled to a fixed sample rate of 16,000 Hz
+to ensure consistency across recordings made with different
+microphones or recording equipment.
+### 2. Noise Reduction
+Non-stationary background noise is removed using the
+`noisereduce` library with full noise reduction strength
+(prop_decrease=1.0). This handles real-world factory
+environments where background noise varies significantly
+between recordings.
+### 3. Silence Trimming
+Leading and trailing silence is removed using librosa's
+trim function (top_db=20). This ensures the model focuses
+only on the actual machine sound rather than quiet gaps
+at the start or end of a recording.
+### 4. Fixed-Length Normalization
+All recordings are normalized to exactly 11 seconds.
+Files longer than 11 seconds are truncated from the end.
+Files shorter than 11 seconds are zero-padded at the end.
+This gives the model a consistent input size regardless
+of the original recording length.
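The truncate-or-pad rule is easy to get off by one, so here is a small NumPy sketch of it. The helper name `fix_length` is illustrative; `librosa.util.fix_length` offers similar behavior, but the card does not say which was used.

```python
import numpy as np

SAMPLE_RATE = 16_000
TARGET_LEN = 11 * SAMPLE_RATE  # 11 s at 16 kHz = 176,000 samples


def fix_length(y: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate from the end or zero-pad at the end to exactly target_len."""
    if len(y) >= target_len:
        return y[:target_len]
    return np.pad(y, (0, target_len - len(y)))  # pads with zeros by default
```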
+### 5. Log-Mel Spectrogram
+The waveform is converted into a 2D log-mel spectrogram
+using the following settings:
+- Mel bands: 128
+- FFT window size: 1024
+- Hop length: 512
+- Power: 2.0 (power spectrogram)
+- Amplitude converted to dB scale (top_db=80)
+This transforms the raw audio signal into a 2D
+time-frequency representation (mel bands by time frames)
+that the convolutional model can process like a
+single-channel image.
+### 6. CMVN Normalization
+Cepstral Mean and Variance Normalization (CMVN) is applied
+per sample: within each spectrogram, every mel band is
+normalized to zero mean and unit variance along the
+time axis. This reduces the effect of volume variations
+and differences in microphone sensitivity across recordings.
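As an illustration, per-sample normalization along the time axis can be written in a few lines of NumPy. This sketch assumes the statistics are computed separately for each mel band; the small `eps` guards against division by zero for constant bands and is an added detail.

```python
import numpy as np


def cmvn(spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize a (n_mels, n_frames) spectrogram along the time axis."""
    mean = spec.mean(axis=1, keepdims=True)  # one mean per mel band
    std = spec.std(axis=1, keepdims=True)    # one std per mel band
    return (spec - mean) / (std + eps)
```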
+---
+## Model Architecture
+### SE-ResNet (Squeeze-and-Excitation ResNet)
+The model follows a standard ResNet structure enhanced
+with Squeeze-and-Excitation (SE) attention blocks at
+every residual stage.
+**Stem:** A 7x7 convolution (stride 2) followed by
+batch normalization, ReLU, and max pooling reduces
+the input resolution before the residual stages.
+**4 Residual Stages:**
+- Stage 1: 3 SE-Residual blocks, 64 channels
+- Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
+- Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
+- Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)
+**SE Attention Block:** Each residual block includes a
+Squeeze-and-Excitation module that performs global average
+pooling, passes the result through two fully-connected
+layers with a bottleneck (reduction=16), and produces
+per-channel attention weights via sigmoid. This lets the
+model reweight its feature channels, emphasizing the most
+informative ones for each input.
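A minimal PyTorch sketch of such an SE module is shown below. The class name `SEBlock` and the ReLU between the two linear layers are assumptions (ReLU is the usual choice in SE networks); the card only specifies the pooling, the bottleneck ratio, and the sigmoid.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: rescale each channel by a learned weight."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)      # (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w
```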
+**Head:** Global Average Pooling → Dropout (0.3) →
+Fully Connected layer → 6-class output.
+**Weight Initialization:**
+- Conv layers: Kaiming Normal (fan_out, relu)
+- BatchNorm: weight=1, bias=0
+- Linear layers: Xavier Uniform
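The initialization scheme above could be applied with a function like the following. The name `init_weights` is hypothetical, and zeroing the linear bias is an assumption, since the card does not say how biases are initialized.

```python
import torch.nn as nn


def init_weights(module: nn.Module) -> None:
    """Initialize one layer according to its type."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: not stated in the card
```

Calling `model.apply(init_weights)` visits every submodule recursively.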
+**Total Parameters:** ~11 million
+---
+## Training Details
+### Dataset Split
+The dataset is divided using stratified splitting to
+ensure balanced class representation across all splits:
+- Training set: 80%
+- Validation set: 10%
+- Test set: 10%
+Stratification is done by machine type and condition
+combined, so each split has proportional representation
+of all 6 classes.
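One way to get an 80/10/10 stratified split is two chained calls to scikit-learn's `train_test_split`. This is a sketch under the assumption that the class labels 0 to 5 already encode machine type and condition together; the function name and seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(labels, seed: int = 42):
    """Return index arrays for an 80/10/10 train/val/test split."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    # First split: 80% train, 20% held out.
    train_idx, rest_idx = train_test_split(
        idx, test_size=0.2, stratify=labels, random_state=seed
    )
    # Second split: divide the held-out 20% evenly into val and test.
    val_idx, test_idx = train_test_split(
        rest_idx, test_size=0.5, stratify=labels[rest_idx], random_state=seed
    )
    return train_idx, val_idx, test_idx
```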
+### Class Imbalance Handling
+A WeightedRandomSampler is used during training to
+oversample underrepresented classes, ensuring the model
+sees a balanced distribution of all 6 classes per epoch
+regardless of the original dataset distribution.
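A common way to configure the sampler is to weight each sample by the inverse frequency of its class. The sketch below assumes that scheme; the card does not state the exact weights, and `make_balanced_sampler` is a hypothetical helper.

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels: torch.Tensor, num_classes: int = 6) -> WeightedRandomSampler:
    """Sample with replacement so each class is drawn about equally often."""
    class_counts = torch.bincount(labels, minlength=num_classes).float()
    sample_weights = 1.0 / class_counts[labels]  # rare classes get larger weights
    return WeightedRandomSampler(
        sample_weights.double(), num_samples=len(labels), replacement=True
    )
```

The sampler is passed to `DataLoader(..., sampler=sampler)`, in which case `shuffle` must be left unset.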
+### Data Augmentation
+Two augmentation strategies are applied during training:
+**SpecAugment (online, per batch):**
+Applied directly to the spectrogram tensors during
+training. Two frequency masks (freq_mask_param=20) and
+two time masks (time_mask_param=40) are applied randomly,
+forcing the model to be robust to missing frequency bands
+and time segments.
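To make the masking concrete, here is a hand-rolled sketch in plain PyTorch. The parameter names mirror `torchaudio.transforms.FrequencyMasking` and `TimeMasking`, which may be what the project uses; this version samples integer mask widths and fills masked regions with zeros, both of which are assumptions.

```python
import torch


def spec_augment(spec: torch.Tensor, freq_mask_param: int = 20,
                 time_mask_param: int = 40, num_masks: int = 2) -> torch.Tensor:
    """Apply frequency and time masks to a (..., n_mels, n_frames) tensor."""
    spec = spec.clone()  # leave the input untouched
    n_mels, n_frames = spec.shape[-2], spec.shape[-1]
    for _ in range(num_masks):
        # Frequency mask: zero out f consecutive mel bands.
        f = int(torch.randint(0, freq_mask_param + 1, (1,)))
        f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
        spec[..., f0:f0 + f, :] = 0.0
        # Time mask: zero out t consecutive frames.
        t = int(torch.randint(0, time_mask_param + 1, (1,)))
        t0 = int(torch.randint(0, n_frames - t + 1, (1,)))
        spec[..., :, t0:t0 + t] = 0.0
    return spec
```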
+**Mixup (online, per batch):**
+Pairs of training samples are blended together with a
+random interpolation weight drawn from a Beta distribution
+(alpha=0.4). Both the input spectrograms and their labels
+are mixed, which acts as a strong regularizer and improves
+generalization.
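A compact sketch of Mixup on a batch follows. Because cross-entropy is linear in the target, mixing the labels is equivalent to mixing the two losses with the same weight, which is the form used here. The helper names are hypothetical, and drawing one weight per batch (rather than per sample) is an assumption.

```python
import torch
import torch.nn.functional as F


def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam


def mixup_loss(logits, y_a, y_b, lam):
    """Cross-entropy against the mixed labels, written as a weighted sum."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```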
+### Loss Function
+Cross-Entropy Loss with label smoothing (0.1).
+Label smoothing prevents overconfident predictions and
+improves calibration.
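In PyTorch this corresponds to the `label_smoothing` argument of `nn.CrossEntropyLoss`. The sketch below also spells out the target distribution it implies: with 6 classes and smoothing 0.1, the true class gets 0.9 + 0.1/6 and every other class gets 0.1/6. The helper `smoothed_targets` is for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)


def smoothed_targets(labels: torch.Tensor, num_classes: int = 6, eps: float = 0.1) -> torch.Tensor:
    """Target distribution implied by label smoothing."""
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes
```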
+### Optimizer & Scheduler
+- Optimizer: AdamW (weight decay=1e-4)
+- Scheduler: OneCycleLR with cosine annealing
+  - Max LR: 3e-3
+  - Warmup: 10% of total steps
+- Gradient clipping: max norm = 1.0
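The settings above can be wired together roughly as follows. This is a sketch with a stand-in linear model and a dummy loss; `total_steps = 100` is a placeholder for epochs times batches per epoch, and the 10% warmup is expressed through `pct_start=0.1`.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 6)  # stand-in for the SE-ResNet
total_steps = 100        # placeholder: epochs * batches per epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-3, total_steps=total_steps,
    pct_start=0.1, anneal_strategy="cos",
)

lrs = []
for _ in range(total_steps):
    loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])  # LR used for this step
    scheduler.step()                        # OneCycleLR steps once per batch
```

The recorded `lrs` rise to the maximum during the first tenth of training and then decay along a cosine curve.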
+### Mixed Precision Training
+All forward and backward passes use torch.amp autocast
+with float16 precision, reducing memory usage and
+speeding up training on GPU.
+### Multi-GPU Support
+The model supports DataParallel training across multiple
+GPUs automatically. The best model state is always saved
+from the unwrapped module to ensure compatibility
+during single-GPU inference.
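The reason for saving from the unwrapped module is that `nn.DataParallel` prefixes every parameter name with `module.`, which a plain single-GPU model cannot load directly. A minimal sketch, with a hypothetical helper name:

```python
import torch.nn as nn


def unwrapped_state_dict(model: nn.Module) -> dict:
    """Return a state dict without the 'module.' prefix added by DataParallel."""
    core = model.module if isinstance(model, nn.DataParallel) else model
    return core.state_dict()
```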
+### Early Stopping
+Training stops automatically if validation accuracy
+does not improve for 12 consecutive epochs (patience=12).
+The best model checkpoint is saved based on validation
+accuracy.
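The stopping rule can be captured in a small helper class. This is an illustrative sketch, not the project's code; the class name and the convention that `step` returns `True` on a new best are assumptions.

```python
class EarlyStopping:
    """Track validation accuracy and signal when to stop."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best_acc = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Return True on a new best, meaning the caller should save a checkpoint."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.bad_epochs >= self.patience
```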
+| Setting         | Value                         |
+|-----------------|-------------------------------|
+| Optimizer       | AdamW                         |
+| Max LR          | 3e-3                          |
+| LR Schedule     | OneCycleLR (cosine annealing) |
+| Weight Decay    | 1e-4                          |
+| Max Epochs      | 60                            |
+| Early Stopping  | Patience = 12                 |
+| Batch Size      | 64                            |
+| Label Smoothing | 0.1                           |
+| Mixup Alpha     | 0.4                           |
+| Mixed Precision | float16 (AMP)                 |
+| Dropout         | 0.3                           |
+---
+## Inference
+During inference, audio files are processed strictly
+one-by-one in naturally sorted order (1.wav, 2.wav, ...).
+The preprocessing pipeline runs on each file individually,
+and only the preprocessing and prediction time is measured
+(file I/O is excluded from the timer).
+Two output files are produced:
+- `results.txt` — one predicted class label (0–5) per line
+- `time.txt` — processing time per file in seconds
+  (rounded to 3 decimal places)
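The loop described above can be sketched with the standard library alone. `load_fn` and `predict_fn` are hypothetical callables standing in for the file reader and for preprocessing plus the model; only the second one is timed.

```python
import re
import time
from pathlib import Path


def natural_key(path: Path):
    """Sort key that orders 2.wav before 10.wav."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]


def run_inference(audio_dir, load_fn, predict_fn, out_dir="."):
    """Process files one by one and write results.txt and time.txt."""
    files = sorted(Path(audio_dir).glob("*.wav"), key=natural_key)
    labels, times = [], []
    for f in files:
        audio = load_fn(f)          # file I/O, excluded from the timer
        start = time.perf_counter()
        label = predict_fn(audio)   # preprocessing + prediction, timed
        times.append(time.perf_counter() - start)
        labels.append(label)
    out = Path(out_dir)
    out.joinpath("results.txt").write_text("".join(f"{l}\n" for l in labels))
    out.joinpath("time.txt").write_text("".join(f"{t:.3f}\n" for t in times))
    return files
```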
+---
+## Requirements
+- Python 3.8+
+- PyTorch
+- torchaudio
+- librosa
+- noisereduce
+- numpy
+- soundfile
+- scikit-learn
+---
+## Limitations
+- Trained only on 3 specific machine types; may not
+  generalize to unseen machine types out of the box
+- Performance may degrade with extremely noisy
+  environments beyond the training distribution
+- Fixed 11-second input window; very short recordings
+  are zero-padded, which may affect accuracy
+---
+## Team
+Cairo University — Faculty of Engineering
+Computer Engineering Department
+Pattern Recognition and Neural Networks — Spring 2026