---
license: apache-2.0
language:
- en
tags:
- audio-classification
- pytorch
- se-resnet
- machine-fault-detection
- predictive-maintenance
- mel-spectrogram
pipeline_tag: audio-classification
---
# Filter-Tank: Machine Fault Recognition
A deep learning system that listens to factory machine audio
recordings and classifies them into 6 categories across
3 machine types, each in either a normal or abnormal state.
Built from scratch using SE-ResNet on log-mel spectrograms.
---
## Overview
Filter-Tank is a complete machine learning pipeline for
predictive maintenance. Given a raw `.wav` audio recording
of a factory machine, the system automatically detects
whether the machine is operating normally or has developed
a fault, and identifies which machine type it belongs to.
The model is a custom SE-ResNet (Squeeze-and-Excitation
ResNet) trained entirely from scratch with no pretrained
weights, designed specifically for 1-channel log-mel
spectrogram input.
---
## Classes
| Label | Description |
|-------|----------------------|
| 0 | Machine 1 - Normal |
| 1 | Machine 1 - Abnormal |
| 2 | Machine 2 - Normal |
| 3 | Machine 2 - Abnormal |
| 4 | Machine 3 - Normal |
| 5 | Machine 3 - Abnormal |
---
## Preprocessing Pipeline
Every audio file passes through a multi-stage preprocessing
pipeline before reaching the model. All steps run on CPU
and are excluded from the inference timer (only processing
+ prediction time is measured).
### 1. Resampling
All audio is resampled to a fixed sample rate of 16,000 Hz
to ensure consistency across recordings made with different
microphones or recording equipment.
### 2. Noise Reduction
Non-stationary background noise is removed using the
`noisereduce` library with full noise reduction strength
(prop_decrease=1.0). This handles real-world factory
environments where background noise varies significantly
between recordings.
### 3. Silence Trimming
Leading and trailing silence is removed using librosa's
trim function (top_db=20). This ensures the model focuses
only on the actual machine sound rather than quiet gaps
at the start or end of a recording.
### 4. Fixed-Length Normalization
All recordings are normalized to exactly 11 seconds.
Files longer than 11 seconds are truncated from the end.
Files shorter than 11 seconds are zero-padded at the end.
This gives the model a consistent input size regardless
of the original recording length.
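
The truncate-or-pad rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's actual code; the helper name `fix_length` and the constants are assumptions introduced here.

```python
import numpy as np

SAMPLE_RATE = 16_000      # Hz, from the resampling step
TARGET_SECONDS = 11       # fixed clip length used by the pipeline
TARGET_LEN = SAMPLE_RATE * TARGET_SECONDS  # 176,000 samples


def fix_length(waveform: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate from the end or zero-pad at the end to exactly target_len samples."""
    if len(waveform) >= target_len:
        return waveform[:target_len]
    pad = target_len - len(waveform)
    return np.pad(waveform, (0, pad), mode="constant")


# A 5-second clip is padded, a 15-second clip is truncated.
short = fix_length(np.ones(5 * SAMPLE_RATE, dtype=np.float32))
long = fix_length(np.ones(15 * SAMPLE_RATE, dtype=np.float32))
```

Both results have 176,000 samples, so every spectrogram computed afterwards has the same number of frames.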
### 5. Log-Mel Spectrogram
The waveform is converted into a 2D log-mel spectrogram
using the following settings:
- Mel bands: 128
- FFT window size: 1024
- Hop length: 512
- Power: 2.0 (power spectrogram)
- Amplitude converted to dB scale (top_db=80)
This transforms the raw audio signal into a visual
time-frequency representation that the convolutional
model can process effectively.
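
With these settings the spectrogram size follows from simple arithmetic. The sketch below assumes a centered STFT (the default in both librosa and torchaudio); whether the project uses that default is not stated, and the function name `num_frames` is hypothetical.

```python
N_MELS = 128
N_FFT = 1024
HOP_LENGTH = 512
NUM_SAMPLES = 16_000 * 11  # 11 s at 16 kHz


def num_frames(num_samples: int, hop_length: int, n_fft: int, center: bool = True) -> int:
    """Number of STFT frames for a signal of num_samples samples."""
    if center:
        # The signal is padded by n_fft // 2 on both sides before framing.
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


spec_shape = (N_MELS, num_frames(NUM_SAMPLES, HOP_LENGTH, N_FFT))
print(spec_shape)  # (128, 344)
```

Under that assumption, each 11-second clip becomes a 128 x 344 single-channel image.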
### 6. CMVN Normalization
Cepstral Mean and Variance Normalization is applied
per sample: each spectrogram is normalized to have
zero mean and unit variance along the time axis.
This handles volume variations and differences in
microphone sensitivity across recordings.
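
One possible reading of this step is sketched below: statistics are computed for each mel band across all frames of a single spectrogram. The function name `cmvn` and the `eps` guard against division by zero are assumptions, not details taken from the project.

```python
import numpy as np


def cmvn(log_mel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each mel band to zero mean and unit variance over time.

    log_mel has shape (n_mels, n_frames); statistics are taken along axis 1.
    """
    mean = log_mel.mean(axis=1, keepdims=True)
    std = log_mel.std(axis=1, keepdims=True)
    return (log_mel - mean) / (std + eps)


# Synthetic dB-scale spectrogram standing in for a real recording.
rng = np.random.default_rng(0)
spec = rng.normal(loc=-40.0, scale=12.0, size=(128, 344))
normalized = cmvn(spec)
```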
---
## Model Architecture
### SE-ResNet (Squeeze-and-Excitation ResNet)
The model follows a standard ResNet structure enhanced
with Squeeze-and-Excitation (SE) attention blocks at
every residual stage.
**Stem:** A 7x7 convolution (stride 2) followed by
batch normalization, ReLU, and max pooling reduces
the input resolution before the residual stages.
**4 Residual Stages:**
- Stage 1: 3 SE-Residual blocks, 64 channels
- Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
- Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
- Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)
**SE Attention Block:** Each residual block includes a
Squeeze-and-Excitation module that performs global average
pooling, passes the result through two fully-connected
layers with a bottleneck (reduction=16), and produces
per-channel attention weights via sigmoid. This lets the
model emphasize the most informative feature channels
for each input.
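
The description above corresponds to the standard Squeeze-and-Excitation module. The PyTorch sketch below is a generic version with reduction=16, not the project's actual class; the name `SEBlock` and the ReLU between the two linear layers are assumptions based on the usual SE design.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: rescale channels by learned attention weights."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c))   # (B, C), each weight in (0, 1)
        return x * weights.view(b, c, 1, 1)


se = SEBlock(channels=64)
features = torch.randn(2, 64, 32, 86)
out = se(features)   # same shape as the input, channels rescaled
```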
**Head:** Global Average Pooling → Dropout (0.3) →
Fully Connected layer → 6-class output.
**Weight Initialization:**
- Conv layers: Kaiming Normal (fan_out, relu)
- BatchNorm: weight=1, bias=0
- Linear layers: Xavier Uniform
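
The initialization scheme listed above can be applied with `nn.Module.apply`. The sketch below uses a toy network in place of the SE-ResNet; the function name `init_weights`, the toy layers, and the stem padding of 3 are assumptions made only for illustration.

```python
import torch.nn as nn


def init_weights(module: nn.Module) -> None:
    """Apply the listed initialization scheme to one layer."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.constant_(module.weight, 1.0)
        nn.init.constant_(module.bias, 0.0)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)


# Toy network standing in for the SE-ResNet; only the layer types matter here.
net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 6),
)
net.apply(init_weights)   # visits every submodule recursively
```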
**Total Parameters:** ~11 million
---
## Training Details
### Dataset Split
The dataset is divided using stratified splitting to
ensure balanced class representation across all splits:
- Training set: 80%
- Validation set: 10%
- Test set: 10%
Stratification is done by machine type and condition
combined, so each split has proportional representation
of all 6 classes.
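
As an illustration of what stratification means here, the sketch below splits indices class by class in plain Python. It is not the project's split code (which may well use scikit-learn); the function name `stratified_split`, the seed, and the example label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


# Hypothetical label list: 100 clips for each of the 6 classes.
labels = [c for c in range(6) for _ in range(100)]
train_idx, val_idx, test_idx = stratified_split(labels)
```

Each class contributes 80, 10, and 10 clips to the three splits, so the class proportions are identical in all of them.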
### Class Imbalance Handling
A WeightedRandomSampler is used during training to
oversample underrepresented classes, ensuring the model
sees a balanced distribution of all 6 classes per epoch
regardless of the original dataset distribution.
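
A common way to drive a WeightedRandomSampler is to give every sample a weight equal to the inverse frequency of its class. The sketch below computes such weights in plain Python; the exact weighting used in this project is not documented, so this is an assumption, and the label list is made up.

```python
from collections import Counter


def sample_weights(labels):
    """Per-sample weights proportional to the inverse frequency of each class."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical imbalanced label list: many normal clips, few abnormal ones.
labels = [0] * 90 + [1] * 10
weights = sample_weights(labels)

# Total sampling mass per class is equal, so draws are balanced in expectation.
mass = {c: sum(w for w, lab in zip(weights, labels) if lab == c) for c in set(labels)}
```

The resulting list can be passed as the `weights` argument of `torch.utils.data.WeightedRandomSampler`, together with `num_samples` and `replacement=True`.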
### Data Augmentation
Two augmentation strategies are applied during training:
**SpecAugment (online, per batch):**
Applied directly to the spectrogram tensors during
training. Two frequency masks (freq_mask_param=20) and
two time masks (time_mask_param=40) are applied randomly,
forcing the model to be robust to missing frequency bands
and time segments.
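
The masking idea can be sketched on a single spectrogram with NumPy. This is an illustrative version only: the project presumably masks batched tensors (for example with torchaudio's `FrequencyMasking` and `TimeMasking` transforms), and the mask value of 0 and the function name `spec_augment` are assumptions.

```python
import numpy as np


def spec_augment(spec, rng, n_freq_masks=2, freq_mask_param=20,
                 n_time_masks=2, time_mask_param=40):
    """Zero out random frequency bands and time spans of a (n_mels, n_frames) array."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        width = rng.integers(0, freq_mask_param + 1)   # mask width in [0, param]
        start = rng.integers(0, n_mels - width + 1)
        out[start:start + width, :] = 0.0
    for _ in range(n_time_masks):
        width = rng.integers(0, time_mask_param + 1)
        start = rng.integers(0, n_frames - width + 1)
        out[:, start:start + width] = 0.0
    return out


rng = np.random.default_rng(0)
spec = np.ones((128, 344), dtype=np.float32)
masked = spec_augment(spec, rng)   # the input array is left untouched
```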
**Mixup (online, per batch):**
Pairs of training samples are blended together with a
random interpolation weight drawn from a Beta distribution
(alpha=0.4). Both the input spectrograms and their labels
are mixed, which acts as a strong regularizer and improves
generalization.
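
A minimal NumPy sketch of Mixup with alpha=0.4 follows. It draws one interpolation weight per batch and mixes one-hot labels directly; the project may instead mix the two loss terms, so treat the details (and the name `mixup`) as assumptions.

```python
import numpy as np


def mixup(inputs, one_hot_labels, rng, alpha=0.4):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)                 # interpolation weight in [0, 1]
    perm = rng.permutation(len(inputs))          # partner index for every sample
    mixed_x = lam * inputs + (1.0 - lam) * inputs[perm]
    mixed_y = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y, lam


# Synthetic batch of 8 single-channel spectrograms with random class labels.
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 1, 128, 344)).astype(np.float32)
labels = np.eye(6, dtype=np.float32)[rng.integers(0, 6, size=8)]
mixed_x, mixed_y, lam = mixup(batch, labels, rng)
```

Each mixed label row still sums to 1, so it remains a valid target distribution for cross-entropy.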
### Loss Function
Cross-Entropy Loss with label smoothing (0.1).
Label smoothing prevents overconfident predictions and
improves calibration.
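
To make the effect concrete, the sketch below builds the target distribution that a smoothing value of 0.1 produces for 6 classes, following the convention PyTorch uses for `label_smoothing` in `nn.CrossEntropyLoss` (the one-hot target is mixed with a uniform distribution over all classes). The helper name `smoothed_target` is hypothetical.

```python
def smoothed_target(true_class: int, num_classes: int = 6, smoothing: float = 0.1):
    """Target distribution used by cross-entropy with label smoothing."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[true_class] = 1.0 - smoothing + off_value
    return target


target = smoothed_target(true_class=2)
print([round(p, 4) for p in target])
# [0.0167, 0.0167, 0.9167, 0.0167, 0.0167, 0.0167]
```

The true class keeps about 91.7% of the probability mass instead of 100%, which discourages the model from pushing its logits to extremes.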
### Optimizer & Scheduler
- Optimizer: AdamW (weight decay=1e-4)
- Scheduler: OneCycleLR with cosine annealing
- Max LR: 3e-3
- Warmup: 10% of total steps
- Gradient clipping: max norm = 1.0
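
The settings above map onto PyTorch roughly as follows. This is a sketch with a stand-in linear model and a made-up `steps_per_epoch`; mixed precision is omitted, and the real training loop may differ.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 6)            # stand-in for the SE-ResNet
steps_per_epoch, epochs = 20, 60    # steps_per_epoch is a hypothetical value
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=total_steps,
    pct_start=0.1,             # 10% of steps spent warming up
    anneal_strategy="cos",     # cosine annealing
)

lrs = []
for _ in range(total_steps):
    loss = model(torch.randn(4, 10)).sum()   # dummy loss on random data
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])
    scheduler.step()                         # OneCycleLR steps once per batch
```

The recorded `lrs` rise to the peak of 3e-3 about 10% of the way through training and then decay along a cosine curve.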
### Mixed Precision Training
All forward and backward passes use torch.amp autocast
with float16 precision, reducing memory usage and
speeding up training on GPU.
### Multi-GPU Support
The model supports DataParallel training across multiple
GPUs automatically. The best model state is always saved
from the unwrapped module to ensure compatibility
during single-GPU inference.
### Early Stopping
Training stops automatically if validation accuracy
does not improve for 12 consecutive epochs (patience=12).
The best model checkpoint is saved based on validation
accuracy.
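
The stopping rule can be sketched as a small helper class. The class name `EarlyStopping` and the accuracy curve are hypothetical; checkpoint saving is only indicated by a comment.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.best_epoch = val_acc, epoch
            self.bad_epochs = 0      # this is where a checkpoint would be saved
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical accuracy curve: improves for 5 epochs, then plateaus.
history = [0.60, 0.70, 0.78, 0.82, 0.85] + [0.84] * 55
stopper = EarlyStopping(patience=12)
for epoch, acc in enumerate(history):
    if stopper.step(epoch, acc):
        break   # stops at epoch 16, twelve epochs after the best one (epoch 4)
```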
| Setting | Value |
|-----------------|------------------------------|
| Optimizer | AdamW |
| Max LR | 3e-3 |
| LR Schedule | OneCycleLR (cosine annealing)|
| Weight Decay | 1e-4 |
| Max Epochs | 60 |
| Early Stopping | Patience = 12 |
| Batch Size | 64 |
| Label Smoothing | 0.1 |
| Mixup Alpha | 0.4 |
| Mixed Precision | float16 (AMP) |
| Dropout | 0.3 |
---
## Inference
During inference, audio files are processed strictly
one-by-one in naturally sorted order (1.wav, 2.wav, ...).
The preprocessing pipeline runs on each file individually,
and only the processing + prediction time is measured
(I/O reading is excluded from the timer).
Two output files are produced:
- `results.txt`: one predicted class label (0-5) per line
- `time.txt`: processing time per file in seconds
(rounded to 3 decimal places)
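
The loop below sketches natural sorting, per-file timing, and the two output files. `predict` is a hypothetical stand-in for the preprocessing and model call, `natural_key` is a helper introduced here, and the temporary output directory is used only so the example runs anywhere.

```python
import re
import tempfile
import time
from pathlib import Path


def natural_key(name: str):
    """Sort key that orders '2.wav' before '10.wav'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def predict(path: str) -> int:
    """Hypothetical stand-in for preprocessing + model forward pass."""
    return 0


files = sorted(["10.wav", "2.wav", "1.wav"], key=natural_key)
labels, times = [], []
for path in files:                       # strictly one file at a time
    start = time.perf_counter()
    labels.append(predict(path))
    times.append(time.perf_counter() - start)

out_dir = Path(tempfile.mkdtemp())       # replace with the real output directory
(out_dir / "results.txt").write_text("\n".join(str(label) for label in labels) + "\n")
(out_dir / "time.txt").write_text("\n".join(f"{t:.3f}" for t in times) + "\n")
```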
---
## Requirements
- Python 3.8+
- PyTorch
- torchaudio
- librosa
- noisereduce
- numpy
- soundfile
- scikit-learn
---
## Limitations
- Trained only on 3 specific machine types; may not
generalize to unseen machine types out of the box
- Performance may degrade with extremely noisy
environments beyond the training distribution
- Fixed 11-second input window; very short recordings
are zero-padded which may affect accuracy
---
## Team
Cairo University, Faculty of Engineering
Computer Engineering Department
Pattern Recognition and Neural Networks, Spring 2026