---
license: apache-2.0
language:
  - en
tags:
  - audio-classification
  - pytorch
  - se-resnet
  - machine-fault-detection
  - predictive-maintenance
  - mel-spectrogram
pipeline_tag: audio-classification
---

# Filter-Tank: Machine Fault Recognition

A deep learning system that listens to factory machine audio
recordings and classifies them into 6 categories across
3 machine types, each in either a normal or abnormal state.
Built from scratch using SE-ResNet on log-mel spectrograms.

---

## Overview

Filter-Tank is a complete machine learning pipeline for
predictive maintenance. Given a raw `.wav` audio recording
of a factory machine, the system automatically detects
whether the machine is operating normally or has developed
a fault, and identifies which machine type it belongs to.

The model is a custom SE-ResNet (Squeeze-and-Excitation
ResNet) trained entirely from scratch with no pretrained
weights, designed specifically for 1-channel log-mel
spectrogram input.

---

## Classes

| Label | Description          |
|-------|----------------------|
| 0     | Machine 1 - Normal   |
| 1     | Machine 1 - Abnormal |
| 2     | Machine 2 - Normal   |
| 3     | Machine 2 - Abnormal |
| 4     | Machine 3 - Normal   |
| 5     | Machine 3 - Abnormal |

---

## Preprocessing Pipeline

Every audio file passes through a multi-stage preprocessing
pipeline before reaching the model. All steps run on CPU
and are excluded from the inference timer (only processing
+ prediction time is measured).

### 1. Resampling
All audio is resampled to a fixed sample rate of 16,000 Hz
to ensure consistency across recordings made with different
microphones or recording equipment.

### 2. Noise Reduction
Non-stationary background noise is removed using the
`noisereduce` library with full noise reduction strength
(prop_decrease=1.0). This handles real-world factory
environments where background noise varies significantly
between recordings.

### 3. Silence Trimming
Leading and trailing silence is removed using librosa's
trim function (top_db=20). This ensures the model focuses
only on the actual machine sound rather than quiet gaps
at the start or end of a recording.

### 4. Fixed-Length Normalization
All recordings are normalized to exactly 11 seconds.
Files longer than 11 seconds are truncated from the end.
Files shorter than 11 seconds are zero-padded at the end.
This gives the model a consistent input size regardless
of the original recording length.
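
A minimal sketch of the truncate-or-pad rule, assuming the waveform is a 1-D NumPy array already resampled to 16 kHz. The function name `fix_length` and the constants are illustrative, not taken from the project code.

```python
import numpy as np

SAMPLE_RATE = 16_000                       # Hz, from the resampling step
TARGET_SECONDS = 11                        # fixed clip length
TARGET_LEN = SAMPLE_RATE * TARGET_SECONDS  # 176,000 samples


def fix_length(waveform: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate from the end, or zero-pad at the end, to exactly target_len samples."""
    if len(waveform) >= target_len:
        return waveform[:target_len]
    # Pad only on the right so the original audio stays at the start.
    return np.pad(waveform, (0, target_len - len(waveform)))
```

At 16 kHz an 11-second clip is 176,000 samples, so every waveform leaving this step has that length.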

### 5. Log-Mel Spectrogram
The waveform is converted into a 2D log-mel spectrogram
using the following settings:
- Mel bands: 128
- FFT window size: 1024
- Hop length: 512
- Power: 2.0 (power spectrogram)
- Amplitude converted to dB scale (top_db=80)

This transforms the raw audio signal into a visual
time-frequency representation that the convolutional
model can process effectively.

### 6. CMVN Normalization
Cepstral Mean and Variance Normalization is applied
per sample: each spectrogram is normalized to have
zero mean and unit variance along the time axis.
This handles volume variations and differences in
microphone sensitivity across recordings.
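
One way to read "along the time axis" is that each mel band is normalized with its own mean and standard deviation computed over all frames. The sketch below assumes that reading and a `(n_mels, n_frames)` array layout; `eps` is an added guard against division by zero in silent bands.

```python
import numpy as np


def cmvn(spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample CMVN for a (n_mels, n_frames) spectrogram."""
    mean = spec.mean(axis=1, keepdims=True)  # one mean per mel band
    std = spec.std(axis=1, keepdims=True)    # one std per mel band
    return (spec - mean) / (std + eps)
```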

---

## Model Architecture

### SE-ResNet (Squeeze-and-Excitation ResNet)

The model follows a standard ResNet structure enhanced
with Squeeze-and-Excitation (SE) attention blocks at
every residual stage.

**Stem:** A 7x7 convolution (stride 2) followed by
batch normalization, ReLU, and max pooling reduces
the input resolution before the residual stages.

**4 Residual Stages:**
- Stage 1: 3 SE-Residual blocks, 64 channels
- Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
- Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
- Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)

**SE Attention Block:** Each residual block includes a
Squeeze-and-Excitation module that performs global average
pooling, passes the result through two fully-connected
layers with a bottleneck (reduction=16), and produces
per-channel attention weights via sigmoid. This lets the
model focus on the most informative frequency channels
for each input.
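
The SE module described above can be sketched in PyTorch as below. The class name `SEBlock` and the ReLU between the two fully-connected layers are assumptions (ReLU is the usual choice in SE blocks, but the card does not state it).

```python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels using globally pooled statistics."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),           # assumed activation
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                   # scale each channel
```

In a residual block, this module would typically be applied to the output of the convolution path before it is added to the skip connection.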

**Head:** Global Average Pooling → Dropout (0.3) →
Fully Connected layer β†’ 6-class output.

**Weight Initialization:**
- Conv layers: Kaiming Normal (fan_out, relu)
- BatchNorm: weight=1, bias=0
- Linear layers: Xavier Uniform
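
A sketch of how this initialization scheme could be applied with `torch.nn.init`. The function name `init_weights` is hypothetical, and zeroing the conv and linear biases is an assumption not stated above.

```python
import torch
from torch import nn


def init_weights(module: nn.Module) -> None:
    """Initialize one layer; intended for use with model.apply(init_weights)."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: biases start at zero
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: biases start at zero
```

`nn.Module.apply` visits every submodule recursively, so a single `model.apply(init_weights)` call covers the stem, all residual stages, and the head.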

**Total Parameters:** ~11 million

---

## Training Details

### Dataset Split
The dataset is divided using stratified splitting to
ensure balanced class representation across all splits:
- Training set: 80%
- Validation set: 10%
- Test set: 10%

Stratification is done by machine type and condition
combined, so each split has proportional representation
of all 6 classes.
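
An 80/10/10 stratified split can be built from two calls to scikit-learn's `train_test_split`. This is a sketch, assuming the 6-class label already encodes machine type and condition together; the function name and the seed are illustrative.

```python
from sklearn.model_selection import train_test_split


def stratified_split(paths, labels, seed=42):
    """Split into 80% train, 10% validation, 10% test, stratified by class label."""
    # First cut: hold out 20% of the data, keeping class proportions.
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Second cut: halve the held-out 20% into validation and test.
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)
```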

### Class Imbalance Handling
A WeightedRandomSampler is used during training to
oversample underrepresented classes, ensuring the model
sees a balanced distribution of all 6 classes per epoch
regardless of the original dataset distribution.
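
A common way to configure `WeightedRandomSampler` for this purpose is to weight each sample by the inverse frequency of its class. The sketch below assumes that weighting; the card does not say how the weights are actually computed.

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels, num_classes: int = 6) -> WeightedRandomSampler:
    """Build a sampler that draws every class with roughly equal probability."""
    labels = torch.as_tensor(labels, dtype=torch.long)
    # clamp avoids division by zero if a class is absent from this split
    class_counts = torch.bincount(labels, minlength=num_classes).clamp(min=1)
    sample_weights = (1.0 / class_counts.double())[labels]
    # Sampling with replacement is what allows rare classes to be oversampled.
    return WeightedRandomSampler(
        sample_weights, num_samples=len(labels), replacement=True
    )
```

The sampler is passed to `DataLoader(..., sampler=sampler)`; `shuffle` must then be left unset, because the two options are mutually exclusive.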

### Data Augmentation
Two augmentation strategies are applied during training:

**SpecAugment (online, per batch):**
Applied directly to the spectrogram tensors during
training. Two frequency masks (freq_mask_param=20) and
two time masks (time_mask_param=40) are applied randomly,
forcing the model to be robust to missing frequency bands
and time segments.
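
The masking can be sketched in plain PyTorch as below. This is a hand-rolled illustration, not the project's code: the parameter names mirror torchaudio's `FrequencyMasking` and `TimeMasking`, but whether those transforms are used is not stated. It assumes a `(batch, 1, n_mels, n_frames)` tensor, one shared mask per batch, and a fill value of 0 (the per-band mean after CMVN).

```python
import torch


def spec_augment(spec: torch.Tensor, freq_mask_param: int = 20,
                 time_mask_param: int = 40, num_masks: int = 2) -> torch.Tensor:
    """Apply num_masks frequency masks and num_masks time masks to a batch."""
    spec = spec.clone()  # leave the caller's tensor untouched
    _, _, n_mels, n_frames = spec.shape
    for _ in range(num_masks):
        # Frequency mask: width in [0, freq_mask_param], random start band.
        f = int(torch.randint(0, freq_mask_param + 1, (1,)))
        f0 = int(torch.randint(0, n_mels - f + 1, (1,)))
        spec[:, :, f0:f0 + f, :] = 0.0
        # Time mask: width in [0, time_mask_param], random start frame.
        t = int(torch.randint(0, time_mask_param + 1, (1,)))
        t0 = int(torch.randint(0, n_frames - t + 1, (1,)))
        spec[:, :, :, t0:t0 + t] = 0.0
    return spec
```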

**Mixup (online, per batch):**
Pairs of training samples are blended together with a
random interpolation weight drawn from a Beta distribution
(alpha=0.4). Both the input spectrograms and their labels
are mixed, which acts as a strong regularizer and improves
generalization.
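
A minimal Mixup sketch, assuming integer class labels that are converted to one-hot vectors before blending, and one interpolation weight per batch. The function name `mixup` and these details are illustrative.

```python
import torch
import torch.nn.functional as F


def mixup(x: torch.Tensor, y: torch.Tensor, num_classes: int = 6,
          alpha: float = 0.4):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))               # partner index per sample
    y_onehot = F.one_hot(y, num_classes).float()
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y
```

The mixed labels are probability vectors. `nn.CrossEntropyLoss` accepts such soft targets directly in recent PyTorch versions.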

### Loss Function
Cross-Entropy Loss with label smoothing (0.1).
Label smoothing prevents overconfident predictions and
improves calibration.

### Optimizer & Scheduler
- Optimizer: AdamW (weight decay=1e-4)
- Scheduler: OneCycleLR with cosine annealing
  - Max LR: 3e-3
  - Warmup: 10% of total steps
- Gradient clipping: max norm = 1.0
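
The loss, optimizer, scheduler, and clipping settings could be wired together as in the sketch below. The helper names are hypothetical; `pct_start=0.1` is how a 10% warmup is expressed in `OneCycleLR`, and `anneal_strategy="cos"` selects cosine annealing.

```python
import torch
from torch import nn


def build_training_objects(model: nn.Module, steps_per_epoch: int, epochs: int = 60):
    """Create the loss, optimizer, and LR scheduler from the settings above."""
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-3, epochs=epochs, steps_per_epoch=steps_per_epoch,
        pct_start=0.1, anneal_strategy="cos",
    )
    return criterion, optimizer, scheduler


def train_step(model, batch, targets, criterion, optimizer, scheduler) -> float:
    """One optimization step with gradient clipping at max norm 1.0."""
    optimizer.zero_grad()
    loss = criterion(model(batch), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch, not per epoch
    return loss.item()
```

`OneCycleLR` starts at `max_lr / 25` by default, so training begins at 1.2e-4 and ramps up to 3e-3 during the warmup.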

### Mixed Precision Training
All forward and backward passes use torch.amp autocast
with float16 precision, reducing memory usage and
speeding up training on GPU.

### Multi-GPU Support
The model supports DataParallel training across multiple
GPUs automatically. The best model state is always saved
from the unwrapped module to ensure compatibility
during single-GPU inference.
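
Saving from the unwrapped module matters because `nn.DataParallel` prefixes every parameter name with `module.`, which a plain single-GPU model cannot load. A small sketch, with a hypothetical helper name:

```python
from torch import nn


def unwrapped_state_dict(model: nn.Module) -> dict:
    """Return a state dict without the 'module.' prefix added by DataParallel."""
    core = model.module if isinstance(model, nn.DataParallel) else model
    return core.state_dict()
```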

### Early Stopping
Training stops automatically if validation accuracy
does not improve for 12 consecutive epochs (patience=12).
The best model checkpoint is saved based on validation
accuracy.
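
The stopping rule can be expressed as a small helper. The class below is an illustrative sketch, not the project's code; it treats only a strict increase in validation accuracy as an improvement.

```python
class EarlyStopping:
    """Track the best validation accuracy and signal when to stop."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best_acc = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch; return True if it set a new best (save a checkpoint)."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.epochs_without_improvement = 0
            return True
        self.epochs_without_improvement += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `step` would be called once per epoch with the validation accuracy, and the loop would exit when `should_stop` becomes true.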

| Setting         | Value                        |
|-----------------|------------------------------|
| Optimizer       | AdamW                        |
| Max LR          | 3e-3                         |
| LR Schedule     | OneCycleLR (cosine annealing)|
| Weight Decay    | 1e-4                         |
| Max Epochs      | 60                           |
| Early Stopping  | Patience = 12                |
| Batch Size      | 64                           |
| Label Smoothing | 0.1                          |
| Mixup Alpha     | 0.4                          |
| Mixed Precision | float16 (AMP)                |
| Dropout         | 0.3                          |

---

## Inference

During inference, audio files are processed strictly
one-by-one in naturally sorted order (1.wav, 2.wav, ...).
The preprocessing pipeline runs on each file individually,
and only the processing + prediction time is measured
(I/O reading is excluded from the timer).

Two output files are produced:
- `results.txt`: one predicted class label (0–5) per line
- `time.txt`: processing time per file in seconds
  (rounded to 3 decimal places)
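
The file ordering and output format can be sketched as follows. `predict_fn` is a hypothetical callable standing in for preprocessing plus model prediction, and reading raw bytes stands in for audio loading; only the natural sort, the timer placement, and the two output files follow the description above.

```python
import re
import time
from pathlib import Path


def natural_key(path: Path):
    """Sort key that orders 2.wav before 10.wav by comparing digit runs as integers."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]


def run_inference(audio_dir, predict_fn, out_dir="."):
    """Process files one by one and write results.txt and time.txt."""
    files = sorted(Path(audio_dir).glob("*.wav"), key=natural_key)
    labels, timings = [], []
    for f in files:
        raw = f.read_bytes()             # I/O happens outside the timer
        start = time.perf_counter()
        labels.append(predict_fn(raw))   # processing + prediction (hypothetical)
        timings.append(time.perf_counter() - start)
    Path(out_dir, "results.txt").write_text(
        "\n".join(str(label) for label in labels) + "\n")
    Path(out_dir, "time.txt").write_text(
        "\n".join(f"{t:.3f}" for t in timings) + "\n")
    return files
```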

---

## Requirements

- Python 3.8+
- PyTorch
- torchaudio
- librosa
- noisereduce
- numpy
- soundfile
- scikit-learn

---

## Limitations

- Trained only on 3 specific machine types; may not
  generalize to unseen machine types out of the box
- Performance may degrade with extremely noisy
  environments beyond the training distribution
- Fixed 11-second input window; very short recordings
  are zero-padded which may affect accuracy

---

## Team

Cairo University, Faculty of Engineering
Computer Engineering Department
Pattern Recognition and Neural Networks, Spring 2026