Yousef07 committed
Commit d200ca0 · verified · 1 Parent(s): f0cb872

Update README.md

README.md CHANGED
@@ -1,3 +1,268 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - audio-classification
+ - pytorch
+ - se-resnet
+ - machine-fault-detection
+ - predictive-maintenance
+ - mel-spectrogram
+ pipeline_tag: audio-classification
+ ---
+
+ # Filter-Tank: Machine Fault Recognition
+
+ A deep learning system that listens to factory machine audio
+ recordings and classifies them into 6 categories across
+ 3 machine types, each in either a normal or abnormal state.
+ Built from scratch using SE-ResNet on log-mel spectrograms.
+
+ ---
+
+ ## Overview
+
+ Filter-Tank is a complete machine learning pipeline for
+ predictive maintenance. Given a raw `.wav` audio recording
+ of a factory machine, the system automatically detects
+ whether the machine is operating normally or has developed
+ a fault, and identifies which machine type it belongs to.
+
+ The model is a custom SE-ResNet (Squeeze-and-Excitation
+ ResNet) trained entirely from scratch with no pretrained
+ weights, designed specifically for 1-channel log-mel
+ spectrogram input.
+
+ ---
+
+ ## Classes
+
+ | Label | Description          |
+ |-------|----------------------|
+ | 0     | Machine 1 - Normal   |
+ | 1     | Machine 1 - Abnormal |
+ | 2     | Machine 2 - Normal   |
+ | 3     | Machine 2 - Abnormal |
+ | 4     | Machine 3 - Normal   |
+ | 5     | Machine 3 - Abnormal |
+
+ ---
+
+ ## Preprocessing Pipeline
+
+ Every audio file passes through a multi-stage preprocessing
+ pipeline before reaching the model. All steps run on CPU
+ and are excluded from the inference timer (only processing
+ + prediction time is measured).
+
+ ### 1. Resampling
+ All audio is resampled to a fixed sample rate of 16,000 Hz
+ to ensure consistency across recordings made with different
+ microphones or recording equipment.
+
+ ### 2. Noise Reduction
+ Non-stationary background noise is removed using the
+ `noisereduce` library with full noise reduction strength
+ (prop_decrease=1.0). This handles real-world factory
+ environments where background noise varies significantly
+ between recordings.
+
+ ### 3. Silence Trimming
+ Leading and trailing silence is removed using librosa's
+ trim function (top_db=20). This ensures the model focuses
+ only on the actual machine sound rather than quiet gaps
+ at the start or end of a recording.
+
+ ### 4. Fixed-Length Normalization
+ All recordings are normalized to exactly 11 seconds.
+ Files longer than 11 seconds are truncated from the end.
+ Files shorter than 11 seconds are zero-padded at the end.
+ This gives the model a consistent input size regardless
+ of the original recording length.
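
The truncate-or-pad rule above can be sketched with NumPy alone. This is a minimal illustration, not the project's code: the name `fix_length` and the constants are assumptions, and the 176,000-sample target simply follows from 11 seconds at 16,000 Hz.

```python
import numpy as np

SAMPLE_RATE = 16_000      # Hz, from the resampling step
TARGET_SECONDS = 11       # fixed clip length described above
TARGET_SAMPLES = SAMPLE_RATE * TARGET_SECONDS  # 176,000 samples


def fix_length(waveform: np.ndarray, target: int = TARGET_SAMPLES) -> np.ndarray:
    """Truncate from the end, or zero-pad at the end, to `target` samples."""
    if waveform.shape[0] >= target:
        return waveform[:target]
    pad = target - waveform.shape[0]
    return np.pad(waveform, (0, pad), mode="constant")


short = fix_length(np.ones(16_000, dtype=np.float32))    # 1 s clip, gets padded
long = fix_length(np.ones(320_000, dtype=np.float32))    # 20 s clip, gets cut
```

Both results have the same length, so they can be stacked into one batch without further bookkeeping.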
+
+ ### 5. Log-Mel Spectrogram
+ The waveform is converted into a 2D log-mel spectrogram
+ using the following settings:
+ - Mel bands: 128
+ - FFT window size: 1024
+ - Hop length: 512
+ - Power: 2.0 (power spectrogram)
+ - Converted to dB scale (top_db=80)
+
+ This transforms the raw audio signal into a visual
+ time-frequency representation that the convolutional
+ model can process effectively.
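
As a quick check on what these settings imply for the model input, the frame count can be worked out by hand. The sketch below assumes centered framing (the default in both librosa and torchaudio); the README does not say which framing the project uses, and `num_frames` is a hypothetical helper.

```python
def num_frames(n_samples: int, n_fft: int = 1024, hop_length: int = 512,
               center: bool = True) -> int:
    """Number of STFT frames for a signal of `n_samples` samples."""
    if center:
        # Centered framing pads n_fft // 2 samples on each side.
        return 1 + n_samples // hop_length
    return 1 + (n_samples - n_fft) // hop_length


N_MELS = 128
n_samples = 11 * 16_000                      # 176,000 samples per clip
spec_shape = (N_MELS, num_frames(n_samples))
print(spec_shape)  # (128, 344)
```

Under that assumption each 11-second clip becomes a 128 x 344 single-channel image.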
+
+ ### 6. CMVN Normalization
+ Cepstral Mean and Variance Normalization is applied
+ per sample: each spectrogram is normalized to have
+ zero mean and unit variance along the time axis.
+ This handles volume variations and differences in
+ microphone sensitivity across recordings.
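
One plausible reading of "along the time axis" is that each mel band is standardized over its frames. The sketch below implements that reading with NumPy; the function name `cmvn`, the `eps` guard, and the random input are assumptions for illustration.

```python
import numpy as np


def cmvn(log_mel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each mel band to zero mean and unit variance over time.

    `log_mel` has shape (n_mels, n_frames). Statistics come from this one
    sample only, so no dataset-level mean or variance is needed.
    """
    mean = log_mel.mean(axis=1, keepdims=True)
    std = log_mel.std(axis=1, keepdims=True)
    return (log_mel - mean) / (std + eps)   # eps avoids division by zero


rng = np.random.default_rng(0)
spec = rng.normal(loc=-40.0, scale=12.0, size=(128, 344))  # fake dB values
normed = cmvn(spec)
```

Because the statistics are per sample, a constant gain change in the recording (a constant offset in dB) cancels out.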
+
+ ---
+
+ ## Model Architecture
+
+ ### SE-ResNet (Squeeze-and-Excitation ResNet)
+
+ The model follows a standard ResNet structure enhanced
+ with Squeeze-and-Excitation (SE) attention blocks at
+ every residual stage.
+
+ **Stem:** A 7x7 convolution (stride 2) followed by
+ batch normalization, ReLU, and max pooling reduces
+ the input resolution before the residual stages.
+
+ **4 Residual Stages:**
+ - Stage 1: 3 SE-Residual blocks, 64 channels
+ - Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
+ - Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
+ - Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)
+
+ **SE Attention Block:** Each residual block includes a
+ Squeeze-and-Excitation module that performs global average
+ pooling, passes the result through two fully-connected
+ layers with a bottleneck (reduction=16), and produces
+ per-channel attention weights via sigmoid. This lets the
+ model focus on the most informative feature channels
+ for each input.
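
The SE module described above can be sketched in a few lines of PyTorch. This is an illustrative version under the stated settings (global average pooling, two linear layers, reduction of 16, sigmoid gate); the class name `SEBlock` and the tensor sizes are assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels with learned gates."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),             # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),                            # gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c))   # (B, C)
        return x * weights.view(b, c, 1, 1)          # rescale each channel


se = SEBlock(channels=64)
features = torch.randn(2, 64, 32, 86)
out = se(features)
```

In a residual block the gate is typically applied to the block's output before the skip connection is added.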
+
+ **Head:** Global Average Pooling → Dropout (0.3) →
+ Fully Connected layer → 6-class output.
+
+ **Weight Initialization:**
+ - Conv layers: Kaiming Normal (fan_out, relu)
+ - BatchNorm: weight=1, bias=0
+ - Linear layers: Xavier Uniform
+
+ **Total Parameters:** ~11 million
+
+ ---
+
+ ## Training Details
+
+ ### Dataset Split
+ The dataset is divided using stratified splitting to
+ ensure balanced class representation across all splits:
+ - Training set: 80%
+ - Validation set: 10%
+ - Test set: 10%
+
+ Stratification is done by machine type and condition
+ combined, so each split has proportional representation
+ of all 6 classes.
+
+ ### Class Imbalance Handling
+ A WeightedRandomSampler is used during training to
+ oversample underrepresented classes, ensuring the model
+ sees a balanced distribution of all 6 classes per epoch
+ regardless of the original dataset distribution.
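
A common way to set up such a sampler is to weight each sample by the inverse frequency of its class. The sketch below assumes that scheme; the label list is made up, and the README does not state how the project computes its weights.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical, imbalanced label list; real labels come from the dataset.
labels = [0] * 50 + [1] * 5 + [2] * 30 + [3] * 5 + [4] * 8 + [5] * 2

counts = Counter(labels)
# Each sample is weighted by the inverse frequency of its class, so every
# class carries the same total sampling probability.
sample_weights = torch.tensor([1.0 / counts[y] for y in labels],
                              dtype=torch.double)

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),   # one "epoch" draws as many items as the dataset
    replacement=True,          # needed so rare classes can be drawn repeatedly
)
# Pass `sampler=sampler` (and leave `shuffle` unset) when building the DataLoader.
drawn = [labels[i] for i in sampler]
```

The balance holds in expectation only; any single epoch will still show some random variation between classes.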
+
+ ### Data Augmentation
+ Two augmentation strategies are applied during training:
+
+ **SpecAugment (online, per batch):**
+ Applied directly to the spectrogram tensors during
+ training. Two frequency masks (freq_mask_param=20) and
+ two time masks (time_mask_param=40) are applied randomly,
+ forcing the model to be robust to missing frequency bands
+ and time segments.
+
+ **Mixup (online, per batch):**
+ Pairs of training samples are blended together with a
+ random interpolation weight drawn from a Beta distribution
+ (alpha=0.4). Both the input spectrograms and their labels
+ are mixed, which acts as a strong regularizer and improves
+ generalization.
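
The Mixup step can be illustrated with NumPy. This sketch assumes one interpolation weight per batch and partners chosen by a random permutation of the batch, which is a common formulation; the README does not specify either detail, and the `mixup` helper is hypothetical.

```python
import numpy as np


def mixup(x, y_onehot, alpha=0.4, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # one interpolation weight per batch
    perm = rng.permutation(x.shape[0])      # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam


rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 1, 128, 344))          # (B, 1, n_mels, n_frames)
targets = np.eye(6)[rng.integers(0, 6, size=8)]    # one-hot labels, 6 classes
x_mix, y_mix, lam = mixup(batch, targets, alpha=0.4, rng=rng)
```

With alpha below 1 the Beta distribution is U-shaped, so most draws of `lam` sit near 0 or 1 and the blended samples stay close to one of the two originals.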
+
+ ### Loss Function
+ Cross-Entropy Loss with label smoothing (0.1).
+ Label smoothing prevents overconfident predictions and
+ improves calibration.
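
To make the effect of a 0.1 smoothing factor concrete, the target distribution for one sample can be written out directly. The sketch assumes the convention used by PyTorch's `CrossEntropyLoss(label_smoothing=...)`, where the one-hot target is mixed with a uniform distribution over all classes; `smoothed_targets` is a hypothetical helper.

```python
def smoothed_targets(label: int, num_classes: int = 6, smoothing: float = 0.1):
    """Target distribution: (1 - smoothing) * one_hot + smoothing / num_classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing) + uniform if k == label else uniform
        for k in range(num_classes)
    ]


target = smoothed_targets(label=2)
```

For 6 classes the true class receives 0.9 + 0.1/6 (about 0.917) and every other class receives 0.1/6 (about 0.017), so the loss never rewards pushing a probability all the way to 1.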
+
+ ### Optimizer & Scheduler
+ - Optimizer: AdamW (weight decay=1e-4)
+ - Scheduler: OneCycleLR with cosine annealing
+ - Max LR: 3e-3
+ - Warmup: 10% of total steps
+ - Gradient clipping: max norm = 1.0
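
The settings listed above map onto PyTorch roughly as follows. This is a sketch under assumptions: the tiny linear model, the dummy loss, and `total_steps = 100` are placeholders, and the 10% warmup is expressed through `pct_start=0.1`.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 6)          # stand-in for the SE-ResNet
total_steps = 100                 # hypothetical: epochs * batches per epoch

# The `lr` given here is overridden by OneCycleLR once the scheduler is built.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=total_steps,
    pct_start=0.1,                # first 10% of steps ramp the LR up
    anneal_strategy="cos",        # cosine annealing
)

lrs = []
for _ in range(total_steps):
    loss = model(torch.randn(4, 10)).pow(2).mean()   # dummy loss
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()
    scheduler.step()              # OneCycleLR is stepped once per batch
```

Note that `OneCycleLR` raises an error if it is stepped more than `total_steps` times, so the step count has to match the real training loop.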
+
+ ### Mixed Precision Training
+ All forward and backward passes use torch.amp autocast
+ with float16 precision, reducing memory usage and
+ speeding up training on GPU.
+
+ ### Multi-GPU Support
+ The model supports DataParallel training across multiple
+ GPUs automatically. The best model state is always saved
+ from the unwrapped module to ensure compatibility
+ during single-GPU inference.
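
Saving from the unwrapped module matters because `nn.DataParallel` prefixes every parameter name with `module.`. A minimal sketch of the pattern, with a hypothetical `unwrap` helper and a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 6))      # stand-in for the SE-ResNet
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)          # replicate across visible GPUs


def unwrap(m: nn.Module) -> nn.Module:
    """Return the underlying model whether or not it is wrapped."""
    return m.module if isinstance(m, nn.DataParallel) else m


# Keys saved from the unwrapped module carry no "module." prefix, so the
# checkpoint loads directly into a plain single-GPU (or CPU) model.
state = unwrap(model).state_dict()
```

The same `state` can then be passed to `torch.save` and later restored with `load_state_dict` on an unwrapped model.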
+
+ ### Early Stopping
+ Training stops automatically if validation accuracy
+ does not improve for 12 consecutive epochs (patience=12).
+ The best model checkpoint is saved based on validation
+ accuracy.
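
The patience rule can be expressed as a small counter. The class below is a hypothetical sketch rather than the project's implementation, and the accuracy history is made up to show when the stop triggers.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best:
            self.best = val_acc       # new best: a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=12)
history = [0.70, 0.80] + [0.79] * 20          # made-up validation accuracies
stopped_at = next(i for i, acc in enumerate(history, start=1) if stopper.step(acc))
```

In this made-up run the best accuracy is reached at epoch 2, and training stops 12 epochs later, at epoch 14.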
+
+ | Setting         | Value                         |
+ |-----------------|-------------------------------|
+ | Optimizer       | AdamW                         |
+ | Max LR          | 3e-3                          |
+ | LR Schedule     | OneCycleLR (cosine annealing) |
+ | Weight Decay    | 1e-4                          |
+ | Max Epochs      | 60                            |
+ | Early Stopping  | Patience = 12                 |
+ | Batch Size      | 64                            |
+ | Label Smoothing | 0.1                           |
+ | Mixup Alpha     | 0.4                           |
+ | Mixed Precision | float16 (AMP)                 |
+ | Dropout         | 0.3                           |
+
+ ---
+
+ ## Inference
+
+ During inference, audio files are processed strictly
+ one-by-one in naturally sorted order (1.wav, 2.wav, ...).
+ The preprocessing pipeline runs on each file individually,
+ and only the processing + prediction time is measured
+ (I/O reading is excluded from the timer).
+
+ Two output files are produced:
+ - `results.txt`: one predicted class label (0–5) per line
+ - `time.txt`: processing time per file in seconds
+ (rounded to 3 decimal places)
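
The ordering and timing rules in this section can be sketched with the standard library only. Everything here is illustrative: `natural_key`, `run_inference`, and the stand-in `load` and `predict` callables are assumptions, and a real run would read each `.wav` and call the model.

```python
import re
import time


def natural_key(name: str):
    """Sort key that compares digit runs as integers ("2.wav" < "10.wav")."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def run_inference(files, load, predict):
    """Time only `predict`; the `load` call (file I/O) stays outside the timer."""
    labels, times = [], []
    for name in sorted(files, key=natural_key):
        audio = load(name)                      # not timed
        start = time.perf_counter()
        labels.append(predict(audio))           # processing + prediction
        times.append(time.perf_counter() - start)
    return labels, times


# Stand-in callables; a real run would read the .wav and call the model.
names = ["10.wav", "2.wav", "1.wav"]
labels, times = run_inference(names, load=lambda n: n, predict=lambda a: 0)
results_txt = "\n".join(str(y) for y in labels)     # contents of results.txt
time_txt = "\n".join(f"{t:.3f}" for t in times)     # contents of time.txt
```

A plain lexicographic sort would place `10.wav` before `2.wav`, which is why the key converts digit runs to integers.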
+
+ ---
+
+ ## Requirements
+
+ - Python 3.8+
+ - PyTorch
+ - torchaudio
+ - librosa
+ - noisereduce
+ - numpy
+ - soundfile
+ - scikit-learn
+
+ ---
+
+ ## Limitations
+
+ - Trained only on 3 specific machine types; may not
+ generalize to unseen machine types out of the box
+ - Performance may degrade with extremely noisy
+ environments beyond the training distribution
+ - Fixed 11-second input window; very short recordings
+ are zero-padded which may affect accuracy
+
+ ---
+
+ ## Team
+
+ Cairo University, Faculty of Engineering
+ Computer Engineering Department
+ Pattern Recognition and Neural Networks, Spring 2026