---
license: apache-2.0
language:
- en
tags:
- audio-classification
- pytorch
- se-resnet
- machine-fault-detection
- predictive-maintenance
- mel-spectrogram
pipeline_tag: audio-classification
---

# Filter-Tank — Machine Fault Recognition

A deep learning system that listens to factory machine audio recordings and classifies them into 6 categories: 3 machine types, each in either a normal or abnormal state. Built from scratch using an SE-ResNet on log-mel spectrograms.

---

## Overview

Filter-Tank is a complete machine learning pipeline for predictive maintenance. Given a raw `.wav` recording of a factory machine, the system detects whether the machine is operating normally or has developed a fault, and identifies which machine type it belongs to.

The model is a custom SE-ResNet (Squeeze-and-Excitation ResNet) trained entirely from scratch with no pretrained weights, designed specifically for 1-channel log-mel spectrogram input.

---

## Classes

| Label | Description          |
|-------|----------------------|
| 0     | Machine 1 — Normal   |
| 1     | Machine 1 — Abnormal |
| 2     | Machine 2 — Normal   |
| 3     | Machine 2 — Abnormal |
| 4     | Machine 3 — Normal   |
| 5     | Machine 3 — Abnormal |

---

## Preprocessing Pipeline

Every audio file passes through a multi-stage preprocessing pipeline before reaching the model. All steps run on CPU. File I/O is excluded from the inference timer; only preprocessing + prediction time is measured.

### 1. Resampling
All audio is resampled to a fixed sample rate of 16,000 Hz to ensure consistency across recordings made with different microphones or recording equipment.

### 2. Noise Reduction
Non-stationary background noise is removed using the `noisereduce` library at full strength (`prop_decrease=1.0`). This handles real-world factory environments where background noise varies significantly between recordings.

### 3. Silence Trimming
Leading and trailing silence is removed with librosa's trim function (`top_db=20`), so the model focuses on the actual machine sound rather than quiet gaps at the start or end of a recording.

### 4. Fixed-Length Normalization
All recordings are normalized to exactly 11 seconds: longer files are truncated at the end, shorter files are zero-padded at the end. This gives the model a consistent input size regardless of the original recording length.
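Step 4 reduces to a few lines of NumPy; `fix_length` is an illustrative name, not necessarily the project's actual helper (steps 1 and 3 would typically go through `librosa.resample` and `librosa.effects.trim`):

```python
import numpy as np

SAMPLE_RATE = 16_000                       # Hz, fixed in step 1
TARGET_LEN = SAMPLE_RATE * 11              # 11 s -> 176,000 samples

def fix_length(wave: np.ndarray) -> np.ndarray:
    """Truncate or zero-pad a mono waveform to exactly 11 seconds."""
    if len(wave) >= TARGET_LEN:
        return wave[:TARGET_LEN]                       # truncate at the end
    return np.pad(wave, (0, TARGET_LEN - len(wave)))   # zero-pad at the end
```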
### 5. Log-Mel Spectrogram
The waveform is converted into a 2D log-mel spectrogram with the following settings:
- Mel bands: 128
- FFT window size: 1024
- Hop length: 512
- Power: 2.0 (power spectrogram)
- Amplitude converted to dB scale (`top_db=80`)

This transforms the raw audio signal into a time-frequency representation that the convolutional model can process effectively.

### 6. CMVN Normalization
Cepstral Mean and Variance Normalization is applied per sample: each spectrogram is normalized to zero mean and unit variance along the time axis. This compensates for volume variations and differences in microphone sensitivity across recordings.
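The CMVN step is a one-liner per spectrogram; the sketch below assumes a `(n_mels, n_frames)` NumPy array (with librosa, the spectrogram itself would come from `librosa.feature.melspectrogram` followed by `librosa.power_to_db`):

```python
import numpy as np

def cmvn(spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize a (n_mels, n_frames) log-mel spectrogram to zero mean
    and unit variance along the time axis (per mel band, per sample)."""
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)   # eps guards silent (constant) bands
```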
---

## Model Architecture

### SE-ResNet (Squeeze-and-Excitation ResNet)

The model follows a standard ResNet structure enhanced with Squeeze-and-Excitation (SE) attention blocks at every residual stage.

**Stem:** A 7x7 convolution (stride 2) followed by batch normalization, ReLU, and max pooling reduces the input resolution before the residual stages.

**4 Residual Stages:**
- Stage 1: 3 SE-residual blocks, 64 channels
- Stage 2: 4 SE-residual blocks, 128 channels (stride 2)
- Stage 3: 6 SE-residual blocks, 256 channels (stride 2)
- Stage 4: 3 SE-residual blocks, 512 channels (stride 2)

**SE Attention Block:** Each residual block includes a Squeeze-and-Excitation module that performs global average pooling, passes the result through two fully connected layers with a bottleneck (`reduction=16`), and produces per-channel attention weights via a sigmoid. This lets the model focus on the most informative frequency channels for each input.
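A minimal PyTorch sketch of the SE module described above (names are illustrative; an actual implementation might use 1x1 `Conv2d` layers instead of `Linear`):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: re-weights channels by learned attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # bottleneck
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average pool over the (freq, time) plane -> (B, C)
        w = x.mean(dim=(2, 3))
        # Excitation: bottleneck MLP + sigmoid -> per-channel weights in (0, 1)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))
        # Scale: broadcast the weights back over freq and time
        return x * w[:, :, None, None]
```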
**Head:** Global Average Pooling → Dropout (0.3) → Fully Connected layer → 6-class output.

**Weight Initialization:**
- Conv layers: Kaiming normal (`fan_out`, ReLU)
- BatchNorm: weight = 1, bias = 0
- Linear layers: Xavier uniform
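The initialization table maps directly onto a `model.apply(...)` hook; a sketch, assuming 2D convolutions and batch norm:

```python
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    """Weight init scheme from the list above; use model.apply(init_weights)."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
```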
**Total Parameters:** ~11 million

---

## Training Details

### Dataset Split
The dataset is divided with stratified splitting to ensure balanced class representation across all splits:
- Training set: 80%
- Validation set: 10%
- Test set: 10%

Stratification is done on machine type and condition combined, so each split has proportional representation of all 6 classes.
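An 80/10/10 stratified split is typically done in two passes with scikit-learn's `train_test_split`; a sketch (`random_state=42` is an illustrative choice, not necessarily the project's seed):

```python
from sklearn.model_selection import train_test_split

def split_indices(labels):
    """Return stratified (train, val, test) index lists: 80% / 10% / 10%."""
    idx = list(range(len(labels)))
    # First pass: 80% train vs. 20% holdout, stratified on the 6-way label.
    train, hold = train_test_split(
        idx, test_size=0.2, stratify=labels, random_state=42)
    # Second pass: split the holdout in half -> 10% val, 10% test.
    val, test = train_test_split(
        hold, test_size=0.5, stratify=[labels[i] for i in hold], random_state=42)
    return train, val, test
```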
### Class Imbalance Handling
A `WeightedRandomSampler` oversamples underrepresented classes during training, so the model sees a roughly balanced distribution of all 6 classes per epoch regardless of the original dataset distribution.
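Inverse-frequency weights are the standard way to build such a sampler; a sketch:

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels) -> WeightedRandomSampler:
    """One weight per sample: 1 / (count of its class), so rare classes
    are drawn about as often as common ones (sampling with replacement)."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```

The sampler is then passed to the `DataLoader` via its `sampler` argument.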
### Data Augmentation
Two augmentation strategies are applied during training:

**SpecAugment (online, per batch):**
Applied directly to the spectrogram tensors during training. Two frequency masks (`freq_mask_param=20`) and two time masks (`time_mask_param=40`) are applied at random positions, forcing the model to be robust to missing frequency bands and time segments.
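In plain PyTorch the masking reduces to zeroing random bands; torchaudio's `FrequencyMasking`/`TimeMasking` transforms do the same thing. A sketch for a batch of `(B, 1, n_mels, n_frames)` tensors:

```python
import torch

def spec_augment(spec: torch.Tensor,
                 n_freq_masks: int = 2, freq_param: int = 20,
                 n_time_masks: int = 2, time_param: int = 40) -> torch.Tensor:
    """Zero out random frequency bands and time segments (on a copy)."""
    out = spec.clone()
    n_mels, n_frames = out.shape[-2], out.shape[-1]
    for _ in range(n_freq_masks):
        width = int(torch.randint(0, freq_param + 1, (1,)))
        start = int(torch.randint(0, n_mels - width + 1, (1,)))
        out[..., start:start + width, :] = 0.0       # mask a frequency band
    for _ in range(n_time_masks):
        width = int(torch.randint(0, time_param + 1, (1,)))
        start = int(torch.randint(0, n_frames - width + 1, (1,)))
        out[..., :, start:start + width] = 0.0       # mask a time segment
    return out
```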
**Mixup (online, per batch):**
Pairs of training samples are blended with a random interpolation weight drawn from a Beta distribution (`alpha=0.4`). Both the input spectrograms and their labels are mixed, which acts as a strong regularizer and improves generalization.
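A minimal mixup step; the caller combines the loss as `lam * ce(pred, y_a) + (1 - lam) * ce(pred, y_b)`:

```python
import numpy as np
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.4):
    """Blend the batch with a shuffled copy of itself; returns the mixed
    inputs, both label sets, and the interpolation weight lam."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y, y[perm], lam
```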
### Loss Function
Cross-entropy loss with label smoothing (0.1). Label smoothing discourages overconfident predictions and improves calibration.
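In PyTorch this is a single built-in option:

```python
import torch.nn as nn

# label_smoothing=0.1 mixes the one-hot target with a uniform distribution
# (weight 0.1), so the model is never pushed toward probability exactly 1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```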
### Optimizer & Scheduler
- Optimizer: AdamW (weight decay = 1e-4)
- Scheduler: OneCycleLR with cosine annealing
- Max LR: 3e-3
- Warmup: 10% of total steps
- Gradient clipping: max norm = 1.0
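These settings map directly onto PyTorch; the stand-in model and `steps_per_epoch` below are illustrative values, not the project's:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 6)                  # stand-in for the SE-ResNet
steps_per_epoch, max_epochs = 100, 60      # illustrative values

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=steps_per_epoch * max_epochs,
    pct_start=0.1,             # 10% of steps spent warming up
    anneal_strategy="cos",     # cosine annealing after the peak
)
# Per step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step()
```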
### Mixed Precision Training
All forward and backward passes use `torch.amp` autocast with float16 precision, reducing memory usage and speeding up training on GPU.
### Multi-GPU Support
The model supports `DataParallel` training across multiple GPUs. The best model state is always saved from the unwrapped module (`model.module`) so the checkpoint stays loadable for single-GPU inference.

### Early Stopping
Training stops automatically if validation accuracy does not improve for 12 consecutive epochs (patience = 12). The best checkpoint is selected by validation accuracy.
| Setting         | Value                         |
|-----------------|-------------------------------|
| Optimizer       | AdamW                         |
| Max LR          | 3e-3                          |
| LR Schedule     | OneCycleLR (cosine annealing) |
| Weight Decay    | 1e-4                          |
| Max Epochs      | 60                            |
| Early Stopping  | Patience = 12                 |
| Batch Size      | 64                            |
| Label Smoothing | 0.1                           |
| Mixup Alpha     | 0.4                           |
| Mixed Precision | float16 (AMP)                 |
| Dropout         | 0.3                           |

---

## Inference

During inference, audio files are processed strictly one-by-one in naturally sorted order (1.wav, 2.wav, ...). The preprocessing pipeline runs on each file individually, and only the preprocessing + prediction time is measured; file reading (I/O) is excluded from the timer.

Two output files are produced:
- `results.txt` — one predicted class label (0–5) per line
- `time.txt` — processing time per file in seconds, rounded to 3 decimal places
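Natural ordering ("2.wav" before "10.wav") needs a numeric-aware sort key rather than plain string sorting; a sketch:

```python
import re

def natural_key(name: str):
    """Compare digit runs as integers, so '2.wav' sorts before '10.wav'."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]

files = ["10.wav", "2.wav", "1.wav"]
ordered = sorted(files, key=natural_key)   # ['1.wav', '2.wav', '10.wav']
```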
---

## Requirements

- Python 3.8+
- PyTorch
- torchaudio
- librosa
- noisereduce
- numpy
- soundfile
- scikit-learn

---

## Limitations

- Trained on 3 specific machine types only; may not generalize to unseen machine types out of the box
- Performance may degrade in extremely noisy environments beyond the training distribution
- Fixed 11-second input window; very short recordings are zero-padded, which may affect accuracy

---

## Team

Cairo University — Faculty of Engineering
Computer Engineering Department
Pattern Recognition and Neural Networks — Spring 2026