| # Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology | |
| ### 1) Problem Framing | |
| We treat washing-machine sound understanding as a **two-stage hierarchical image classification** task: | |
| 1. **Stage-1 (Coarse):** Detect whether a sound is **Abnormal** or **Normal** from its Mel-spectrogram. | |
| 2. **Stage-2 (Fine):** If **Abnormal**, classify the failure mode (e.g., *Bearing noise*, *Dehydration mode noise*). If **Normal**, classify the operating mode (e.g., *Wash*, *Spin*). | |
| This decouples anomaly detection from mode identification: each Stage-2 head only has to separate the subclasses within its own branch, which is intended to reduce class confusion. | |
| --- | |
| ### 2) Data & Labeling | |
| - **Source:** Short `.wav` recordings of washing-machine cycles (mono). | |
| - **Label Taxonomy:** | |
| ```bash | |
| 00-Abnormal/ | |
| ├── 00-1 - Background noise/ | |
| ├── 00-2 - Dehydration mode noise/ | |
| └── 00-3 - Wash mode noise/ | |
| 01-Normal/ | |
| ├── 01-1 - Background noise/ | |
| ├── 01-2 - Dehydration mode noise/ | |
| └── 01-3 - Wash mode noise/ | |
| ``` | |
| - **Granularity:** Each file is a single clip labeled at the folder level. | |
| > To avoid label leakage, clips from the **same physical machine / session** should not be split across train and validation sets (group-aware split). | |
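A group-aware split can be sketched with the standard library alone. The example below assumes each clip carries a session or machine identifier; the `(path, group_id)` pairs, the file names, and the `group_split` helper are hypothetical, since the source does not say how sessions are recorded.

```python
import random
from collections import defaultdict

def group_split(items, val_fraction=0.2, seed=42):
    """Split (path, group_id) pairs so that no group spans both sets.

    Whole groups are assigned to validation until roughly
    `val_fraction` of the clips are covered; the rest go to train.
    """
    by_group = defaultdict(list)
    for path, group_id in items:
        by_group[group_id].append(path)

    groups = sorted(by_group)              # deterministic order before shuffling
    random.Random(seed).shuffle(groups)

    target = val_fraction * len(items)
    train, val = [], []
    for g in groups:
        # Fill validation first, then send the remaining groups to train.
        (val if len(val) < target else train).extend(by_group[g])
    return train, val

# Hypothetical clips: 5 recording sessions with 4 clips each.
clips = [(f"s{s}_clip{c}.wav", f"session{s}") for s in range(5) for c in range(4)]
train_files, val_files = group_split(clips)
```

Because whole sessions are moved at once, the validation share only approximates `val_fraction` when sessions differ in size.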
| --- | |
| ### 3) Preprocessing: Mel-Spectrograms | |
| - **Audio params:** `sr=22050`, `n_fft=2048`, `hop_length=512`, `n_mels=128` | |
| - **Transform:** | |
| 1. Load mono audio: \( y \in \mathbb{R}^{T} \) | |
| 2. Mel power spectrogram: \( S = \text{MelSpec}(y; sr, n\_mels, n\_fft, hop) \) | |
| 3. Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \) | |
| - **Rendering:** `librosa.display.specshow(S_db, cmap="magma")`, save to PNG, **no axes**, `224×224` target size. | |
| - **Normalization:** Divide pixel values by `255.0` at model input. | |
| All scripts use the same constants to ensure train/test consistency. | |
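The transform steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the helper names are hypothetical, and the `amin` floor and `top_db` clamp are assumptions borrowed from the defaults of `librosa.power_to_db`. `librosa` is imported lazily so the dB step can be tried with NumPy alone. PNG rendering is not covered here.

```python
import numpy as np

SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128

def power_to_db_max_ref(S, amin=1e-10, top_db=80.0):
    """S_dB = 10 * log10(S / max(S)), floored at -top_db (assumed clamp)."""
    S = np.maximum(np.asarray(S, dtype=float), amin)  # avoid log10(0)
    S_db = 10.0 * np.log10(S / S.max())
    return np.maximum(S_db, -top_db)

def wav_to_logmel(path):
    """Load a mono clip and return its log-scaled Mel power spectrogram."""
    import librosa  # lazy import: only needed when reading real audio

    y, _ = librosa.load(path, sr=SR, mono=True)
    S = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return power_to_db_max_ref(S)

# Tiny hand-made power values to show the scaling.
demo_db = power_to_db_max_ref([[1.0, 0.1], [0.01, 1e-12]])
```

With the maximum as reference, the loudest bin maps to 0 dB and everything else is negative, which keeps the colour scale comparable across clips of different loudness.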
| --- | |
| ### 4) Dataset Construction | |
| - **Stage-1 dataset:** `MelSpectrograms/` with the two top-level folders (`00 - Abnormal`, `01 - Normal`). | |
| - **Stage-2 datasets:** | |
| - **Abnormal head:** `MelSpectrograms/00 - Abnormal/*` | |
| - **Normal head:** `MelSpectrograms/01 - Normal/*` | |
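| - **Splits:** `validation_split=0.2`, `seed=42` via `image_dataset_from_directory`. This is a random file-level split, so the group-aware split recommended in Section 2 has to be enforced separately (for example, by arranging files into train/validation directories beforehand). | |
| - **Class Order:** Persisted in `saved_models/label_meta.json` to guarantee consistent label ↔ index mapping at inference. | |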
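Persisting the class order can be sketched as below. The JSON keys (`stage1`, `abnormal`, `normal`) and the helper names are assumptions, since the source does not show the layout of `label_meta.json`. The sorted folder listing mirrors the alphanumeric order that `image_dataset_from_directory` uses for inferred labels.

```python
import json
import tempfile
from pathlib import Path

def class_names_from_dir(root):
    """Sub-folder names in sorted (alphanumeric) order."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())

def save_label_meta(path, stage1, abnormal, normal):
    # Key names are hypothetical; any stable schema works.
    meta = {"stage1": stage1, "abnormal": abnormal, "normal": normal}
    Path(path).write_text(json.dumps(meta, indent=2), encoding="utf-8")

def load_label_meta(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Round trip on a throwaway directory that imitates the top-level layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "MelSpectrograms"
    for name in ("01 - Normal", "00 - Abnormal"):
        (root / name).mkdir(parents=True)
    stage1_names = class_names_from_dir(root)
    meta_path = Path(tmp) / "label_meta.json"
    save_label_meta(meta_path, stage1_names, ["a", "b"], ["c", "d"])
    restored = load_label_meta(meta_path)
```

At inference, the loaded lists are indexed with the `argmax` of each model's output, so the mapping no longer depends on the directory contents of the serving machine.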
| --- | |
| ### 5) Models & Architecture | |
| Both stages use a compact CNN to keep inference light: | |
| - **Backbone (per head):** | |
| - `Conv2D(32, 3×3) → ReLU → MaxPool(2×2)` | |
| - `Conv2D(64, 3×3) → ReLU → MaxPool(2×2)` | |
| - `Conv2D(128, 3×3) → ReLU → MaxPool(2×2)` | |
| - `Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax` | |
| - **Input:** `224×224×3` spectrogram images | |
| - **Loss:** `SparseCategoricalCrossentropy` | |
| - **Optimizer:** `Adam` | |
| - **Metrics:** `Accuracy` | |
| > Rationale: A simple CNN is sufficient for a strong baseline; the hierarchy offloads fine-grained distinctions to specialized heads. | |
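One way to express the head described above in Keras is sketched below. It assumes the Keras default `valid` padding (the source does not state the padding) and inputs already scaled by `1/255`. `build_head` and `flattened_features` are hypothetical names, and TensorFlow is imported lazily so the shape arithmetic can be checked without it.

```python
def build_head(num_classes, input_shape=(224, 224, 3)):
    """Compact CNN head: 3 x (Conv -> ReLU -> MaxPool), then a dense classifier."""
    import tensorflow as tf  # lazy import: only needed to build the model

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model

def flattened_features(size=224, filters=128, n_blocks=3):
    """Feature count entering Dense(128), assuming 'valid' 3x3 convs and 2x2 pools."""
    for _ in range(n_blocks):
        size = (size - 2) // 2  # conv trims 2 pixels, pooling halves (floor)
    return size * size * filters

n_flat = flattened_features()
dense_params = n_flat * 128 + 128  # weights + biases of Dense(128)
```

Under these assumptions the spatial size shrinks 224 → 111 → 54 → 26, so the first dense layer holds most of the parameters; this is worth keeping in mind when comparing against the pretrained backbones mentioned under future work.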
| --- | |
| ### 6) Training Protocol | |
| - **Stage-1:** Train on `Normal` vs `Abnormal` spectrograms. | |
| - **Stage-2 Abnormal:** Train only on abnormal subclasses. | |
| - **Stage-2 Normal:** Train only on normal subclasses. | |
| - **Epochs:** `10` (baseline; tune as needed) | |
| - **Batch size:** `32` | |
| - **Pipelines:** `cache → (shuffle) → prefetch` with `tf.data.AUTOTUNE` | |
| - **Checkpointing:** Save each head to `saved_models/*.h5` and class orders to `label_meta.json`. | |
| Optional (recommended): | |
| - **Augmentations:** time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio. | |
| - **Class imbalance:** oversampling minority subclasses or focal loss in Stage-2 heads. | |
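Time and frequency masking can be sketched with NumPy as below. This is an illustration under assumptions: it operates on the dB array before rendering (the source trains on PNG images, so the real insertion point may differ), the mask widths are arbitrary, and masked regions are filled with the quietest value in the clip. `mask_spectrogram` is a hypothetical helper.

```python
import numpy as np

def mask_spectrogram(S_db, max_time=20, max_freq=12, rng=None):
    """Blank one random frequency band and one random time span.

    S_db has shape (n_mels, n_frames). The input is left unmodified.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = S_db.copy()
    fill = out.min()                       # assumed fill value: quietest bin
    n_mels, n_frames = out.shape

    f = rng.integers(0, max_freq + 1)      # band height, 0..max_freq
    f0 = rng.integers(0, n_mels - f + 1)   # band start
    out[f0:f0 + f, :] = fill

    t = rng.integers(0, max_time + 1)      # span width, 0..max_time
    t0 = rng.integers(0, n_frames - t + 1) # span start
    out[:, t0:t0 + t] = fill
    return out

# Synthetic spectrogram: 128 Mel bands x 216 frames of values in [-80, 0] dB.
S_demo = np.random.default_rng(0).uniform(-80.0, 0.0, size=(128, 216))
S_before = S_demo.copy()
S_aug = mask_spectrogram(S_demo, rng=np.random.default_rng(1))
```

Passing a seeded generator makes the augmentation reproducible, which helps when debugging a training run.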
| --- | |
| ### 7) Inference Flow (Hierarchical) | |
| **Input:** `.wav` → Mel-spectrogram → `224×224` | |
| 1. **Stage-1:** `p_stage1 = f_stage1(img)` → `y1 = argmax(p_stage1)` | |
| 2. **Route:** | |
| - If `y1 == "00 - Abnormal"` → use `abnormal_model` | |
| - Else → use `normal_model` | |
| 3. **Stage-2:** `p_stage2 = f_head(img)` → `y2 = argmax(p_stage2)` | |
| 4. **Output:** | |
| `final = f"{y1.split(' - ')[1]} → {y2}"` (with `y1` and `y2` taken as class names) | |
| plus confidences: `max(p_stage1)`, `max(p_stage2)` | |
| **Pseudocode** | |
| ```python | |
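| import numpy as np | |
| def predict(wav): | |
|     # Helpers, models and class-name lists are assumed to be loaded beforehand; | |
|     # class_names_abnormal / class_names_normal are illustrative names. | |
|     spec = to_mel_spectrogram(wav) | |
|     img = preprocess(spec)  # 224x224, /255.0; assumed to include a batch axis | |
|     p1 = np.asarray(stage1_model(img))[0]  # [2] | |
|     y1 = int(np.argmax(p1)) | |
|     stage1_class = class_names_stage1[y1] | |
|     # Route on the predicted class name rather than on a hard-coded index. | |
|     is_abnormal = stage1_class == "00 - Abnormal" | |
|     head = abnormal_model if is_abnormal else normal_model | |
|     class_names_stage2 = class_names_abnormal if is_abnormal else class_names_normal | |
|     p2 = np.asarray(head(img))[0]  # [num_subclasses] | |
|     y2 = int(np.argmax(p2)) | |
|     return { | |
|         "stage1_class": stage1_class, | |
|         "stage1_confidence": float(p1.max()), | |
|         "stage2_class": class_names_stage2[y2], | |
|         "stage2_confidence": float(p2.max()), | |
|         "final_prediction": ...,  # formatted as in step 4 | |
|     } | |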
| ``` | |
| ### 8) Evaluation | |
| - **Per-stage metrics:** accuracy, macro-F1, confusion matrices. | |
| - **End-to-end metric:** hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct. | |
| - **Calibration:** reliability curves / ECE on the maximum softmax probability for Stage-1 and Stage-2; optionally apply temperature scaling. | |
| - **Robustness checks:** background noise levels, recording device variance, different drum loads. | |
| - **Leakage control:** ensure clips from the same recording session are in one split only. | |
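The end-to-end and calibration metrics above can be sketched in NumPy. Both helpers are hypothetical; the ECE variant shown uses equal-width confidence bins, which is one common choice and not necessarily the one used in this project. The sample labels and confidences are made up for illustration.

```python
import numpy as np

def hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred):
    """Fraction of samples where both the Stage-1 and Stage-2 labels are correct."""
    stage1_ok = np.asarray(y1_true) == np.asarray(y1_pred)
    stage2_ok = np.asarray(y2_true) == np.asarray(y2_pred)
    return float(np.mean(stage1_ok & stage2_ok))

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin k covers [k/n_bins, (k+1)/n_bins); confidence 1.0 goes to the last bin.
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's sample share
    return float(ece)

# Made-up predictions for four clips (labels compared as strings).
y1_true = ["Abnormal", "Abnormal", "Normal", "Normal"]
y1_pred = ["Abnormal", "Normal", "Normal", "Normal"]
y2_true = ["Wash mode noise", "Background noise", "Wash mode noise", "Dehydration mode noise"]
y2_pred = ["Wash mode noise", "Background noise", "Wash mode noise", "Wash mode noise"]
h_acc = hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred)

ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

Stage-2 labels are compared as names here because the two heads share index ranges; a sample routed to the wrong head is counted as wrong through the Stage-1 term even if the subclass name happens to match.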
| ### 9) Deployment Considerations | |
| - **App:** Gradio front-end calls the same spectrogram + inference pipeline. | |
| - **Artifacts:** `saved_models/{stage1,abnormal,normal}.h5` + `saved_models/label_meta.json` | |
| - **Reproducibility:** fixed audio/spectrogram params and consistent class order. | |
| - **Latency:** spectrogram generation dominates; keep `n_fft`/`hop_length` fixed and consider caching frequent uploads. | |
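Caching repeated uploads could look like the sketch below, which keys results on a SHA-256 digest of the raw file bytes. `PredictionCache` and `fake_predict` are hypothetical; the source does not describe how (or whether) the app caches anything, and a real deployment would also want a size limit.

```python
import hashlib

class PredictionCache:
    """In-memory cache keyed by the SHA-256 digest of the uploaded bytes."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._store = {}
        self.hits = 0

    def predict(self, wav_bytes):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key in self._store:
            self.hits += 1                 # identical upload seen before
        else:
            self._store[key] = self._predict_fn(wav_bytes)
        return self._store[key]

# Stand-in for the spectrogram + inference pipeline.
calls = []

def fake_predict(wav_bytes):
    calls.append(len(wav_bytes))
    return {"final_prediction": "placeholder"}

cache = PredictionCache(fake_predict)
first = cache.predict(b"RIFF-example-bytes")
second = cache.predict(b"RIFF-example-bytes")
```

Hashing the bytes rather than the file name means a re-upload under a different name still hits the cache, while any change to the audio produces a new key.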
| ### 10) Limitations & Future Work | |
| - **Domain shift:** different washers/rooms/mics can reduce accuracy; consider domain adaptation / augmentation. | |
| - **Simple CNN:** consider replacing it with MobileNetV2/EfficientNet, which may improve accuracy at comparable latency. | |
| - **Sequence modeling:** incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches). | |
| - **On-device:** quantize models (TFLite) for edge deployment. | |