# Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology
### 1) Problem Framing
We treat washing-machine sound understanding as a **two-stage hierarchical image classification** task:
1. **Stage-1 (Coarse):** Detect whether a sound is **Abnormal** or **Normal** from its Mel-spectrogram.
2. **Stage-2 (Fine):** If **Abnormal**, classify the failure mode (e.g., *Bearing noise*, *Dehydration mode noise*). If **Normal**, classify the operating mode (e.g., *Wash*, *Spin*).
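This decouples anomaly detection from mode identification: each Stage-2 head only has to separate the classes within its own branch, which is intended to reduce confusion between classes from different branches.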
---
### 2) Data & Labeling
- **Source:** Short `.wav` recordings of washing-machine cycles (mono).
- **Label Taxonomy:**
```bash
00-Abnormal/
├─ 00-1 - Background noise/
├─ 00-2 - Dehydration mode noise/
└─ 00-3 - Wash mode noise/
01-Normal/
├─ 01-1 - Background noise/
├─ 01-2 - Dehydration mode noise/
└─ 01-3 - Wash mode noise/
```
- **Granularity:** Each file is a single clip labeled at the folder level.
> To avoid label leakage, clips from the **same physical machine / session** should not be split across train and validation sets (group-aware split).
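
A group-aware split can be sketched as follows. This is a minimal illustration, not the project's actual split code: it assumes each clip comes with a `group_id` (a hypothetical machine or session identifier; the source does not say how this is recorded), and it assigns whole groups to validation, so `val_fraction` applies to groups rather than to individual clips.

```python
import random
from collections import defaultdict


def group_aware_split(clips, val_fraction=0.2, seed=42):
    """Split (path, group_id) pairs so that no group appears in both sets.

    `group_id` is a hypothetical per-clip machine/session identifier.
    """
    by_group = defaultdict(list)
    for path, group_id in clips:
        by_group[group_id].append(path)

    groups = sorted(by_group)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(groups)

    # At least one whole group goes to validation.
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])

    train = [p for g in groups if g not in val_groups for p in by_group[g]]
    val = [p for g in groups if g in val_groups for p in by_group[g]]
    return train, val, val_groups
```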
---
### 3) Preprocessing → Mel-Spectrograms
- **Audio params:** `sr=22050`, `n_fft=2048`, `hop_length=512`, `n_mels=128`
- **Transform:**
1. Load mono audio: \( y \in \mathbb{R}^{T} \)
2. Mel power spectrogram: \( S = \text{MelSpec}(y; sr, n\_mels, n\_fft, hop) \)
3. Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \)
- **Rendering:** `librosa.display.specshow(S_db, cmap="magma")`, save to PNG, **no axes**, `224×224` target size.
- **Normalization:** Divide pixel values by `255.0` at model input.
All scripts use the same constants to ensure train/test consistency.
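
To make the log-scaling step and the resulting spectrogram shape concrete, here is a dependency-free sketch. It assumes the scripts rely on librosa's default centre-padded framing and on something equivalent to `librosa.power_to_db(S, ref=np.max)`; the `amin` floor and `top_db` clip are included on that assumption and are not stated in the source. The function names are illustrative only.

```python
import math

# Constants from the audio parameters listed above.
SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128


def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Number of spectrogram columns for a centre-padded signal."""
    return 1 + num_samples // hop_length


def power_to_db_relative(S, amin=1e-10, top_db=80.0):
    """S_dB = 10 * log10(S / max(S)) for a 2-D list of mel power values.

    `amin` avoids log(0); values more than `top_db` below the peak are clipped.
    """
    ref = max(amin, max(max(row) for row in S))
    floor = -top_db
    return [
        [max(10.0 * math.log10(max(v, amin) / ref), floor) for v in row]
        for row in S
    ]


# A one-second clip at 22050 Hz gives a 128 x num_frames(SR) mel spectrogram.
one_second_shape = (N_MELS, num_frames(SR))
```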
---
### 4) Dataset Construction
- **Stage-1 dataset:** `MelSpectrograms/` with the two top-level folders (`00 - Abnormal`, `01 - Normal`).
- **Stage-2 datasets:**
- **Abnormal head:** `MelSpectrograms/00 - Abnormal/*`
- **Normal head:** `MelSpectrograms/01 - Normal/*`
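- **Splits:** `validation_split=0.2`, `seed=42` via `image_dataset_from_directory`. This split is made per file, so on its own it does not enforce the group-aware split recommended in Section 2.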
- **Class Order:** Persisted in `saved_models/label_meta.json` to guarantee consistent label ↔ index mapping at inference.
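
One way to persist the class order is sketched below. The JSON layout (`stage1` and `stage2` keys) is a hypothetical schema chosen for illustration; the source does not specify the actual structure of `label_meta.json`. Folder names are sorted because `image_dataset_from_directory` infers class names in alphanumeric order, which should coincide with Python's `sorted` for folder names like the ones used here.

```python
import json
from pathlib import Path


def class_names(directory):
    """Sub-folder names in sorted order (assumed to match the inferred label order)."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_dir())


def write_label_meta(spectrogram_root, out_path):
    """Write the Stage-1 and per-head Stage-2 class orders to a JSON file."""
    root = Path(spectrogram_root)
    stage1 = class_names(root)
    meta = {
        "stage1": stage1,
        # One ordered class list per Stage-2 head, keyed by the Stage-1 folder name.
        "stage2": {top: class_names(root / top) for top in stage1},
    }
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```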
---
### 5) Models & Architecture
Both stages use a compact CNN to keep inference light:
- **Backbone (per head):**
- `Conv2D(32, 3×3) → ReLU → MaxPool(2×2)`
- `Conv2D(64, 3×3) → ReLU → MaxPool(2×2)`
- `Conv2D(128, 3×3) → ReLU → MaxPool(2×2)`
- `Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax`
- **Input:** `224×224×3` spectrogram images
- **Loss:** `SparseCategoricalCrossentropy`
- **Optimizer:** `Adam`
- **Metrics:** `Accuracy`
> Rationale: A simple CNN is sufficient for a strong baseline; the hierarchy offloads fine-grained distinctions to specialized heads.
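
As a rough check on how "compact" this backbone is, the feature-map sizes and parameter counts can be worked out by hand. The sketch below assumes Keras defaults for the convolutions (`padding="valid"`, stride 1), which the source does not state; with `"same"` padding the numbers would differ. Under these assumptions the first `Dense(128)` layer holds about 99% of all parameters, so it is the main lever for shrinking the model.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution with 'valid' padding (assumed)."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


def count_params(input_size=224, in_channels=3, filters=(32, 64, 128),
                 dense_units=128, num_classes=2):
    size, channels, conv_total = input_size, in_channels, 0
    for f in filters:
        conv_total += 3 * 3 * channels * f + f   # kernel weights + biases
        size = pool_out(conv_out(size))          # 224 -> 111 -> 54 -> 26
        channels = f
    flat = size * size * channels                # Flatten output length
    dense = flat * dense_units + dense_units
    head = dense_units * num_classes + num_classes
    return {
        "flatten": flat,
        "conv": conv_total,
        "dense": dense,
        "total": conv_total + dense + head,
    }


stage1_params = count_params(num_classes=2)   # Stage-1 head (Abnormal vs Normal)
stage2_params = count_params(num_classes=3)   # a Stage-2 head with three subclasses
```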
---
### 6) Training Protocol
- **Stage-1:** Train on `Normal` vs `Abnormal` spectrograms.
- **Stage-2 Abnormal:** Train only on abnormal subclasses.
- **Stage-2 Normal:** Train only on normal subclasses.
- **Epochs:** `10` (baseline; tune as needed)
- **Batch size:** `32`
- **Pipelines:** `cache → (shuffle) → prefetch` with `tf.data.AUTOTUNE`
- **Checkpointing:** Save each head to `saved_models/*.h5` and class orders to `label_meta.json`.
Optional (recommended):
- **Augmentations:** time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
- **Class imbalance:** oversampling minority subclasses or focal loss in Stage-2 heads.
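
For reference, the focal loss mentioned above scales the cross-entropy of each sample by `(1 - p_t) ** gamma`, where `p_t` is the predicted probability of the true class, so confidently correct samples contribute less. The sketch below is a plain-Python, single-sample illustration of that formula; `gamma=2.0` and `alpha=1.0` are common illustrative values, not settings taken from this project, and a training run would need a batched equivalent inside the framework's loss API.

```python
import math


def focal_loss(probs, label, gamma=2.0, alpha=1.0, eps=1e-7):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs: softmax output of a Stage-2 head; label: index of the true class.
    With gamma=0 and alpha=1 this is ordinary cross-entropy.
    """
    p_t = min(max(probs[label], eps), 1.0 - eps)   # clip to keep log() finite
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.90, 0.05, 0.05]   # confident and correct for label 0
hard = [0.10, 0.60, 0.30]   # true class 0 receives little probability

# Ratio of focal loss to cross-entropy: how strongly each sample is down-weighted.
easy_weight = focal_loss(easy, 0) / focal_loss(easy, 0, gamma=0.0)
hard_weight = focal_loss(hard, 0) / focal_loss(hard, 0, gamma=0.0)
```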
---
### 7) Inference Flow (Hierarchical)
**Input:** `.wav` → Mel-spectrogram → `224×224`
1. **Stage-1:** `p_stage1 = f_stage1(img)` → `y1 = argmax(p_stage1)`, `class1 = class_names_stage1[y1]`
2. **Route:**
- If `class1 == "00 - Abnormal"` → use `abnormal_model`
- Else → use `normal_model`
3. **Stage-2:** `p_stage2 = f_head(img)` → `y2 = argmax(p_stage2)`, `class2 = class_names_stage2[y2]`
4. **Output:**
`final = f"{class1.split(' - ')[1]} → {class2}"`
plus confidences: `max(p_stage1)`, `max(p_stage2)`
**Pseudocode**
```python
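import numpy as np

def predict_hierarchical(wav):
    # to_mel_spectrogram / preprocess are the shared Section 3 helpers;
    # the class_names_* lists are the class orders stored in label_meta.json.
    spec = to_mel_spectrogram(wav)
    img = preprocess(spec)                   # 224x224, /255.0; assumed to include a batch axis
    p1 = np.asarray(stage1_model(img))[0]    # [2]
    y1 = int(np.argmax(p1))
    class1 = class_names_stage1[y1]
    # Route to the head that matches the Stage-1 decision.
    is_abnormal = class1 == "00 - Abnormal"
    head = abnormal_model if is_abnormal else normal_model
    class_names_stage2 = class_names_abnormal if is_abnormal else class_names_normal
    p2 = np.asarray(head(img))[0]            # [num_subclasses]
    y2 = int(np.argmax(p2))
    class2 = class_names_stage2[y2]
    return {
        "stage1_class": class1,
        "stage1_confidence": float(p1.max()),
        "stage2_class": class2,
        "stage2_confidence": float(p2.max()),
        "final_prediction": f"{class1.split(' - ')[1]} → {class2}",
    }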
```
---
### 8) Evaluation
- **Per-stage metrics:** accuracy, macro-F1, confusion matrices.
- **End-to-end metric:** hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.
- **Calibration:** reliability curves / ECE computed on the maximum softmax probability of Stage-1 and Stage-2; optionally apply temperature scaling.
- **Robustness checks:** background noise levels, recording device variance, different drum loads.
- **Leakage control:** ensure clips from the same recording session are in one split only.
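
The end-to-end and calibration metrics above can be computed as in the following sketch. It is an illustration under stated assumptions rather than the project's evaluation code: Stage-2 labels are compared as leaf class names (the two heads have separate index spaces), and ECE uses ten equal-width confidence bins, which is a common but not mandated choice.

```python
def hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred):
    """Fraction of samples where both the Stage-1 and Stage-2 predictions are correct."""
    pairs = zip(y1_true, y1_pred, y2_true, y2_pred)
    hits = sum(t1 == p1 and t2 == p2 for t1, p1, t2, p2 in pairs)
    return hits / len(y1_true)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over confidence bins.

    confidences: max softmax probability per sample; correct: 1/0 (or bool) per sample.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(ok for _, ok in members) / len(members)
        avg_conf = sum(conf for conf, _ in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece
```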
---
### 9) Deployment Considerations
- **App:** Gradio front-end calls the same spectrogram + inference pipeline.
- **Artifacts:** `saved_models/{stage1,abnormal,normal}.h5` + `saved_models/label_meta.json`
- **Reproducibility:** fixed audio/spectrogram params and consistent class order.
- **Latency:** spectrogram generation dominates; keep `n_fft`/`hop_length` fixed and consider caching frequent uploads.
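
Caching repeated uploads could look like the sketch below, which keys results by a SHA-256 hash of the uploaded bytes so that identical files skip spectrogram generation and inference. `PredictionCache`, `predict_fn`, and `max_items` are hypothetical names introduced for illustration; the source does not describe how (or whether) the app implements caching.

```python
import hashlib


class PredictionCache:
    """Small in-memory cache keyed by the SHA-256 digest of the uploaded audio bytes."""

    def __init__(self, predict_fn, max_items=256):
        self._predict_fn = predict_fn   # e.g. the full spectrogram + inference pipeline
        self._max_items = max_items
        self._store = {}                # dicts keep insertion order (Python 3.7+)

    def predict(self, wav_bytes):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key in self._store:
            return self._store[key]     # cache hit: no recomputation
        result = self._predict_fn(wav_bytes)
        if len(self._store) >= self._max_items:
            # Evict the oldest inserted entry (simple FIFO policy).
            self._store.pop(next(iter(self._store)))
        self._store[key] = result
        return result
```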
---
### 10) Limitations & Future Work
- **Domain shift:** different washers/rooms/mics can reduce accuracy → consider domain adaptation / augmentation.
- **Simple CNN:** a MobileNetV2/EfficientNet backbone may improve accuracy at comparable latency.
- **Sequence modeling:** incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).
- **On-device:** quantize models (TFLite) for edge deployment.