Spaces: Anvit25/new_audio


Create methodology.md

by mandarmgd-03 - opened Sep 29, 2025

base: refs/heads/main ← from: refs/pr/2


methodology.md +160 -0


	@@ -0,0 +1,160 @@

+# Methodology
+### 1) Problem Framing
+We treat washing-machine sound understanding as a **two-stage hierarchical image classification** task:
+1. **Stage-1 (Coarse):** Detect whether a sound is **Abnormal** or **Normal** from its Mel-spectrogram.
+2. **Stage-2 (Fine):** If **Abnormal**, classify the failure mode (e.g., *Bearing noise*, *Dehydration mode noise*). If **Normal**, classify the operating mode (e.g., *Wash*, *Spin*).
+This decouples anomaly detection from mode identification, so each Stage-2 head only has to separate the classes within its own branch rather than all classes at once.
+---
+### 2) Data & Labeling
+- **Source:** Short `.wav` recordings of washing-machine cycles (mono).
+- **Label Taxonomy:**
+00 - Abnormal/
+├─ Bearing noise/
+└─ Dehydration mode noise/
+01 - Normal/
+├─ Wash mode/
+└─ Spin mode/
+- **Granularity:** Each file is a single clip labeled at the folder level.
+> To avoid label leakage, clips from the **same physical machine / session** should not be split across train and validation sets (group-aware split).
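A group-aware split can be sketched as follows. This is a minimal illustration, not the project's actual split code: it assumes each clip can be mapped to a machine or session ID (the `session_of` mapping and the file names are hypothetical) and assigns whole sessions to either train or validation. scikit-learn's `GroupShuffleSplit` offers the same idea as a library call.

```python
import random
from collections import defaultdict

def group_split(clips, session_of, val_fraction=0.2, seed=42):
    """Split clips so that no session appears in both train and validation.

    clips: list of clip identifiers (e.g. file paths).
    session_of: dict mapping each clip to its machine/session ID (hypothetical).
    """
    by_session = defaultdict(list)
    for clip in clips:
        by_session[session_of[clip]].append(clip)

    sessions = sorted(by_session)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(sessions)

    # Hold out whole sessions until roughly val_fraction of the clips are covered.
    target = val_fraction * len(clips)
    train, val = [], []
    for session in sessions:
        bucket = val if len(val) < target else train
        bucket.extend(by_session[session])
    return train, val

# Hypothetical data: 5 machines with 4 clips each.
clips = [f"m{m}_clip{i}.wav" for m in range(5) for i in range(4)]
session_of = {clip: clip.split("_")[0] for clip in clips}
train, val = group_split(clips, session_of)
```

Because sessions are moved as a unit, the validation share only approximates `val_fraction` when sessions differ in size.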
+---
+### 3) Preprocessing → Mel-Spectrograms
+- **Audio params:** `sr=22050`, `n_fft=2048`, `hop_length=512`, `n_mels=128`
+- **Transform:**
+1. Load mono audio: \( y \in \mathbb{R}^{T} \)
+2. Mel power spectrogram: \( S = \text{MelSpec}(y; sr, n\_mels, n\_fft, hop) \)
+3. Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \)
+- **Rendering:** `librosa.display.specshow(S_db, cmap="magma")`, save to PNG, **no axes**, `224×224` target size.
+- **Normalization:** Divide pixel values by `255.0` at model input.
+All scripts use the same constants to ensure train/test consistency.
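To make step 3 and the resulting array size concrete, here is a small worked sketch in plain Python. It is illustrative only: the `amin` floor is an assumption added to avoid `log(0)`, and the frame count assumes centered framing (librosa's default), where a clip of `T` samples yields `1 + T // hop_length` frames.

```python
import math

SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128  # constants from this section

def num_frames(n_samples, hop_length=HOP_LENGTH):
    """Frame count under centered framing: 1 + floor(T / hop_length)."""
    return 1 + n_samples // hop_length

def power_to_db_rel_max(S, amin=1e-10):
    """S_dB = 10 * log10(S / max(S)), with powers floored at amin (assumption)."""
    peak = max(amin, max(max(row) for row in S))
    return [[10.0 * math.log10(max(v, amin) / peak) for v in row] for row in S]

# A 5-second clip at 22050 Hz gives a 128 x 216 Mel-spectrogram before rendering.
frames = num_frames(5 * SR)
shape = (N_MELS, frames)

# Toy power values: the peak maps to 0 dB, a tenth of the peak to -10 dB,
# and a hundredth of the peak to -20 dB.
S_db = power_to_db_rel_max([[1.0, 0.1], [0.01, 1.0]])
```

Because the reference is each clip's own maximum, every spectrogram peaks at 0 dB, so absolute loudness differences between clips are discarded.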
+---
+### 4) Dataset Construction
+- **Stage-1 dataset:** `MelSpectrograms/` with the two top-level folders (`00 - Abnormal`, `01 - Normal`).
+- **Stage-2 datasets:**
+- **Abnormal head:** `MelSpectrograms/00 - Abnormal/*`
+- **Normal head:** `MelSpectrograms/01 - Normal/*`
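+- **Splits:** `validation_split=0.2`, `seed=42` via `image_dataset_from_directory`. This split is made per file, not per machine or session, so the group-aware split from Section 2 has to be applied to the folders beforehand.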
+- **Class Order:** Persisted in `saved_models/label_meta.json` to guarantee consistent label ↔ index mapping at inference.
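Persisting the class order could look like the sketch below. The JSON layout (keys `stage1`, `abnormal`, `normal`) and the function names are hypothetical, chosen for illustration; the text only states that class orders live in `saved_models/label_meta.json`. The lists are sorted because `image_dataset_from_directory` assigns indices to class folders in alphanumeric order.

```python
import json
from pathlib import Path

def save_label_meta(path, stage1, abnormal, normal):
    """Write class orders to JSON (hypothetical schema) and return the dict."""
    meta = {
        "stage1": sorted(stage1),
        "abnormal": sorted(abnormal),
        "normal": sorted(normal),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta

def load_label_meta(path):
    """Read the class orders back; list position is the model's output index."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

At inference, `load_label_meta(...)["stage1"][index]` then maps a Stage-1 argmax back to its folder name.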
+---
+### 5) Models & Architecture
+Both stages use a compact CNN to keep inference light:
+- **Backbone (per head):**
+- `Conv2D(32, 3×3) → ReLU → MaxPool(2×2)`
+- `Conv2D(64, 3×3) → ReLU → MaxPool(2×2)`
+- `Conv2D(128, 3×3) → ReLU → MaxPool(2×2)`
+- `Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax`
+- **Input:** `224×224×3` spectrogram images
+- **Loss:** `SparseCategoricalCrossentropy`
+- **Optimizer:** `Adam`
+- **Metrics:** `Accuracy`
+> Rationale: A simple CNN is sufficient for a strong baseline; the hierarchy offloads fine-grained distinctions to specialized heads.
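For a sense of model size, the layer shapes and parameter counts can be worked out by hand. The sketch below assumes Keras defaults that the text does not state (stride 1 and `valid` padding for `Conv2D`, stride 2 for `MaxPool`), so the numbers change if the real model uses `same` padding.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a Dense layer."""
    return n_in * n_out + n_out

def cnn_summary(input_size=224, channels=3, filters=(32, 64, 128),
                hidden=128, num_classes=2):
    """Return (final spatial size, flattened length, total parameters)."""
    size, c_in, total = input_size, channels, 0
    for c_out in filters:
        total += conv_params(3, c_in, c_out)
        size = (size - 2) // 2  # 3x3 valid conv shrinks by 2, 2x2 pool halves
        c_in = c_out
    flat = size * size * c_in
    total += dense_params(flat, hidden) + dense_params(hidden, num_classes)
    return size, flat, total

size, flat, total = cnn_summary()
```

Under these assumptions the final feature map is 26×26×128, the flattened vector has 86,528 entries, and a two-class head has about 11.2 million parameters, nearly all of them in the `Dense(128)` layer.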
+---
+### 6) Training Protocol
+- **Stage-1:** Train on `Normal` vs `Abnormal` spectrograms.
+- **Stage-2 Abnormal:** Train only on abnormal subclasses.
+- **Stage-2 Normal:** Train only on normal subclasses.
+- **Epochs:** `10` (baseline; tune as needed)
+- **Batch size:** `32`
+- **Pipelines:** `cache → (shuffle) → prefetch` with `tf.data.AUTOTUNE`
+- **Checkpointing:** Save each head to `saved_models/*.h5` and class orders to `label_meta.json`.
+Optional (recommended):
+- **Augmentations:** time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
+- **Class imbalance:** oversampling minority subclasses or focal loss in Stage-2 heads.
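As a reference for the focal-loss option, the per-sample loss is \( -(1 - p_t)^{\gamma} \log p_t \), where \( p_t \) is the predicted probability of the true class. The sketch below is a plain-Python illustration with hypothetical names; `gamma=2.0` is a commonly used default, not a value taken from this project.

```python
import math

def focal_loss(probs, true_index, gamma=2.0, eps=1e-12):
    """Focal loss for one sample given softmax probabilities."""
    p_t = min(max(probs[true_index], eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(probs, true_index, eps=1e-12):
    """Ordinary cross-entropy is the gamma = 0 special case."""
    return focal_loss(probs, true_index, gamma=0.0, eps=eps)

# A confidently correct sample is down-weighted far more than a misclassified one.
easy = focal_loss([0.9, 0.1], true_index=0)
hard = focal_loss([0.1, 0.9], true_index=0)
```

With `gamma=0` this reduces to ordinary cross-entropy; larger `gamma` shrinks the contribution of samples the model already classifies confidently, which shifts training effort toward hard or minority-class clips.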
+---
+### 7) Inference Flow (Hierarchical)
+**Input:** `.wav` → Mel-spectrogram → `224×224`
+1. **Stage-1:** `p_stage1 = f_stage1(img)` → `y1 = class_names_stage1[argmax(p_stage1)]`
+2. **Route:**
+ - If `y1 == "00 - Abnormal"` → use `abnormal_model`
+ - Else → use `normal_model`
+3. **Stage-2:** `p_stage2 = f_head(img)` → `y2 = class_names_stage2[argmax(p_stage2)]`, where `class_names_stage2` is the class list of the selected head
+4. **Output:**
+ `final = f"{y1.split(' - ')[1]} → {y2}"`
+ plus confidences: `max(p_stage1)`, `max(p_stage2)`
+**Pseudocode**
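+```python
+import numpy as np
+
+def predict(wav):
+    # Helpers, models, and class-name lists are assumed to be loaded already;
+    # class_names_abnormal / class_names_normal are illustrative names.
+    spec = to_mel_spectrogram(wav)
+    img = preprocess(spec)                     # 224x224, /255.0, batch of one
+    p1 = np.asarray(stage1_model(img))[0]      # [2]
+    y1 = int(np.argmax(p1))
+    stage1_class = class_names_stage1[y1]
+    # Route on the class name so the result does not depend on index order.
+    is_abnormal = stage1_class == "00 - Abnormal"
+    head = abnormal_model if is_abnormal else normal_model
+    class_names_stage2 = class_names_abnormal if is_abnormal else class_names_normal
+    p2 = np.asarray(head(img))[0]              # [num_subclasses]
+    y2 = int(np.argmax(p2))
+    stage2_class = class_names_stage2[y2]
+    return {
+        "stage1_class": stage1_class,
+        "stage1_confidence": float(p1.max()),
+        "stage2_class": stage2_class,
+        "stage2_confidence": float(p2.max()),
+        "final_prediction": f"{stage1_class.split(' - ')[1]} → {stage2_class}",
+    }
+```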
+---
+### 8) Evaluation
+- **Per-stage metrics:** accuracy, macro-F1, confusion matrices.
+- **End-to-end metric:** hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.
+- **Calibration:** reliability curves / ECE on the max-softmax confidence for Stage-1 and Stage-2; optionally apply temperature scaling.
+- **Robustness checks:** background noise levels, recording device variance, different drum loads.
+- **Leakage control:** ensure clips from the same recording session are in one split only.
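The end-to-end and calibration metrics can be sketched in plain Python. This is an illustrative implementation, not the project's evaluation script: the function names are hypothetical, and the ECE uses 10 equal-width confidence bins, which is a common but arbitrary choice.

```python
def hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred):
    """Fraction of samples where both the Stage-1 and Stage-2 labels are correct."""
    both = [a == b and c == d
            for a, b, c, d in zip(y1_true, y1_pred, y2_true, y2_pred)]
    return sum(both) / len(both)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Toy example: sample 2 fails at Stage-1, sample 4 fails at Stage-2.
acc = hierarchical_accuracy(
    ["Abnormal", "Abnormal", "Normal", "Normal"],
    ["Abnormal", "Normal", "Normal", "Normal"],
    ["Bearing noise", "Dehydration mode noise", "Wash mode", "Spin mode"],
    ["Bearing noise", "Dehydration mode noise", "Wash mode", "Wash mode"],
)
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A Stage-1 mistake counts the sample as wrong regardless of what the Stage-2 head outputs, so hierarchical accuracy can never exceed Stage-1 accuracy.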
+---
+### 9) Deployment Considerations
+- **App:** Gradio front-end calls the same spectrogram + inference pipeline.
+- **Artifacts:** `saved_models/{stage1,abnormal,normal}.h5` + `saved_models/label_meta.json`
+- **Reproducibility:** fixed audio/spectrogram params and consistent class order.
+- **Latency:** spectrogram generation dominates; keep `n_fft`/`hop_length` fixed and consider caching frequent uploads.
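Caching repeated uploads could be done by keying results on a hash of the file contents. The sketch below is a minimal in-memory version under stated assumptions: `predict_fn` stands in for the full spectrogram + inference pipeline, the class name is hypothetical, and the cache is unbounded, which a real deployment would likely want to cap.

```python
import hashlib

class PredictionCache:
    """Memoize predictions by the SHA-256 digest of the uploaded bytes."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn  # hypothetical: bytes -> prediction dict
        self._results = {}
        self.misses = 0                # how many times the pipeline actually ran

    def predict(self, wav_bytes):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._predict_fn(wav_bytes)
        return self._results[key]
```

Hashing the bytes rather than the file name means a re-upload of the same recording under a different name still hits the cache.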
+---
+### 10) Limitations & Future Work
+- **Domain shift:** different washers/rooms/mics can reduce accuracy → consider domain adaptation / augmentation.
+- **Simple CNN:** replacing it with MobileNetV2/EfficientNet may improve accuracy; the latency impact should be measured rather than assumed.
+- **Sequence modeling:** incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).
+- **On-device:** quantize models (TFLite) for edge deployment.