Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology

1) Problem Framing

We treat washing-machine sound understanding as a two-stage hierarchical image classification task:

  1. Stage-1 (Coarse): Detect whether a sound is Abnormal or Normal from its Mel-spectrogram.
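  2. Stage-2 (Fine): If Abnormal, classify the abnormal subclass (e.g., Dehydration mode noise, Wash mode noise). If Normal, classify the operating mode (e.g., Wash, Dehydration).

This decouples anomaly detection from mode identification. Because the same subclass names (e.g., Wash mode noise) appear under both branches, each Stage-2 head only has to separate classes within its own branch, which is intended to reduce class confusion.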


2) Data & Labeling

  • Source: Short .wav recordings of washing-machine cycles (mono).
  • Label Taxonomy:
00 - Abnormal/
├─ 00-1 - Background noise/
├─ 00-2 - Dehydration mode noise/
└─ 00-3 - Wash mode noise/

01 - Normal/
├─ 01-1 - Background noise/
├─ 01-2 - Dehydration mode noise/
└─ 01-3 - Wash mode noise/
  • Granularity: Each file is a single clip labeled at the folder level.

To avoid label leakage, clips from the same physical machine / session should not be split across train and validation sets (group-aware split).
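A minimal sketch of a group-aware split, assuming each clip can be mapped to a session identifier. The `session_of` rule (the filename prefix before the first underscore) and the example filenames are hypothetical; the source does not say how machines or sessions are encoded.

```python
import random
from collections import defaultdict


def session_of(filename: str) -> str:
    """Hypothetical rule: the session ID is the prefix before the first underscore."""
    return filename.split("_", 1)[0]


def group_split(filenames, val_fraction=0.2, seed=42):
    """Assign whole sessions to train or validation so no session spans both."""
    groups = defaultdict(list)
    for name in filenames:
        groups[session_of(name)].append(name)

    session_ids = sorted(groups)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(session_ids)

    n_val = max(1, round(len(session_ids) * val_fraction))
    val_ids = set(session_ids[:n_val])

    train = [f for s in session_ids if s not in val_ids for f in groups[s]]
    val = [f for s in session_ids if s in val_ids for f in groups[s]]
    return train, val


# Hypothetical filenames: 5 sessions with 4 clips each.
files = [f"m{m}_clip{i}.wav" for m in range(5) for i in range(4)]
train_files, val_files = group_split(files)
```

Note that `val_fraction` here applies to the number of sessions, not the number of clips, so the clip-level ratio can drift when sessions differ in size.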


3) Preprocessing → Mel-Spectrograms

  • Audio params: sr=22050, n_fft=2048, hop_length=512, n_mels=128
  • Transform:
    1. Load mono audio: \( y \in \mathbb{R}^{T} \)
    2. Mel power spectrogram: \( S = \text{MelSpec}(y;\ \text{sr},\ \text{n\_mels},\ \text{n\_fft},\ \text{hop\_length}) \)
    3. Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \)
  • Rendering: librosa.display.specshow(S_db, cmap="magma"), save to PNG, no axes, 224×224 target size.
  • Normalization: Divide pixel values by 255.0 at model input.

All scripts use the same constants to ensure train/test consistency.
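The transform steps can be sketched as follows. The NumPy helper implements the log-scaling formula directly; the `amin` floor is an assumption added to avoid `log(0)` and is not stated in the source. `librosa.power_to_db(S, ref=np.max)` is the library equivalent, which additionally clips the range to 80 dB by default. The function names are illustrative only.

```python
import numpy as np

SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128  # constants from the text


def power_to_db_max_ref(S, amin=1e-10):
    """Compute 10 * log10(S / max(S)); amin is an assumed floor to avoid log(0)."""
    S = np.maximum(np.asarray(S, dtype=np.float64), amin)
    return 10.0 * np.log10(S / S.max())


def wav_to_log_mel(path):
    """Load a mono clip and return its log-Mel spectrogram, shape (N_MELS, frames)."""
    import librosa  # imported lazily so the NumPy helper works without librosa

    y, sr = librosa.load(path, sr=SR, mono=True)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return power_to_db_max_ref(S)


# Small check of the dB scaling on a toy power matrix.
demo = power_to_db_max_ref(np.array([[1.0, 0.1], [0.01, 0.001]]))
```

With this reference, the loudest bin maps to 0 dB and every factor of 10 in power lowers the value by 10 dB.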


4) Dataset Construction

  • Stage-1 dataset: MelSpectrograms/ with the two top-level folders (00 - Abnormal, 01 - Normal).
  • Stage-2 datasets:
    • Abnormal head: MelSpectrograms/00 - Abnormal/*
    • Normal head: MelSpectrograms/01 - Normal/*
  • Splits: validation_split=0.2, seed=42 via image_dataset_from_directory. This is a random per-image split; on its own it does not enforce the group-aware split described in Section 2.
  • Class Order: Persisted in saved_models/label_meta.json to guarantee consistent label ↔ index mapping at inference.
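A minimal sketch of persisting the class order, using only the standard library. The JSON layout (keys `stage1`, `abnormal`, `normal`) is an assumption; the source names the file but not its schema. In a Keras pipeline the lists would typically come from the `class_names` attribute of the dataset returned by `image_dataset_from_directory`.

```python
import json
import tempfile
from pathlib import Path


def save_label_meta(path, stage1, abnormal, normal):
    """Write the class order of each model; list position is the class index."""
    meta = {"stage1": list(stage1), "abnormal": list(abnormal), "normal": list(normal)}
    Path(path).write_text(json.dumps(meta, indent=2), encoding="utf-8")


def load_label_meta(path):
    """Read the class orders back as a dict of lists."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


subclasses = ["Background noise", "Dehydration mode noise", "Wash mode noise"]
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "label_meta.json"  # stands in for saved_models/label_meta.json
    save_label_meta(
        meta_path,
        stage1=["00 - Abnormal", "01 - Normal"],
        abnormal=[f"00-{i} - {name}" for i, name in enumerate(subclasses, start=1)],
        normal=[f"01-{i} - {name}" for i, name in enumerate(subclasses, start=1)],
    )
    meta = load_label_meta(meta_path)
```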

5) Models & Architecture

Both stages use a compact CNN to keep inference light:

  • Backbone (per head):
    • Conv2D(32, 3×3) → ReLU → MaxPool(2×2)
    • Conv2D(64, 3×3) → ReLU → MaxPool(2×2)
    • Conv2D(128, 3×3) → ReLU → MaxPool(2×2)
    • Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax
  • Input: 224×224×3 spectrogram images
  • Loss: SparseCategoricalCrossentropy
  • Optimizer: Adam
  • Metrics: Accuracy

Rationale: A simple CNN serves as a lightweight baseline; the hierarchy offloads fine-grained distinctions to specialized heads.
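The layer list above could be written in Keras roughly as follows. This is a sketch, not the project's training script: padding and strides are left at the Keras defaults (`"valid"` convolutions, pool stride equal to pool size), which the source does not specify, and `build_head` and `flatten_size` are illustrative names. The second helper traces the spatial size by hand to show where most parameters sit.

```python
def build_head(num_classes, input_shape=(224, 224, 3)):
    """Compact CNN from the layer list; padding and strides are Keras defaults (assumed)."""
    import tensorflow as tf  # lazy import: the shape helper below needs no TensorFlow
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model


def flatten_size(side=224, filters=(32, 64, 128)):
    """Feature count entering Dense(128) with 'valid' 3x3 convs and 2x2 pooling."""
    for _ in filters:
        side = (side - 2) // 2  # the conv trims 2 pixels, pooling halves (floor)
    return side * side * filters[-1]


features = flatten_size()            # 224 -> 111 -> 54 -> 26, then 26 * 26 * 128
dense_params = features * 128 + 128  # weights plus biases of Dense(128)
```

Under these assumptions the Flatten → Dense(128) connection holds about 11.1 million parameters, far more than the three convolution layers combined, which is worth keeping in mind if model size matters.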


6) Training Protocol

  • Stage-1: Train on Normal vs Abnormal spectrograms.
  • Stage-2 Abnormal: Train only on abnormal subclasses.
  • Stage-2 Normal: Train only on normal subclasses.
  • Epochs: 10 (baseline; tune as needed)
  • Batch size: 32
  • Pipelines: cache → (shuffle) → prefetch with tf.data.AUTOTUNE
  • Checkpointing: Save each head to saved_models/*.h5 and class orders to label_meta.json.

Optional (recommended):

  • Augmentations: time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
  • Class imbalance: oversampling minority subclasses or focal loss in Stage-2 heads.
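The spectrogram augmentations can be sketched in NumPy as below, assuming they are applied to the log-Mel array before it is rendered to an image. The mask widths, the noise level, and the choice to fill masked regions with the array minimum are illustrative assumptions, not values from the source.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width=16, max_time_width=24, rng=None):
    """Return a copy with one random frequency band and one time band set to the minimum."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=np.float32)  # copy, so the input is left untouched
    n_mels, n_frames = out.shape
    fill = out.min()

    # Frequency masking: blank a random band of Mel bins.
    f_width = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f_start = int(rng.integers(0, n_mels - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time masking: blank a random band of frames.
    t_width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out


def add_gaussian_noise(spec, std=1.0, rng=None):
    """Add zero-mean Gaussian noise (std in dB units) to a log-Mel spectrogram."""
    rng = np.random.default_rng() if rng is None else rng
    spec = np.asarray(spec, dtype=np.float32)
    return spec + rng.normal(0.0, std, size=spec.shape).astype(np.float32)


rng = np.random.default_rng(0)
log_mel = rng.uniform(-80.0, 0.0, size=(128, 216)).astype(np.float32)  # synthetic input
augmented = add_gaussian_noise(mask_spectrogram(log_mel, rng=rng), std=1.0, rng=rng)
```

Augmentations of this kind are normally applied to the training split only, so that validation metrics reflect unmodified inputs.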

7) Inference Flow (Hierarchical)

Input: .wav → Mel-spectrogram → 224×224

  1. Stage-1: p_stage1 = f_stage1(img) → y1 = argmax(p_stage1)

  2. Route:
    • If y1 == "00 - Abnormal" → use abnormal_model
    • Else → use normal_model

  3. Stage-2: p_stage2 = f_head(img) → y2 = argmax(p_stage2)

  4. Output:
    final = f"{y1.split(' - ')[1]} → {y2}"
    plus confidences: max(p_stage1), max(p_stage2)

Here y1 and y2 denote the predicted class names, i.e. the argmax index mapped through the stored class order.

Pseudocode

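import numpy as np

def predict(wav):
    # to_mel_spectrogram, preprocess, the three models, and the class-name lists
    # (loaded from label_meta.json) are assumed to be defined elsewhere.
    spec = to_mel_spectrogram(wav)
    img = preprocess(spec)                     # 224x224x3, scaled by 1/255.0

    p1 = stage1_model(img)                     # [2]
    y1 = int(np.argmax(p1))
    stage1_class = class_names_stage1[y1]      # e.g. "00 - Abnormal"

    # Route to the head that matches the Stage-1 decision.
    if stage1_class == "00 - Abnormal":
        head, class_names_stage2 = abnormal_model, class_names_abnormal
    else:
        head, class_names_stage2 = normal_model, class_names_normal

    p2 = head(img)                             # [num_subclasses]
    y2 = int(np.argmax(p2))
    stage2_class = class_names_stage2[y2]

    return {
        "stage1_class": stage1_class,
        "stage1_confidence": float(np.max(p1)),
        "stage2_class": stage2_class,
        "stage2_confidence": float(np.max(p2)),
        "final_prediction": f"{stage1_class.split(' - ')[1]} → {stage2_class}",
    }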
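The routing logic can be exercised end to end without trained weights. In the sketch below the models are stand-in callables that return fixed probabilities; `predict_hierarchical`, the dictionary-based routing, and the stub outputs are illustrative assumptions rather than the project's actual code.

```python
import numpy as np


def predict_hierarchical(img, stage1_model, heads, stage1_names, head_names):
    """Route one preprocessed image through Stage-1, then the matching Stage-2 head.

    Models are any callables mapping a batch (1, H, W, C) to probabilities (1, K).
    """
    batch = np.asarray(img, dtype=np.float32)[np.newaxis, ...]

    p1 = np.asarray(stage1_model(batch))[0]
    stage1_class = stage1_names[int(np.argmax(p1))]

    p2 = np.asarray(heads[stage1_class](batch))[0]
    stage2_class = head_names[stage1_class][int(np.argmax(p2))]

    return {
        "stage1_class": stage1_class,
        "stage1_confidence": float(p1.max()),
        "stage2_class": stage2_class,
        "stage2_confidence": float(p2.max()),
        "final_prediction": f"{stage1_class.split(' - ')[1]} → {stage2_class}",
    }


# Stand-in models with fixed outputs, used only to exercise the routing.
subclasses = ["Background noise", "Dehydration mode noise", "Wash mode noise"]
stage1_names = ["00 - Abnormal", "01 - Normal"]
head_names = {
    "00 - Abnormal": [f"00-{i} - {n}" for i, n in enumerate(subclasses, start=1)],
    "01 - Normal": [f"01-{i} - {n}" for i, n in enumerate(subclasses, start=1)],
}
heads = {
    "00 - Abnormal": lambda x: np.array([[0.1, 0.7, 0.2]]),
    "01 - Normal": lambda x: np.array([[0.6, 0.3, 0.1]]),
}
result = predict_hierarchical(
    np.zeros((224, 224, 3)),
    lambda x: np.array([[0.8, 0.2]]),
    heads,
    stage1_names,
    head_names,
)
```

Keying both the heads and their class lists by the Stage-1 class name keeps each head paired with its own label order, so a Stage-2 index is never looked up in the wrong list.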

8) Evaluation

  • Per-stage metrics: accuracy, macro-F1, confusion matrices.

  • End-to-end metric: hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.

  • Calibration: reliability curves / ECE on max_softmax for Stage-1 and Stage-2; optionally apply temperature scaling.

  • Robustness checks: background noise levels, recording device variance, different drum loads.

  • Leakage control: ensure clips from the same recording session are in one split only.
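Two of the metrics above can be sketched in a few lines. Hierarchical accuracy compares (Stage-1, Stage-2) label pairs, and ECE is computed here with equal-width confidence bins; the bin count of 10 and the toy labels are assumptions for illustration.

```python
import numpy as np


def hierarchical_accuracy(true_pairs, pred_pairs):
    """Fraction of samples whose (stage1, stage2) labels are both predicted correctly."""
    hits = [t == p for t, p in zip(true_pairs, pred_pairs)]
    return sum(hits) / len(hits)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the share of samples in the bin
    return float(ece)


# Toy example: the second sample has the right Stage-2 name but the wrong Stage-1 branch.
true_pairs = [("Abnormal", "Wash"), ("Normal", "Wash"),
              ("Normal", "Background"), ("Abnormal", "Dehydration")]
pred_pairs = [("Abnormal", "Wash"), ("Abnormal", "Wash"),
              ("Normal", "Background"), ("Abnormal", "Background")]
h_acc = hierarchical_accuracy(true_pairs, pred_pairs)

ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

Comparing pairs rather than Stage-2 names alone matters because the same subclass name exists under both branches; a Stage-1 error counts as a miss even when the Stage-2 name happens to match.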

9) Deployment Considerations

  • App: Gradio front-end calls the same spectrogram + inference pipeline.

  • Artifacts: saved_models/{stage1,abnormal,normal}.h5 + saved_models/label_meta.json

  • Reproducibility: fixed audio/spectrogram params and consistent class order.

  • Latency: spectrogram generation dominates; keep n_fft/hop_length fixed and consider caching frequent uploads.
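One way to cache frequent uploads is to key the spectrogram by a hash of the file contents. The sketch below is a hypothetical in-memory cache, not part of the described app; `SpectrogramCache` and the stand-in `compute` function are illustrative names.

```python
import hashlib


class SpectrogramCache:
    """In-memory cache keyed by the SHA-256 digest of the uploaded file's bytes."""

    def __init__(self, compute):
        self._compute = compute  # function: bytes -> spectrogram (any object)
        self._store = {}
        self.misses = 0

    def get(self, audio_bytes: bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(audio_bytes)
        return self._store[key]


# Stand-in for the real spectrogram step: returns the byte length.
cache = SpectrogramCache(compute=lambda b: len(b))
first = cache.get(b"same clip")
second = cache.get(b"same clip")      # served from the cache
third = cache.get(b"another clip")    # different content, computed again
```

This cache grows without bound; a long-running app would need a size limit or an eviction policy.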

10) Limitations & Future Work

  • Domain shift: different washers/rooms/mics can reduce accuracy → consider domain adaptation / augmentation.

  • Simple CNN: consider replacing it with MobileNetV2/EfficientNet, which may improve accuracy; latency should be re-measured after the swap.

  • Sequence modeling: incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).

  • On-device: quantize models (TFLite) for edge deployment.