Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology
1) Problem Framing
We treat washing-machine sound understanding as a two-stage hierarchical image classification task:
- Stage-1 (Coarse): Detect whether a sound is Abnormal or Normal from its Mel-spectrogram.
- Stage-2 (Fine): If Abnormal, classify the failure mode (e.g., Bearing noise, Dehydration mode noise). If Normal, classify the operating mode (e.g., Wash, Spin).
This decouples anomaly detection from mode identification and reduces class confusion.
2) Data & Labeling
- Source: Short .wav recordings of washing-machine cycles (mono).
- Label Taxonomy:
00-Abnormal/
├── 00-1 - Background noise/
├── 00-2 - Dehydration mode noise/
└── 00-3 - Wash mode noise/
01-Normal/
├── 01-1 - Background noise/
├── 01-2 - Dehydration mode noise/
└── 01-3 - Wash mode noise/
- Granularity: Each file is a single clip labeled at the folder level.
To avoid label leakage, clips from the same physical machine / session should not be split across train and validation sets (group-aware split).
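A group-aware split can be sketched as follows. This is a minimal illustration, not the project's actual script: the file-naming scheme (`<machine>_<session>_<clip>.wav`) and the helper names `group_split` and `session_of` are assumptions made for the example. Note that `val_fraction` applies to groups here, not to individual clips.

```python
import random
from collections import defaultdict


def group_split(items, group_of, val_fraction=0.2, seed=42):
    """Split items so that every group lands entirely in train or validation."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    group_ids = sorted(groups)                 # deterministic starting order
    random.Random(seed).shuffle(group_ids)     # seeded shuffle of the groups
    n_val = max(1, round(len(group_ids) * val_fraction))
    val_ids = set(group_ids[:n_val])

    train = [x for g in group_ids if g not in val_ids for x in groups[g]]
    val = [x for g in group_ids if g in val_ids for x in groups[g]]
    return train, val


def session_of(filename):
    """Hypothetical convention: '<machine>_<session>_<clip>.wav'."""
    return "_".join(filename.split("_")[:2])


# 5 machines x 2 sessions x 3 clips = 30 clips in 10 groups (made-up names).
clips = [f"m{m}_s{s}_{c}.wav" for m in range(5) for s in range(2) for c in range(3)]
train, val = group_split(clips, session_of)
```

Because whole sessions are assigned to one side, the validation share of clips only approximates `val_fraction` when sessions differ in size.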
3) Preprocessing: Mel-Spectrograms
- Audio params: sr=22050, n_fft=2048, hop_length=512, n_mels=128
- Transform:
  - Load mono audio: \( y \in \mathbb{R}^{T} \)
  - Mel power spectrogram: \( S = \text{MelSpec}(y;\ \text{sr}, \text{n\_mels}, \text{n\_fft}, \text{hop\_length}) \)
  - Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \)
- Rendering: librosa.display.specshow(S_db, cmap="magma"), save to PNG, no axes, 224×224 target size.
- Normalization: Divide pixel values by 255.0 at model input.
All scripts use the same constants to ensure train/test consistency.
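The transform above might look like this in code. It is a sketch under the stated parameters: `power_to_db_rel_max` and `wav_to_mel_db` are hypothetical helper names, and the small floor `amin` is an added assumption to avoid taking the log of zero. `librosa.power_to_db(S, ref=np.max)` computes the same quantity and, by default, additionally clips values more than 80 dB below the peak.

```python
import numpy as np

# Shared constants, matching the audio parameters listed above.
SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128


def power_to_db_rel_max(S, amin=1e-10):
    """Compute 10 * log10(S / max(S)); amin floors S to avoid log(0)."""
    S = np.maximum(np.asarray(S, dtype=np.float64), amin)
    return 10.0 * np.log10(S / S.max())


def wav_to_mel_db(path):
    """Load a mono clip and return its log-scaled Mel power spectrogram."""
    import librosa  # imported lazily so the helper above works without librosa

    y, sr = librosa.load(path, sr=SR, mono=True)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return power_to_db_rel_max(S)


# Tiny check of the dB formula on a hand-made power matrix.
demo_db = power_to_db_rel_max([[1.0, 0.1], [0.01, 1.0]])
```

The peak of every spectrogram maps to 0 dB, so the rendered images are normalized per clip rather than on an absolute loudness scale.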
4) Dataset Construction
- Stage-1 dataset: MelSpectrograms/ with the two top-level folders (00 - Abnormal, 01 - Normal).
- Stage-2 datasets:
  - Abnormal head: MelSpectrograms/00 - Abnormal/*
  - Normal head: MelSpectrograms/01 - Normal/*
- Splits: validation_split=0.2, seed=42 via image_dataset_from_directory.
- Class Order: Persisted in saved_models/label_meta.json to guarantee consistent label-to-index mapping at inference.
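Persisting the class order could be as simple as the sketch below. The JSON layout (keys `stage1`, `abnormal`, `normal`) and the helper names are assumptions for illustration; the source only states that class orders are stored in `label_meta.json`. In a real run the lists would typically come from the `class_names` attribute of the datasets returned by `image_dataset_from_directory`.

```python
import json
import tempfile
from pathlib import Path


def save_label_meta(path, class_orders):
    """Write the per-model class order so indices can be decoded later."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(class_orders, indent=2), encoding="utf-8")


def load_label_meta(path):
    """Read the class orders back; list position == model output index."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Assumed layout: one ordered list of folder names per model.
class_orders = {
    "stage1": ["00 - Abnormal", "01 - Normal"],
    "abnormal": [
        "00-1 - Background noise",
        "00-2 - Dehydration mode noise",
        "00-3 - Wash mode noise",
    ],
    "normal": [
        "01-1 - Background noise",
        "01-2 - Dehydration mode noise",
        "01-3 - Wash mode noise",
    ],
}

# Round-trip through a temporary directory for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "saved_models" / "label_meta.json"
    save_label_meta(meta_path, class_orders)
    restored = load_label_meta(meta_path)

index_of = {name: i for i, name in enumerate(restored["stage1"])}
```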
5) Models & Architecture
Both stages use a compact CNN to keep inference light:
- Backbone (per head):
  - Conv2D(32, 3×3) → ReLU → MaxPool(2×2)
  - Conv2D(64, 3×3) → ReLU → MaxPool(2×2)
  - Conv2D(128, 3×3) → ReLU → MaxPool(2×2)
  - Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax
- Input: 224×224×3 spectrogram images
- Loss: SparseCategoricalCrossentropy
- Optimizer: Adam
- Metrics: Accuracy
Rationale: A simple CNN is sufficient for a strong baseline; the hierarchy offloads fine-grained distinctions to specialized heads.
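For reference, the layer stack described above could be written in Keras roughly as follows. This is a sketch rather than the project's training script: the function name `build_head` is hypothetical, and the padding mode is not stated in the source, so Keras's default (`valid`) is assumed.

```python
import tensorflow as tf


def build_head(num_classes, input_shape=(224, 224, 3)):
    """Compact CNN used for each model (Stage-1 and both Stage-2 heads)."""
    layers = tf.keras.layers
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),   # padding assumed: "valid"
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model


stage1_model = build_head(num_classes=2)  # Abnormal vs Normal
```

Under the `valid` padding assumption, the three conv/pool stages reduce 224×224 to 26×26, so the flattened vector has 26·26·128 = 86,528 features and the Dense(128) layer holds about 11.1 M of the model's parameters. Global average pooling in place of Flatten would be one way to shrink the model if size matters.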
6) Training Protocol
- Stage-1: Train on Normal vs Abnormal spectrograms.
- Stage-2 Abnormal: Train only on abnormal subclasses.
- Stage-2 Normal: Train only on normal subclasses.
- Epochs: 10 (baseline; tune as needed)
- Batch size: 32
- Pipelines: cache → (shuffle) → prefetch with tf.data.AUTOTUNE
- Checkpointing: Save each head to saved_models/*.h5 and class orders to label_meta.json.
Optional (recommended):
- Augmentations: time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
- Class imbalance: oversampling minority subclasses or focal loss in Stage-2 heads.
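As one possible form of the masking augmentations listed above, the sketch below blanks a random frequency band and a random time span in a dB-scaled spectrogram array (in the spirit of SpecAugment). The function name, the default mask widths, and the choice to apply it to the array before rendering are all assumptions, not the project's settings.

```python
import numpy as np


def mask_spectrogram(S_db, max_freq_width=16, max_time_width=32, rng=None):
    """Return a copy of S_db (n_mels x n_frames) with one frequency band and
    one time span overwritten by the spectrogram's minimum value."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(S_db, dtype=float, copy=True)
    n_mels, n_frames = out.shape
    fill = out.min()

    # Frequency mask: f consecutive Mel bins starting at f0.
    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: t consecutive frames starting at t0.
    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill
    return out


# Synthetic stand-in for a dB spectrogram: 128 Mel bins x 87 frames
# (roughly 2 s of audio at sr=22050, hop_length=512).
rng = np.random.default_rng(0)
S_db = -80.0 * rng.random((128, 87))
augmented = mask_spectrogram(S_db, rng=rng)
```

Masks should be applied to training data only, so that validation metrics reflect unmodified inputs.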
7) Inference Flow (Hierarchical)
Input: .wav → Mel-spectrogram → 224×224
Stage-1: p_stage1 = f_stage1(img) → y1 = argmax(p_stage1)
Route (y1 and y2 denote the predicted class names):
- If y1 == "00 - Abnormal" → use abnormal_model
- Else → use normal_model
Stage-2: p_stage2 = f_head(img) → y2 = argmax(p_stage2)
Output: final = f"{y1.split(' - ')[1]} → {y2}", plus confidences: max(p_stage1), max(p_stage2)
Pseudocode
spec = to_mel_spectrogram(wav)
img = preprocess(spec)                 # 224x224, /255.0
p1 = stage1_model(img)                 # [2]
y1 = argmax(p1)
is_abnormal = class_names_stage1[y1] == "00 - Abnormal"
head = abnormal_model if is_abnormal else normal_model
# Each head has its own class order, so pick the matching name list.
head_names = abnormal_class_names if is_abnormal else normal_class_names
p2 = head(img)                         # [num_subclasses]
y2 = argmax(p2)
return {
    "stage1_class": class_names_stage1[y1],
    "stage1_confidence": max(p1),
    "stage2_class": head_names[y2],
    "stage2_confidence": max(p2),
    "final_prediction": ...
}
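A runnable version of the routing logic, with stand-in models so it can be tried without trained weights, might look like this. The function name `predict_hierarchical`, the `label_meta` layout, and the stub callables are assumptions for illustration; real models would be the loaded Keras heads, and the `→` separator in the final string is likewise an assumed choice.

```python
import numpy as np

ABNORMAL = "00 - Abnormal"


def predict_hierarchical(img, stage1_model, heads, label_meta):
    """Run Stage-1, route to the matching Stage-2 head, and decode labels."""
    p1 = np.asarray(stage1_model(img)).ravel()
    y1 = label_meta["stage1"][int(p1.argmax())]

    key = "abnormal" if y1 == ABNORMAL else "normal"
    p2 = np.asarray(heads[key](img)).ravel()
    y2 = label_meta[key][int(p2.argmax())]   # decode with that head's order

    return {
        "stage1_class": y1,
        "stage1_confidence": float(p1.max()),
        "stage2_class": y2,
        "stage2_confidence": float(p2.max()),
        "final_prediction": f"{y1.split(' - ')[1]} → {y2}",
    }


# Stub models returning fixed probabilities (placeholders, not trained models).
label_meta = {
    "stage1": ["00 - Abnormal", "01 - Normal"],
    "abnormal": ["00-1 - Background noise", "00-2 - Dehydration mode noise",
                 "00-3 - Wash mode noise"],
    "normal": ["01-1 - Background noise", "01-2 - Dehydration mode noise",
               "01-3 - Wash mode noise"],
}
heads = {
    "abnormal": lambda x: np.array([[0.1, 0.7, 0.2]]),
    "normal": lambda x: np.array([[0.2, 0.2, 0.6]]),
}
img = np.zeros((1, 224, 224, 3), dtype=np.float32)  # batch of one image
result = predict_hierarchical(img, lambda x: np.array([[0.9, 0.1]]),
                              heads, label_meta)
```

Decoding Stage-2 with the class list of the head that was actually used avoids mixing up indices between the abnormal and normal heads.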
8) Evaluation
Per-stage metrics: accuracy, macro-F1, confusion matrices.
End-to-end metric: hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.
Calibration: reliability curves / ECE on max_softmax for Stage-1 and Stage-2; optionally apply temperature scaling.
Robustness checks: background noise levels, recording device variance, different drum loads.
Leakage control: ensure clips from the same recording session are in one split only.
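Two of the metrics above can be sketched in a few lines of NumPy. Both function names are hypothetical, and the ECE variant shown uses equal-width confidence bins, which is a common choice but an assumption here. Stage-2 labels are compared as full class names, since the two heads use separate index spaces.

```python
import numpy as np


def hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred):
    """Fraction of samples where both Stage-1 and Stage-2 are correct."""
    y1_ok = np.asarray(y1_true) == np.asarray(y1_pred)
    y2_ok = np.asarray(y2_true) == np.asarray(y2_pred)
    return float(np.mean(y1_ok & y2_ok))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over max-softmax confidences with equal-width bins on [0, 1]."""
    conf = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])   # values in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by bin population
    return float(ece)


# Toy example with made-up labels and confidences.
h_acc = hierarchical_accuracy(
    ["A", "A", "N", "N"], ["A", "N", "N", "N"],
    ["a1", "a2", "n1", "n2"], ["a1", "n1", "n1", "n3"],
)
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

Hierarchical accuracy can never exceed Stage-1 accuracy, so a large gap between the two points at the Stage-2 heads rather than the router.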
9) Deployment Considerations
App: Gradio front-end calls the same spectrogram + inference pipeline.
Artifacts: saved_models/{stage1,abnormal,normal}.h5 + saved_models/label_meta.json
Reproducibility: fixed audio/spectrogram params and consistent class order.
Latency: spectrogram generation dominates; keep n_fft/hop_length fixed and consider caching frequent uploads.
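One way to cache frequent uploads is to key results on a hash of the file's bytes, as in the sketch below. The class name `PredictionCache`, the cache size, and the least-recently-used eviction policy are assumptions; the source only suggests that caching may help.

```python
import hashlib
from collections import OrderedDict


class PredictionCache:
    """Small LRU cache keyed by the SHA-256 digest of the uploaded audio."""

    def __init__(self, max_items=128):
        self._items = OrderedDict()
        self._max_items = max_items

    def get_or_compute(self, audio_bytes, compute):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._items:
            self._items.move_to_end(key)       # mark as recently used
            return self._items[key]
        result = compute(audio_bytes)          # spectrogram + inference
        self._items[key] = result
        if len(self._items) > self._max_items:
            self._items.popitem(last=False)    # evict least recently used
        return result

    def __len__(self):
        return len(self._items)


# Stand-in for the real pipeline: records calls and returns a dummy result.
calls = []


def fake_pipeline(audio_bytes):
    calls.append(len(audio_bytes))
    return {"final_prediction": "placeholder"}


cache = PredictionCache(max_items=2)
first = cache.get_or_compute(b"same-bytes", fake_pipeline)
second = cache.get_or_compute(b"same-bytes", fake_pipeline)
```

Hashing the raw bytes means that re-encoded or trimmed versions of the same recording are treated as new inputs.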
10) Limitations & Future Work
Domain shift: different washers/rooms/mics can reduce accuracy; consider domain adaptation / augmentation.
Simple CNN: consider replacing it with MobileNetV2/EfficientNet, which may improve accuracy at comparable latency (to be verified on this dataset).
Sequence modeling: incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).
On-device: quantize models (TFLite) for edge deployment.