new_audio / methodology.md

Anvit25

Update methodology.md (#3)

7faa59d verified 8 months ago

Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology

1) Problem Framing

We treat washing-machine sound understanding as a two-stage hierarchical image classification task:

Stage-1 (Coarse): Detect whether a sound is Abnormal or Normal from its Mel-spectrogram.
Stage-2 (Fine): If Abnormal, classify the abnormal subclass (e.g., Dehydration mode noise, Wash mode noise). If Normal, classify the operating mode (e.g., Wash mode, Dehydration mode).

This decouples anomaly detection from mode identification, so each Stage-2 head only has to separate the three subclasses of its own branch instead of all six leaf classes.

2) Data & Labeling

Source: Short .wav recordings of washing-machine cycles (mono).
Label Taxonomy:

00 - Abnormal/
├─ 00-1 - Background noise/
├─ 00-2 - Dehydration mode noise/
└─ 00-3 - Wash mode noise/

01 - Normal/
├─ 01-1 - Background noise/
├─ 01-2 - Dehydration mode noise/
└─ 01-3 - Wash mode noise/

Granularity: Each file is a single clip labeled at the folder level.

To avoid data leakage, clips from the same physical machine or recording session should stay in a single split, never divided between train and validation (group-aware split).
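
A group-aware split can be sketched as follows: whole recording sessions, rather than individual clips, are assigned to train or validation. The `session_of` helper and the `<session>_<clip>.wav` naming pattern are assumptions made for illustration, since the source does not say how sessions are encoded. Note that `val_fraction` here applies to the number of groups, not the number of clips.

```python
import random
from collections import defaultdict
from pathlib import Path


def session_of(path):
    """Hypothetical convention: 's3_007.wav' belongs to session 's3'."""
    return Path(path).stem.split("_")[0]


def group_split(paths, val_fraction=0.2, seed=42, group_fn=session_of):
    """Split clips so every group lands entirely in train or in validation.

    Assumes at least two groups; with a single group everything goes to validation.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[group_fn(p)].append(p)

    keys = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_val = max(1, round(len(keys) * val_fraction))
    val_keys = set(keys[:n_val])

    train = [p for k in keys if k not in val_keys for p in groups[k]]
    val = [p for k in keys if k in val_keys for p in groups[k]]
    return train, val


# Five made-up sessions with four clips each.
clips = [f"s{m}_{i:03d}.wav" for m in range(5) for i in range(4)]
train, val = group_split(clips)
```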

3) Preprocessing → Mel-Spectrograms

Audio params: sr=22050, n_fft=2048, hop_length=512, n_mels=128
Transform:
1. Load mono audio: \( y \in \mathbb{R}^{T} \)
2. Mel power spectrogram: \( S = \text{MelSpec}(y;\ \text{sr},\ \text{n\_mels},\ \text{n\_fft},\ \text{hop\_length}) \)
3. Log scaling (dB): \( S_{\text{dB}} = 10 \log_{10}\left(\frac{S}{\max(S)}\right) \)
Rendering: librosa.display.specshow(S_db, cmap="magma"), save to PNG, no axes, 224×224 target size.
Normalization: Divide pixel values by 255.0 at model input.

All scripts use the same constants to ensure train/test consistency.
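
The transform above can be sketched in Python as follows. `log_scale_db` spells out step 3 in NumPy; its `amin` and `top_db` floors copy the defaults of `librosa.power_to_db` and are an assumption, as are the function names and the choice to render at 2.24 in × 100 dpi so the PNG comes out at 224×224 pixels directly. The heavier libraries are imported inside `wav_to_png` so the dB helper can be used without them.

```python
import numpy as np

# Constants from the "Audio params" line above.
SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128


def log_scale_db(S, amin=1e-10, top_db=80.0):
    """Step 3: 10 * log10(S / max(S)), floored so that log(0) never occurs.

    The amin/top_db floors follow the defaults of librosa.power_to_db;
    whether the project keeps those defaults is an assumption.
    """
    S = np.asarray(S, dtype=np.float64)
    ref = max(amin, float(S.max()))
    S_db = 10.0 * np.log10(np.maximum(amin, S)) - 10.0 * np.log10(ref)
    return np.maximum(S_db, S_db.max() - top_db)


def wav_to_png(wav_path, png_path):
    """Steps 1-3 plus rendering; needs librosa and matplotlib installed."""
    import librosa
    import librosa.display
    import matplotlib
    matplotlib.use("Agg")                 # headless rendering
    import matplotlib.pyplot as plt

    y, sr = librosa.load(wav_path, sr=SR, mono=True)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    S_db = log_scale_db(S)

    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)   # 2.24 in x 100 dpi
    ax = fig.add_axes([0, 0, 1, 1])                   # plot fills the canvas
    librosa.display.specshow(S_db, sr=sr, hop_length=HOP_LENGTH, cmap="magma", ax=ax)
    ax.axis("off")                                    # no axes, ticks, or frame
    fig.savefig(png_path, dpi=100)
    plt.close(fig)
```

Keeping the constants in one module that every script imports is one way to get the train/test consistency mentioned above.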

4) Dataset Construction

Stage-1 dataset: MelSpectrograms/ with the two top-level folders (00 - Abnormal, 01 - Normal).
Stage-2 datasets:
Abnormal head: MelSpectrograms/00 - Abnormal/*
Normal head: MelSpectrograms/01 - Normal/*
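Splits: validation_split=0.2, seed=42 via image_dataset_from_directory. This split is random per image, so it does not by itself enforce the group-aware split recommended in Section 2.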
Class Order: Persisted in saved_models/label_meta.json to guarantee consistent label ↔ index mapping at inference.
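
A minimal sketch of loading a split and persisting the class order is shown below. The JSON layout (`stage1`, `abnormal`, `normal` keys) and the function names are hypothetical; the source only states that the order is stored in saved_models/label_meta.json. The sketch assumes the sorted sub-folder order matches what Keras infers; when training with Keras, the dataset's `class_names` attribute is the authoritative order and can be written out instead.

```python
import json
from pathlib import Path


def class_order(root):
    """Sub-folder names in sorted order (assumed to match Keras' inferred labels)."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def write_label_meta(spec_root, out_path):
    """Persist the class order of all three heads (hypothetical JSON layout)."""
    spec_root = Path(spec_root)
    meta = {
        "stage1": class_order(spec_root),
        "abnormal": class_order(spec_root / "00 - Abnormal"),
        "normal": class_order(spec_root / "01 - Normal"),
    }
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


def load_split(directory, subset):
    """Load the 'training' or 'validation' subset with the settings listed above."""
    import tensorflow as tf               # imported lazily; only needed for this step

    return tf.keras.utils.image_dataset_from_directory(
        directory,
        validation_split=0.2,
        subset=subset,
        seed=42,                          # same seed for both subsets keeps them disjoint
        image_size=(224, 224),
        batch_size=32,
    )
```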

5) Models & Architecture

Both stages use a compact CNN to keep inference light:

Backbone (per head):
Conv2D(32, 3×3) → ReLU → MaxPool(2×2)
Conv2D(64, 3×3) → ReLU → MaxPool(2×2)
Conv2D(128, 3×3) → ReLU → MaxPool(2×2)
Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax
Input: 224×224×3 spectrogram images
Loss: SparseCategoricalCrossentropy
Optimizer: Adam
Metrics: Accuracy

Rationale: A simple CNN keeps the baseline small and fast; the hierarchy offloads fine-grained distinctions to specialized heads.
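
The layer list above could be written in Keras roughly as follows. The source does not state the convolution padding, so this sketch assumes the Keras default (`"valid"`); `build_head` and `flatten_size` are illustrative names. TensorFlow is imported inside `build_head` so the size helper runs without it.

```python
def flatten_size(side=224, n_blocks=3, last_filters=128):
    """Length of the Flatten output, assuming 'valid' 3x3 convs and 2x2 pools."""
    for _ in range(n_blocks):
        side = (side - 2) // 2            # conv trims 2 pixels, pool halves (floor)
    return side * side * last_filters


def build_head(num_classes, input_shape=(224, 224, 3)):
    """Compact CNN head following the layer list above (needs TensorFlow)."""
    import tensorflow as tf               # imported lazily; only needed here

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # expects probabilities
        metrics=["accuracy"],
    )
    return model


print(flatten_size())   # 26 * 26 * 128 = 86528 inputs to Dense(128)
```

Under that padding assumption, Dense(128) alone holds 86528 × 128 + 128 = 11,075,712 weights, so nearly all parameters of each head sit in the first dense layer.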

6) Training Protocol

Stage-1: Train on Normal vs Abnormal spectrograms.
Stage-2 Abnormal: Train only on abnormal subclasses.
Stage-2 Normal: Train only on normal subclasses.
Epochs: 10 (baseline; tune as needed)
Batch size: 32
Pipelines: cache → (shuffle) → prefetch with tf.data.AUTOTUNE
Checkpointing: Save each head to saved_models/*.h5 and class orders to label_meta.json.

Optional (recommended):

Augmentations: time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
Class imbalance: oversampling minority subclasses or focal loss in Stage-2 heads.
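
Two of the optional items can be illustrated in NumPy: frequency/time masking on a dB spectrogram (in the style of SpecAugment) and the focal loss. The mask sizes `max_f` and `max_t`, the fill value, and `gamma=2.0` are assumptions chosen for illustration, not settings from the project; a training version of the loss would be written with TensorFlow ops instead.

```python
import numpy as np


def mask_spectrogram(S_db, max_f=16, max_t=32, rng=None):
    """Blank one random frequency band and one random time span."""
    if rng is None:
        rng = np.random.default_rng()
    out = S_db.copy()
    n_mels, n_frames = out.shape
    fill = out.min()                                   # mask with the quietest value

    f = rng.integers(0, min(max_f, n_mels) + 1)        # band height in mel bins
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = fill

    t = rng.integers(0, min(max_t, n_frames) + 1)      # span width in frames
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out


def focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    """Mean focal loss -(1 - p_t)^gamma * log(p_t) for integer labels."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0)
    labels = np.asarray(labels)
    p_t = probs[np.arange(len(labels)), labels]        # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy; larger values shrink the contribution of samples the model already classifies confidently.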

7) Inference Flow (Hierarchical)

Input: .wav → Mel-spectrogram → 224×224

Stage-1: p_stage1 = f_stage1(img) → y1 = class_names_stage1[argmax(p_stage1)]
Route:

If y1 == "00 - Abnormal" → use abnormal_model
Else → use normal_model

Stage-2: p_stage2 = f_head(img) → y2 = class_names_stage2[argmax(p_stage2)], using the class list of the selected head
Output:
final = f"{y1.split(' - ')[1]} → {y2}"
plus confidences: max(p_stage1), max(p_stage2)

Pseudocode

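spec = to_mel_spectrogram(wav)
img  = preprocess(spec)                    # 224x224, /255.0

p1 = stage1_model(img)                     # [2]
y1 = class_names_stage1[argmax(p1)]        # e.g. "00 - Abnormal"

# Route to the head, and its class list from label_meta.json, that matches Stage-1.
if y1 == "00 - Abnormal":
    head, class_names_stage2 = abnormal_model, class_names_abnormal
else:
    head, class_names_stage2 = normal_model, class_names_normal

p2 = head(img)                             # [num_subclasses]
y2 = class_names_stage2[argmax(p2)]

return {
    "stage1_class": y1,
    "stage1_confidence": max(p1),
    "stage2_class": y2,
    "stage2_confidence": max(p2),
    "final_prediction": f"{y1.split(' - ')[1]} → {y2}",
}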

8) Evaluation

Per-stage metrics: accuracy, macro-F1, confusion matrices.
End-to-end metric: hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.
Calibration: reliability curves / ECE on max_softmax for Stage-1 and Stage-2; optionally apply temperature scaling.
Robustness checks: background noise levels, recording device variance, different drum loads.
Leakage control: ensure clips from the same recording session are in one split only.
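
The end-to-end metric and the calibration measure above can be computed as in the sketch below. The function names and the choice of 10 equal-width bins are assumptions; the inputs are plain arrays of labels, max-softmax confidences, and 0/1 correctness flags.

```python
import numpy as np


def hierarchical_accuracy(y1_true, y1_pred, y2_true, y2_pred):
    """Fraction of samples where both the Stage-1 and Stage-2 labels are correct."""
    y1_ok = np.asarray(y1_true) == np.asarray(y1_pred)
    y2_ok = np.asarray(y2_true) == np.asarray(y2_pred)
    return float(np.mean(y1_ok & y2_ok))


def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    confidence = np.asarray(confidence, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)   # max-softmax is always > 0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap                    # weight by share of samples
    return float(ece)
```

Because a Stage-1 error sends the clip to the wrong head, hierarchical accuracy can never exceed Stage-1 accuracy, which makes it a stricter summary than either per-stage number.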

9) Deployment Considerations

App: Gradio front-end calls the same spectrogram + inference pipeline.
Artifacts: saved_models/{stage1,abnormal,normal}.h5 + saved_models/label_meta.json
Reproducibility: fixed audio/spectrogram params and consistent class order.
Latency: spectrogram generation dominates; keep n_fft/hop_length fixed and consider caching frequent uploads.
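
One way to cache frequent uploads is to key predictions on a hash of the file contents, so a re-uploaded clip skips spectrogram generation entirely. This is a sketch under assumptions: `PredictionCache` is a hypothetical helper, `predict_fn` stands for whatever function runs the spectrogram + inference pipeline on raw bytes, and the eviction policy (drop the oldest entry) is only one simple option.

```python
import hashlib


class PredictionCache:
    """Memoize predictions by the SHA-256 of the uploaded file's bytes."""

    def __init__(self, predict_fn, max_items=256):
        self.predict_fn = predict_fn      # hypothetical: bytes -> prediction
        self.max_items = max_items
        self._store = {}                  # dicts preserve insertion order

    def __call__(self, wav_bytes):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key not in self._store:
            if len(self._store) >= self.max_items:
                self._store.pop(next(iter(self._store)))   # evict the oldest entry
            self._store[key] = self.predict_fn(wav_bytes)
        return self._store[key]
```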

10) Limitations & Future Work

Domain shift: different washers/rooms/mics can reduce accuracy → consider domain adaptation / augmentation.
Simple CNN: a MobileNetV2 or EfficientNet backbone may improve accuracy at similar latency.
Sequence modeling: incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).
On-device: quantize models (TFLite) for edge deployment.