
Anvit25

Update methodology.md (#3)

7faa59d verified 8 months ago


	# Hierarchical Audio Classification for Washing Machine Sound Anomaly Detection - Methodology

	### 1) Problem Framing
	We treat washing-machine sound understanding as a two-stage hierarchical image classification task:

	1. Stage-1 (Coarse): Detect whether a sound is Abnormal or Normal from its Mel-spectrogram.
	2. Stage-2 (Fine): If Abnormal, classify the abnormal noise type (e.g., Dehydration mode noise, Wash mode noise). If Normal, classify the operating mode (e.g., Wash mode, Dehydration mode).

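	This decouples anomaly detection from mode identification: each Stage-2 head only has to separate the subclasses within its own branch, which is intended to reduce confusion between similar-sounding Normal and Abnormal subclasses.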

	---

	### 2) Data & Labeling

	- Source: Short `.wav` recordings of washing-machine cycles (mono).
	- Label Taxonomy:

	```text
	00 - Abnormal/
	├─ 00-1 - Background noise/
	├─ 00-2 - Dehydration mode noise/
	└─ 00-3 - Wash mode noise/

	01 - Normal/
	├─ 01-1 - Background noise/
	├─ 01-2 - Dehydration mode noise/
	└─ 01-3 - Wash mode noise/
	```


	- Granularity: Each file is a single clip labeled at the folder level.

	> To avoid data leakage, keep all clips from the same physical machine or recording session in a single split (group-aware split) rather than dividing them across the train and validation sets.
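
A group-aware split can be sketched with the standard library alone. This is a minimal illustration, not the project's actual splitting code: `group_aware_split` and the `group_of` callback are hypothetical names, and how a machine or session ID is derived from a file path is an assumption the caller has to supply.

```python
import random
from collections import defaultdict


def group_aware_split(paths, group_of, val_fraction=0.2, seed=42):
    """Split paths so every group lands entirely in train or in validation.

    group_of is a caller-supplied function mapping a path to its machine or
    session ID (hypothetical: the source does not say how IDs are encoded).
    """
    by_group = defaultdict(list)
    for path in paths:
        by_group[group_of(path)].append(path)

    groups = sorted(by_group)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(groups)

    # val_fraction is applied to the number of groups, not the number of clips.
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])

    train = [p for g in groups if g not in val_groups for p in by_group[g]]
    val = [p for g in groups if g in val_groups for p in by_group[g]]
    return train, val
```

For example, if file names started with a machine ID such as `m01_`, the callback could be `lambda p: p.split("_")[0]`. Because the fraction is taken over groups, the clip-level ratio will only be approximately 80/20 when groups differ in size.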

	---

	### 3) Preprocessing → Mel-Spectrograms

	- Audio params: `sr=22050`, `n_fft=2048`, `hop_length=512`, `n_mels=128`
	- Transform:
	  1. Load mono audio: \( y \in \mathbb{R}^{T} \)
	  2. Mel power spectrogram: \( S = \text{MelSpec}(y; sr, n\_mels, n\_fft, hop) \)
	  3. Log scaling (dB): \( S_{dB} = 10 \log_{10} \left(\frac{S}{\max(S)}\right) \)
	- Rendering: `librosa.display.specshow(S_db, cmap="magma")`, save to PNG, no axes, `224×224` target size.
	- Normalization: Divide pixel values by `255.0` at model input.

	All scripts use the same constants to ensure train/test consistency.
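
The three transform steps can be sketched as follows, assuming `librosa` is the audio library (the rendering step in this section suggests it). The function names are illustrative. The `amin` and `top_db` floors are an addition to the formula above, needed because silent bins would otherwise give \( \log_{10}(0) \); the values mirror the defaults of `librosa.power_to_db`.

```python
import numpy as np

# Constants from this section, shared by training and inference.
SR, N_FFT, HOP_LENGTH, N_MELS = 22050, 2048, 512, 128


def power_to_db_rel_max(S, amin=1e-10, top_db=80.0):
    """Step 3: 10 * log10(S / max(S)), floored so silent bins stay finite."""
    S = np.asarray(S, dtype=float)
    ref = max(float(S.max()), amin)
    S_db = 10.0 * np.log10(np.maximum(S, amin) / ref)
    return np.maximum(S_db, -top_db)   # the peak is 0 dB, so clip at -top_db


def wav_to_log_mel(path):
    """Steps 1-3 for one file; returns an array of shape (N_MELS, n_frames)."""
    import librosa  # imported lazily so the dB helper runs without librosa

    y, _ = librosa.load(path, sr=SR, mono=True)               # step 1
    S = librosa.feature.melspectrogram(                       # step 2 (power)
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return power_to_db_rel_max(S)                             # step 3
```

With these settings the output always has a 0 dB peak and a floor of -80 dB, so every clip is rendered against the same colour range regardless of recording level.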

	---

	### 4) Dataset Construction

	- Stage-1 dataset: `MelSpectrograms/` with the two top-level folders (`00 - Abnormal`, `01 - Normal`).
	- Stage-2 datasets:
	  - Abnormal head: `MelSpectrograms/00 - Abnormal/*`
	  - Normal head: `MelSpectrograms/01 - Normal/*`
	- Splits: `validation_split=0.2`, `seed=42` via `image_dataset_from_directory`.
	- Class Order: Persisted in `saved_models/label_meta.json` to guarantee consistent label ↔ index mapping at inference.
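
One way to persist the class order is sketched below. The JSON layout (`stage1` and `stage2` keys) is an assumption, since the actual schema of `label_meta.json` is not shown here. The helper relies on sorted sub-directory names, which should match the order `image_dataset_from_directory` infers; in a real training script, reading `class_names` from the returned dataset is the safer source.

```python
import json
from pathlib import Path


def class_names_from_dirs(root):
    """Sub-directory names in sorted order (the order Keras infers classes in)."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def save_label_meta(spectrogram_root, out_path):
    """Write Stage-1 and per-branch Stage-2 class orders to a JSON file."""
    root = Path(spectrogram_root)
    stage1 = class_names_from_dirs(root)
    meta = {
        "stage1": stage1,
        # One Stage-2 class list per Stage-1 class, keyed by folder name.
        "stage2": {name: class_names_from_dirs(root / name) for name in stage1},
    }
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8")
    return meta
```

At inference time the same file is loaded once and used to translate every `argmax` index back to a folder name.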

	---

	### 5) Models & Architecture

	Both stages use a compact CNN to keep inference light:

	- Backbone (per head):
	  - `Conv2D(32, 3×3) → ReLU → MaxPool(2×2)`
	  - `Conv2D(64, 3×3) → ReLU → MaxPool(2×2)`
	  - `Conv2D(128, 3×3) → ReLU → MaxPool(2×2)`
	  - `Flatten → Dense(128) → ReLU → Dropout(0.3) → Dense(num_classes) → Softmax`
	- Input: `224×224×3` spectrogram images
	- Loss: `SparseCategoricalCrossentropy`
	- Optimizer: `Adam`
	- Metrics: `Accuracy`

	> Rationale: A simple CNN is sufficient for a strong baseline; the hierarchy offloads fine-grained distinctions to specialized heads.
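
The layer list above maps onto Keras roughly as in the sketch below. Two things are assumptions: the padding mode (Keras defaults to `'valid'`, and the source does not say otherwise) and the function names. The small `conv_stack_output` helper reproduces the shape arithmetic so the size of the flattened feature vector can be checked without TensorFlow installed.

```python
def conv_stack_output(size=224, filters=(32, 64, 128), kernel=3, pool=2):
    """Spatial size and flattened length after the Conv -> MaxPool stages.

    Assumes 'valid' padding and stride 1 (the Keras defaults).
    """
    for _ in filters:
        size = size - kernel + 1   # 3x3 'valid' convolution
        size = size // pool        # 2x2 max pooling (floor division)
    return size, size * size * filters[-1]


def build_head(num_classes, input_shape=(224, 224, 3)):
    """Compact CNN used for Stage-1 (num_classes=2) and each Stage-2 head."""
    import tensorflow as tf  # imported lazily so the helper above runs without it

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Under these assumptions the feature map shrinks 224 → 111 → 54 → 26, so `Flatten` yields 26 × 26 × 128 = 86,528 values and `Dense(128)` alone holds 86,528 × 128 + 128 = 11,075,712 parameters. Most of the model's weight therefore sits in that one layer, which is worth keeping in mind for the on-device goals discussed later.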

	---

	### 6) Training Protocol

	- Stage-1: Train on `Normal` vs `Abnormal` spectrograms.
	- Stage-2 Abnormal: Train only on abnormal subclasses.
	- Stage-2 Normal: Train only on normal subclasses.
	- Epochs: `10` (baseline; tune as needed)
	- Batch size: `32`
	- Pipelines: `cache → (shuffle) → prefetch` with `tf.data.AUTOTUNE`
	- Checkpointing: Save each head to `saved_models/*.h5` and class orders to `label_meta.json`.

	Optional (recommended):
	- Augmentations: time masking, frequency masking, Gaussian noise on spectrograms, random time shifts on audio.
	- Class imbalance: oversampling minority subclasses or focal loss in Stage-2 heads.
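
Time and frequency masking can be sketched in NumPy as shown below. This is a simplified, SpecAugment-style illustration rather than the project's augmentation code: the function name and the maximum mask widths are hypothetical, and it operates on the dB array before it is rendered to an image. It should be applied to training clips only.

```python
import numpy as np


def mask_spectrogram(S_db, max_time_width=16, max_freq_width=8, rng=None):
    """Return a copy of S_db (n_mels x n_frames) with one frequency band and
    one time span overwritten by the array's minimum value."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(S_db, dtype=float, copy=True)
    n_mels, n_frames = out.shape
    fill = out.min()

    # Frequency mask: blank a random band of mel bins across all frames.
    f_width = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f_start = int(rng.integers(0, n_mels - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: blank a random span of frames across all mel bins.
    t_width = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out
```

Passing a seeded `np.random.default_rng(seed)` makes the masks reproducible across runs.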

	---

	### 7) Inference Flow (Hierarchical)

	Input: `.wav` → Mel-spectrogram → `224×224`

	1. Stage-1: `p_stage1 = f_stage1(img)` → `y1 = argmax(p_stage1)`

	2. Route:
	  - If `class_names_stage1[y1] == "00 - Abnormal"` → use `abnormal_model`
	  - Else → use `normal_model`

	3. Stage-2: `p_stage2 = f_head(img)` → `y2 = argmax(p_stage2)`

	4. Output:
	`final = f"{class_names_stage1[y1].split(' - ')[1]} → {class_names_stage2[y2]}"`
	plus confidences: `max(p_stage1)`, `max(p_stage2)`

	Pseudocode
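	```python
	import numpy as np

	# Assumed to exist elsewhere in the project: to_mel_spectrogram, preprocess,
	# the three loaded models, and the class-name lists read from label_meta.json
	# (class_names_abnormal / class_names_normal are illustrative names).
	def predict(wav):
	    spec = to_mel_spectrogram(wav)
	    img = preprocess(spec)[np.newaxis, ...]   # 224x224x3, /255.0, plus a batch axis

	    p1 = np.asarray(stage1_model(img))[0]     # shape [2]
	    y1 = int(np.argmax(p1))
	    stage1_class = class_names_stage1[y1]

	    # Route on the class name rather than the raw index.
	    if stage1_class == "00 - Abnormal":
	        head, class_names_stage2 = abnormal_model, class_names_abnormal
	    else:
	        head, class_names_stage2 = normal_model, class_names_normal

	    p2 = np.asarray(head(img))[0]             # shape [num_subclasses]
	    y2 = int(np.argmax(p2))
	    stage2_class = class_names_stage2[y2]

	    return {
	        "stage1_class": stage1_class,
	        "stage1_confidence": float(p1.max()),
	        "stage2_class": stage2_class,
	        "stage2_confidence": float(p2.max()),
	        "final_prediction": f"{stage1_class.split(' - ')[1]} → {stage2_class}",
	    }
	```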

	---

	### 8) Evaluation
	- Per-stage metrics: accuracy, macro-F1, confusion matrices.

	- End-to-end metric: hierarchical accuracy = % of samples where both Stage-1 and Stage-2 predictions are correct.

	- Calibration: reliability curves and expected calibration error (ECE) on the maximum softmax probability for Stage-1 and Stage-2; optionally apply temperature scaling.

	- Robustness checks: background noise levels, recording device variance, different drum loads.

	- Leakage control: ensure clips from the same recording session are in one split only.
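
The end-to-end metric and ECE can both be computed in a few lines of plain Python. This is a minimal sketch with illustrative function names; the choice of 10 equal-width confidence bins is an assumption, as the source does not fix a bin count.

```python
def hierarchical_accuracy(y_true, y_pred):
    """Fraction of samples whose (stage1, stage2) label pair is fully correct."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    hits = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, bool(ok)))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

Note that a sample with a correct Stage-2 name but the wrong Stage-1 branch counts as an error in `hierarchical_accuracy`, which is what makes it stricter than either per-stage accuracy.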

	---

	### 9) Deployment Considerations
	- App: Gradio front-end calls the same spectrogram + inference pipeline.

	- Artifacts: `saved_models/{stage1,abnormal,normal}.h5` + `saved_models/label_meta.json`

	- Reproducibility: fixed audio/spectrogram params and consistent class order.

	- Latency: spectrogram generation dominates; keep `n_fft`/`hop_length` fixed and consider caching results for frequently repeated uploads.
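
Caching repeated uploads could look like the sketch below, which keys results on a hash of the file's bytes so that identical uploads skip spectrogram generation and inference. The class name, the in-memory store, and the FIFO eviction policy are all assumptions for illustration; `predict_fn` stands in for whatever function the app calls on an upload.

```python
import hashlib


class PredictionCache:
    """In-memory cache keyed by the SHA-256 digest of the uploaded bytes."""

    def __init__(self, predict_fn, max_items=256):
        self._predict_fn = predict_fn
        self._max_items = max_items
        self._store = {}   # dicts keep insertion order, used for FIFO eviction
        self.hits = 0

    def __call__(self, wav_bytes):
        key = hashlib.sha256(wav_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]

        result = self._predict_fn(wav_bytes)
        if len(self._store) >= self._max_items:
            self._store.pop(next(iter(self._store)))   # evict the oldest entry
        self._store[key] = result
        return result
```

Hashing the raw bytes means two recordings of the same sound still miss the cache; only byte-identical files hit it.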

	---

	### 10) Limitations & Future Work
	- Domain shift: different washers/rooms/mics can reduce accuracy → consider domain adaptation / augmentation.

	- Simple CNN: consider replacing it with MobileNetV2 or EfficientNet, which may improve accuracy at comparable latency (to be verified by benchmarking).

	- Sequence modeling: incorporate temporal context (e.g., ConvLSTM / Transformer over spectrogram patches).

	- On-device: quantize models (TFLite) for edge deployment.