| --- |
| license: mit |
| tags: |
| - audio |
| - classification |
| - alarm |
| - siren |
| - ambulance |
| - police |
| - security |
| --- |
| |
| # audio-alert-detector |
|
|
| Tiny CNN that classifies a 10-second audio clip as alert (siren, smoke/fire alarm, car alarm, house/burglar alarm) vs not-alert. |
| Designed to run on a Raspberry Pi Zero 2 under ONNX Runtime with no accelerator. |
|
|
| ## Model |
|
|
| - Depthwise-separable CNN (~35k params, ~17M MACs) |
| - Input: raw mono 16 kHz PCM, 10 s = 160k samples |
| - Embedded log-mel frontend (no host-side preprocessing needed beyond resampling to 16 kHz) |
| - Two heads: binary (alert/not) + auxiliary subclass (siren/alarm) |
| - Δ + ΔΔ time-derivative input channels for onset/sweep dynamics |
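
To illustrate the Δ + ΔΔ input channels, here is a minimal NumPy sketch that stacks a log-mel spectrogram with its first and second time derivatives. It is not taken from the model: the function name `add_delta_channels` is hypothetical, and central differences via `np.gradient` are an assumption (the embedded frontend may use a wider regression window).

```python
import numpy as np


def add_delta_channels(log_mel: np.ndarray) -> np.ndarray:
    """Stack a log-mel spectrogram with its first and second time derivatives.

    log_mel: array of shape (n_mels, n_frames), n_frames >= 2.
    Returns: array of shape (3, n_mels, n_frames) = [static, delta, delta-delta].
    """
    # ASSUMPTION: simple finite differences along the time axis.
    delta = np.gradient(log_mel, axis=1)
    delta2 = np.gradient(delta, axis=1)
    return np.stack([log_mel, delta, delta2], axis=0)


# Toy input: every mel bin ramps linearly over time, so the delta channel is
# constant and the delta-delta channel is zero.
toy = np.tile(np.arange(8, dtype=np.float32), (4, 1))  # 4 mel bins, 8 frames
feats = add_delta_channels(toy)
```

A steady tone yields near-zero Δ and ΔΔ channels, while a siren sweep or an alarm onset produces structure in both, which is the kind of dynamics these channels are meant to expose.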
|
|
| ## Training data |
|
|
| - ~11k positives + ~60k negatives, 10 s each, 16 kHz mono |
| - AudioSet via [confit/audioset-full](https://huggingface.co/datasets/confit/audioset-full) HF mirror |
| - Targeted hard-negative MID lists (AudioSet label IDs covering bells, whistles, woodwinds, mechanical, instruments, animals, music) for known false-positive (FP) categories |
| - Curated supplemental positives: ~199 country-specific EAS alarms + ~107 nuclear/civil-defense sirens |
|
|
| ## Training |
|
|
| - 40 epochs, AdamW (lr 3e-4, wd 1e-4), cosine LR schedule |
| - Batch 64, fixed 40% positive fraction per batch (uniform within each pool) |
| - Loss: BCE on the binary head + 0.3 × subclass CE (masked to positives) |
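
The batch composition and the two-term loss above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the training code: the names `sample_batch` and `combined_loss` are hypothetical, the 40% positive fraction is rounded to 26 of 64, sampling is done with replacement, and the subclass term is averaged over the positives in the batch.

```python
import numpy as np


def sample_batch(rng, n_pos_pool, n_neg_pool, batch_size=64, pos_frac=0.4):
    """Draw pool indices for one batch, uniformly within each pool."""
    n_pos = round(batch_size * pos_frac)  # ASSUMPTION: 25.6 rounds to 26
    pos_idx = rng.integers(0, n_pos_pool, size=n_pos)
    neg_idx = rng.integers(0, n_neg_pool, size=batch_size - n_pos)
    return pos_idx, neg_idx


def bce_with_logits(logits, targets):
    # Numerically stable binary cross-entropy computed on raw logits.
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))


def combined_loss(binary_logit, subclass_logits, is_alert, subclass, aux_weight=0.3):
    """binary_logit: (B,), subclass_logits: (B, 2), is_alert: (B,) in {0, 1},
    subclass: (B,) int class index (any valid index where is_alert == 0)."""
    bce = bce_with_logits(binary_logit, is_alert.astype(np.float64)).mean()

    # Per-example softmax cross-entropy via a shifted log-sum-exp.
    shifted = subclass_logits - subclass_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(subclass)), subclass]

    # Mask: only positive clips contribute to the subclass term.
    mask = is_alert.astype(bool)
    ce_pos = ce[mask].mean() if mask.any() else 0.0
    return bce + aux_weight * ce_pos
```

Masking means a negative clip never pushes the subclass head toward siren or alarm; it only affects the binary head.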
|
|
| ## Augmentation (mel-space) |
|
|
| - Random time-stretch (0.9–1.1×) |
| - Random gain (–45 to +15 dB) |
| - Frequency shift (±4 mel bins ≈ ±2 semitones) |
| - Companding (γ 0.75–1.25, p=0.3) |
| - SpecAugment time + freq masks |
| - Curated ambience overlay (rain / cafe / road traffic / mic noise floor / etc.) at 25% RMS, applied to both classes |
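
Three of the augmentations above are simple enough to sketch directly on a mel spectrogram. The following is a hedged illustration only: it assumes a linear-power mel array of shape `(n_mels, n_frames)` with frequency increasing along axis 0, zero-fills bins vacated by the shift, and applies the transforms in an arbitrary order. The function names are hypothetical, and the actual pipeline may operate on log-mel values instead.

```python
import numpy as np


def random_gain(mel_power, rng, lo_db=-45.0, hi_db=15.0):
    """Scale the whole clip by a gain drawn uniformly in dB."""
    gain_db = rng.uniform(lo_db, hi_db)
    return mel_power * 10.0 ** (gain_db / 10.0)  # power scales as 10^(dB/10)


def compand(mel_power, rng, lo=0.75, hi=1.25, p=0.3):
    """With probability p, raise the spectrogram to a random exponent gamma."""
    if rng.random() >= p:
        return mel_power
    gamma = rng.uniform(lo, hi)
    return mel_power ** gamma


def freq_shift(mel, shift_bins):
    """Shift content up (positive) or down (negative) by whole mel bins."""
    out = np.zeros_like(mel)  # ASSUMPTION: vacated bins are filled with zeros
    if shift_bins > 0:
        out[shift_bins:] = mel[:-shift_bins]
    elif shift_bins < 0:
        out[:shift_bins] = mel[-shift_bins:]
    else:
        out[:] = mel
    return out


def augment(mel_power, rng, max_shift=4):
    """Apply gain, companding, and a random shift of up to +/- max_shift bins."""
    out = random_gain(mel_power, rng)
    out = compand(out, rng)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return freq_shift(out, shift)
```

Calling `augment(mel, np.random.default_rng())` returns an array of the same shape; time-stretch, SpecAugment masks, and the ambience overlay are left out of this sketch.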
|
|
| ## Deployment |
|
|
| Single-file fp32 ONNX (348 KB). Input: `float32[batch, 160000]` raw 16 kHz mono PCM. Outputs: `binary_logit` + `subclass_logits[2]`. Apply a sigmoid to `binary_logit` to get the alert probability. |
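
A minimal inference sketch with the ONNX Runtime Python API might look like the following. The output name `binary_logit` comes from the description above; the helper names and the model path passed to `load_session` are hypothetical, and the card does not state whether samples should be scaled to [-1, 1], so check that against your copy of the model.

```python
import numpy as np

WINDOW_SAMPLES = 160_000  # 10 s at 16 kHz


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def load_session(model_path):
    """Create a CPU-only ONNX Runtime session (pip install onnxruntime)."""
    import onnxruntime as ort  # imported lazily so the helpers work without it

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def alert_probability(session, pcm_16k):
    """Run one 10 s window through the model and return P(alert)."""
    pcm_16k = np.asarray(pcm_16k)
    if pcm_16k.shape != (WINDOW_SAMPLES,):
        raise ValueError("expected exactly 10 s of 16 kHz mono audio")
    batch = pcm_16k.astype(np.float32)[np.newaxis, :]  # float32[1, 160000]
    input_name = session.get_inputs()[0].name  # avoids hard-coding the input name
    (binary_logit,) = session.run(["binary_logit"], {input_name: batch})
    # reshape(-1) tolerates an output of shape [batch] or [batch, 1].
    return float(sigmoid(np.asarray(binary_logit).reshape(-1)[0]))
```

Typical use would be `session = load_session("model.onnx")` (path is a placeholder) followed by `alert_probability(session, window)` for each 10 s window.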
|
|
| Recommended: 10 s ring buffer, infer every 2 s, threshold 0.5, and require 2 consecutive over-threshold windows before firing (eliminates almost all single-window FPs). |
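
The recommended streaming setup can be sketched as a rolling 10 s window plus a small debouncer. The class names `RollingWindow` and `AlertDebouncer` are hypothetical, and the buffer uses a simple copy-on-push via `np.roll` rather than a true circular index, which should be cheap enough at one push every 2 s.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW = 10 * SAMPLE_RATE  # samples kept in the buffer
HOP = 2 * SAMPLE_RATE      # new samples between inferences


class RollingWindow:
    """Keeps the most recent `size` samples; starts out as silence."""

    def __init__(self, size=WINDOW):
        self.buf = np.zeros(size, dtype=np.float32)

    def push(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        n = len(chunk)
        if n >= len(self.buf):
            self.buf[:] = chunk[-len(self.buf):]
        elif n > 0:
            self.buf = np.roll(self.buf, -n)  # drop the oldest n samples
            self.buf[-n:] = chunk

    def window(self):
        return self.buf.copy()


class AlertDebouncer:
    """Fires only after `required` consecutive over-threshold windows."""

    def __init__(self, threshold=0.5, required=2):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def update(self, prob):
        self.streak = self.streak + 1 if prob > self.threshold else 0
        return self.streak >= self.required
```

In a capture loop, each `HOP`-sized chunk would be pushed into the `RollingWindow`, the current window scored by the model, and the resulting probability passed to `AlertDebouncer.update`, which returns `True` once the alert should fire.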
|
|
| ## Performance |
|
|
| - Test set (225 curated clips): 88% acc / 89% prec / 83% rec / F1 0.86 |
| - Pi Zero 2 (Cortex-A53, fp32, single thread): ~150 ms / 10 s window |
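
As a quick sanity check on the figures above, the reported F1 follows from the stated precision and recall, and the per-window latency implies a low CPU duty cycle when inferring every 2 s. The snippet only re-derives numbers already given in this card; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.89, 0.83)
print(round(f1, 2))  # 0.86

# ~150 ms of compute for each 2 s hop between inferences.
duty_cycle = 0.150 / 2.0
print(f"{duty_cycle:.1%}")  # 7.5%
```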
|
|