Update README.md
README.md
CHANGED
|
@@ -1,3 +1,57 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
tags:
|
| 4 |
+
- audio
|
| 5 |
+
- classification
|
| 6 |
+
- alarm
|
| 7 |
+
- siren
|
| 8 |
+
- ambulance
|
| 9 |
+
- police
|
| 10 |
+
- security
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# audio-alert-detector
|
| 14 |
+
|
| 15 |
+
Tiny CNN that classifies a 10-second audio clip as alert (siren, smoke/fire alarm, car alarm, house/burglar alarm) vs not-alert.
|
| 16 |
+
Designed to run on a Raspberry Pi Zero 2 under ONNX Runtime, with no accelerator.
|
| 17 |
+
|
| 18 |
+
## Model
|
| 19 |
+
|
| 20 |
+
- Depthwise-separable CNN (~35k params, ~17M MACs)
|
| 21 |
+
- Input: raw mono 16 kHz PCM, 10 s = 160k samples
|
| 22 |
+
- Embedded log-mel frontend (no host-side preprocessing needed beyond resampling to 16 kHz)
|
| 23 |
+
- Two heads: binary (alert/not) + auxiliary subclass (siren/alarm)
|
| 24 |
+
- Δ + ΔΔ time-derivative input channels for onset/sweep dynamics
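
To illustrate the Δ and ΔΔ channels, here is a minimal sketch that stacks a log-mel matrix with its first and second time differences. The central-difference formula, the edge handling, and the names `delta` and `stack_with_deltas` are assumptions made for illustration; the README does not say how the model computes its derivatives.

```python
def delta(frames):
    """Central difference along time for a list of per-frame feature lists."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]       # repeat the edge frame at the start
        nxt = frames[min(t + 1, n - 1)]    # repeat the edge frame at the end
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_with_deltas(log_mel):
    """Return three channels: static features, delta, and delta-delta."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return [log_mel, d1, d2]


# Toy input: 4 frames with 1 mel bin each
channels = stack_with_deltas([[0.0], [1.0], [4.0], [9.0]])
```

The first-difference channel responds to onsets, and the second to changes in slope, which is the kind of dynamic a sweeping siren produces.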
|
| 25 |
+
|
| 26 |
+
## Training data
|
| 27 |
+
|
| 28 |
+
- ~11k positives + ~60k negatives, 10 s each, 16 kHz mono
|
| 29 |
+
- AudioSet via [confit/audioset-full](https://huggingface.co/datasets/confit/audioset-full) HF mirror
|
| 30 |
+
- Targeted hard-negative lists of AudioSet label IDs (mids) for known FP categories: bells, whistles, woodwinds, mechanical, instruments, animals, music
|
| 31 |
+
- Curated supplemental positives: ~199 country-specific EAS alarms + ~107 nuclear/civil-defense sirens
|
| 32 |
+
|
| 33 |
+
## Training
|
| 34 |
+
|
| 35 |
+
- 40 epochs, AdamW (lr 3e-4, wd 1e-4), cosine LR schedule
|
| 36 |
+
- Batch 64, fixed 40% positive fraction per batch (uniform within each pool)
|
| 37 |
+
- Loss: BCE on the binary head + 0.3 × CE on the subclass head (masked to positives)
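
The loss in the last bullet can be sketched in plain Python as follows. This is an illustration, not the training code: averaging the subclass term over the positives only, and skipping it when a batch has no positives, are assumptions about what "masked to positives" means, and all function names are hypothetical.

```python
import math


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def combined_loss(binary_logits, binary_targets, subclass_logits,
                  subclass_labels, aux_weight=0.3):
    """BCE over all clips plus aux_weight x CE over the positive clips only."""
    n = len(binary_logits)
    bce = sum(bce_with_logit(z, y)
              for z, y in zip(binary_logits, binary_targets)) / n
    positives = [i for i, y in enumerate(binary_targets) if y == 1]
    if not positives:
        return bce  # assumption: no subclass term without positives
    ce = sum(cross_entropy(subclass_logits[i], subclass_labels[i])
             for i in positives) / len(positives)
    return bce + aux_weight * ce


# One positive (subclass 0) and one negative (subclass label unused)
loss = combined_loss([0.0, 0.0], [1, 0],
                     [[0.0, 0.0], [5.0, -5.0]], [0, None])
```

Masking keeps negatives, which have no meaningful siren/alarm label, from contributing gradient to the subclass head.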
|
| 38 |
+
|
| 39 |
+
## Augmentation (mel-space)
|
| 40 |
+
|
| 41 |
+
- Random time-stretch (0.9–1.1×)
|
| 42 |
+
- Random gain (-45 to +15 dB)
|
| 43 |
+
- Frequency shift (±4 mel bins ≈ ±2 semitones)
|
| 44 |
+
- Companding (γ 0.75–1.25, p=0.3)
|
| 45 |
+
- SpecAugment time + freq masks
|
| 46 |
+
- Curated ambience overlay (rain / cafe / road traffic / mic noise floor / etc.) at 25% RMS, applied to both classes
|
| 47 |
+
|
| 48 |
+
## Deployment
|
| 49 |
+
|
| 50 |
+
Single-file fp32 ONNX (348 KB). Input: `float32[batch, 160000]` raw 16 kHz mono PCM. Outputs: `binary_logit` + `subclass_logits[2]`. Apply sigmoid for alert probability.
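
A minimal inference sketch using the `onnxruntime` Python package, under a few assumptions: the model path is whatever the caller passes in, the input tensor name is read from the session because the README does not give it, and the expected amplitude scaling of the PCM samples is not specified here. `load_session` and `alert_probability` are hypothetical helper names.

```python
import math


def sigmoid(x):
    """Map the binary logit to an alert probability without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def load_session(model_path):
    """Create a single-threaded CPU session for the ONNX file."""
    import onnxruntime as ort  # imported lazily; not needed for sigmoid()

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    return ort.InferenceSession(model_path, sess_options=opts,
                                providers=["CPUExecutionProvider"])


def alert_probability(session, pcm):
    """Run one 10 s window (160000 samples at 16 kHz) through the model."""
    import numpy as np

    x = np.asarray(pcm, dtype=np.float32).reshape(1, 160000)
    input_name = session.get_inputs()[0].name  # name not given in the README
    (binary_logit,) = session.run(["binary_logit"], {input_name: x})
    # The output shape is not documented, so flatten before indexing.
    return sigmoid(float(np.asarray(binary_logit).reshape(-1)[0]))
```

Create the session once at startup and reuse it for every window; building a session per call would add avoidable latency.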
|
| 51 |
+
|
| 52 |
+
Recommended: keep a 10 s ring buffer, run inference every 2 s with a threshold of 0.5, and require 2 consecutive over-threshold windows before firing (eliminates almost all single-window FPs).
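
The recommended firing rule can be sketched with the standard library alone. The class name `AlertDebouncer`, the strict `>` comparison, and the `deque`-based ring buffer are illustrative choices, not part of the released model.

```python
from collections import deque

SAMPLE_RATE = 16000
WINDOW_SAMPLES = 10 * SAMPLE_RATE  # 10 s window = 160000 samples

# Ring buffer: once full, appending new audio drops the oldest samples.
ring = deque(maxlen=WINDOW_SAMPLES)


class AlertDebouncer:
    """Fire only after `required` consecutive over-threshold windows."""

    def __init__(self, threshold=0.5, required=2):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def update(self, prob):
        """Feed one window's alert probability; return True while firing."""
        if prob > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any quiet window resets the count
        return self.streak >= self.required


debouncer = AlertDebouncer()
fired = [debouncer.update(p) for p in [0.9, 0.2, 0.7, 0.8, 0.9, 0.1]]
```

In this toy sequence the isolated 0.9 at the start does not fire; the alert fires only once 0.7 is followed by 0.8. With inference every 2 s, the rule adds roughly one hop of detection delay.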
|
| 53 |
+
|
| 54 |
+
## Performance
|
| 55 |
+
|
| 56 |
+
- Test set (225 curated clips): 88% acc / 89% prec / 83% rec / F1 0.86
|
| 57 |
+
- Pi Zero 2 (Cortex-A53, fp32, single thread): ~150 ms / 10 s window
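
As a quick sanity check on these figures, the snippet below recomputes F1 from the reported precision and recall, and estimates the single-core duty cycle if inference runs every 2 s as recommended under Deployment. The duty-cycle figure is derived here for illustration and is not a measurement from the README.

```python
precision, recall = 0.89, 0.83
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

latency_s = 0.150  # ~150 ms per 10 s window on the Pi Zero 2
hop_s = 2.0        # recommended inference interval
duty_cycle = latency_s / hop_s  # fraction of one core spent on inference

print(f"F1 = {f1:.2f}, duty cycle = {duty_cycle:.1%}")
```

The recomputed F1 rounds to the reported 0.86, and the estimate suggests inference would occupy under a tenth of one core at that interval.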
|