---
license: mit
tags:
- audio
- classification
- alarm
- siren
- ambulance
- police
- security
---

# audio-alert-detector

A tiny CNN that classifies a 10-second audio clip as alert (siren, smoke/fire alarm, car alarm, house/burglar alarm) or not-alert.
Designed to run on a Raspberry Pi Zero 2 under ONNX Runtime with no accelerator.

## Model

- Depthwise-separable CNN (~35k params, ~17M MACs)
- Input: raw mono 16 kHz PCM, 10 s = 160k samples
- Embedded log-mel frontend (no host-side preprocessing needed beyond resampling to 16 kHz)
- Two heads: binary (alert/not) + auxiliary subclass (siren/alarm)
- Δ + ΔΔ time-derivative input channels for onset/sweep dynamics
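
To illustrate the Δ and ΔΔ input channels, here is a minimal sketch of how time derivatives can be stacked with the static log-mel features. The card does not say how the derivatives are computed, so the central difference with clamped edges is an assumption, and `delta` and `stack_channels` are hypothetical helper names rather than part of the model.

```python
def delta(frames):
    """Central-difference time derivative of a [time][mel] feature matrix.

    Assumption: edges are handled by clamping the frame index. The real
    frontend may use a different (e.g. regression-based) delta filter.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_channels(log_mel):
    """Return three channels: static, delta, and delta-delta features."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return [log_mel, d1, d2]
```

A steady tone gives near-zero Δ, while a sweeping siren gives a sustained non-zero Δ, which is the kind of onset/sweep cue these channels are meant to expose.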

## Training data

- ~11k positives + ~60k negatives, 10 s each, 16 kHz mono
- AudioSet via [confit/audioset-full](https://huggingface.co/datasets/confit/audioset-full) HF mirror
- Targeted hard-negative lists of AudioSet label MIDs (bells, whistles, woodwinds, mechanical, instruments, animals, music) for known FP categories
- Curated supplemental positives: ~199 country-specific EAS alarms + ~107 nuclear/civil-defense sirens

## Training

- 40 epochs, AdamW (lr 3e-4, wd 1e-4), cosine LR schedule
- Batch 64, fixed 40% positive fraction per batch (uniform within each pool)
- Loss: BCE on the binary head + 0.3 × subclass CE (masked to positives)
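
The combined loss can be sketched in plain Python as follows. This is an illustration under assumptions: both terms are taken as means, the subclass term is averaged over positives only, and the function names are hypothetical. The card only states the 0.3 weight and the masking to positives.

```python
import math


def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]


def combined_loss(binary_logits, binary_targets,
                  subclass_logits, subclass_targets, aux_weight=0.3):
    """Mean BCE over the batch plus aux_weight * mean CE over positives."""
    n = len(binary_logits)
    bce = sum(bce_with_logit(z, y)
              for z, y in zip(binary_logits, binary_targets)) / n
    positives = [i for i, y in enumerate(binary_targets) if y == 1]
    if not positives:
        # No positives in this batch: the subclass term is fully masked.
        return bce
    ce = sum(cross_entropy(subclass_logits[i], subclass_targets[i])
             for i in positives) / len(positives)
    return bce + aux_weight * ce
```

Masking keeps negatives, which have no meaningful siren/alarm label, from contributing gradient to the subclass head.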

## Augmentation (mel-space)

- Random time-stretch (0.9–1.1×)
- Random gain (–45 to +15 dB)
- Frequency shift (±4 mel bins ≈ ±2 semitones)
- Companding (γ 0.75–1.25, p=0.3)
- SpecAugment time + freq masks
- Curated ambience overlay (rain / cafe / road traffic / mic noise floor / etc.) at 25% RMS, applied to both classes
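
Two of these augmentations, sketched on waveform samples for simplicity. The assumptions here are that "25% RMS" means the ambience is scaled to a quarter of the clip's RMS and that the gain is sampled uniformly in dB. The card lists these under mel-space augmentation, so the actual pipeline may apply them differently. Function names are hypothetical.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def overlay_ambience(clip, ambience, ratio=0.25):
    """Mix ambience into clip so the ambience RMS is `ratio` x the clip RMS."""
    clip_rms, amb_rms = rms(clip), rms(ambience)
    if clip_rms == 0.0 or amb_rms == 0.0:
        return list(clip)  # nothing to scale against
    scale = ratio * clip_rms / amb_rms
    return [c + scale * a for c, a in zip(clip, ambience)]


def random_gain(clip, low_db=-45.0, high_db=15.0, rng=random):
    """Apply a gain drawn uniformly (in dB) from [low_db, high_db]."""
    gain_db = rng.uniform(low_db, high_db)
    factor = 10.0 ** (gain_db / 20.0)
    return [c * factor for c in clip]
```

Applying the overlay to both classes, as the list above specifies, keeps the network from learning the ambience itself as a cue for either label.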

## Deployment

Single-file fp32 ONNX (348 KB). Input: `float32[batch, 160000]` raw 16 kHz mono PCM. Outputs: `binary_logit` + `subclass_logits[2]`. Apply sigmoid for alert probability.

Recommended: 10 s ring buffer, infer every 2 s, threshold 0.5, require 2 consecutive over-threshold windows to fire (eliminates almost all single-window FPs).
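
A minimal sketch of the recommended post-processing, independent of any inference library. `RingBuffer` and `AlertDebouncer` are hypothetical names. The caller is assumed to run the ONNX model on `window()` every 2 s and pass the resulting `binary_logit` to `update()`.

```python
import math
from collections import deque


def sigmoid(x):
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class RingBuffer:
    """Keeps the most recent `size` samples (10 s at 16 kHz by default)."""

    def __init__(self, size=160_000):
        self.buf = deque([0.0] * size, maxlen=size)

    def push(self, samples):
        self.buf.extend(samples)  # oldest samples fall off the front

    def window(self):
        return list(self.buf)


class AlertDebouncer:
    """Fires once `required` consecutive windows exceed `threshold`."""

    def __init__(self, threshold=0.5, required=2):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def update(self, binary_logit):
        prob = sigmoid(binary_logit)
        self.streak = self.streak + 1 if prob > self.threshold else 0
        # Stays True for as long as the streak continues.
        return self.streak >= self.required
```

With a 2 s hop, each call to `push` would receive 32,000 new samples. Requiring two consecutive windows delays the first alert by one hop in exchange for suppressing isolated spikes.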

## Performance

- Test set (225 curated clips): 88% acc / 89% prec / 83% rec / F1 0.86
- Pi Zero 2 (Cortex-A53, fp32, single thread): ~150 ms / 10 s window
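
As a quick sanity check on these figures, the short sketch below recomputes F1 from the reported precision and recall, and estimates the share of one core used when inferring every 2 s as recommended in the Deployment section. The duty-cycle figure is a back-of-the-envelope estimate that ignores audio capture and resampling overhead.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def duty_cycle(inference_ms, hop_s):
    """Fraction of one core spent on inference at a given hop interval."""
    return inference_ms / (hop_s * 1000.0)


f1 = f1_score(0.89, 0.83)       # rounds to the reported 0.86
load = duty_cycle(150.0, 2.0)   # about 7.5% of a single core
```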