Real-Time Audio Event Detection for Threat Assessment on Mobile Devices
White Paper: Edge Deployment, Model Architectures, and Multi-Modal Fusion with Wearable Physiological Signals
Author: Aditya Raikar
Overview
This repository contains a comprehensive research white paper on building a real-time audio event detection (AED) system for threat assessment on Android mobile devices. The paper covers:
- Edge vs. Cloud Deployment Analysis: Why edge-first deployment is optimal for threat detection (latency, privacy, offline operation)
- Model Architecture Taxonomy: From ultra-lightweight (6K params) to SOTA (90M params), with published AudioSet benchmarks
- Multi-Modal Fusion: Novel proposition to combine audio threat detection with smartwatch physiological signals (HR, HRV, EDA, accelerometer) for false-positive reduction
- Android System Architecture: Five-layer design with component-level specifications and latency budget
- Open-Source Prototypes & Datasets: Curated list of available resources for development
Key Findings
| Aspect | Recommendation |
|---|---|
| Deployment | Edge-first with optional cloud fallback |
| Primary Model | EfficientAT-MN10 (~4.9M params, 0.47 mAP AudioSet) |
| Alternative | YAMNet via MediaPipe (production-ready, TFLite native) |
| Fusion | Late fusion + attention gating with smartwatch vitals |
| Latency Target | < 100 ms end-to-end (achievable: ~50-75 ms) |
| Inference Engine | TensorFlow Lite with GPU/NNAPI delegate |
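The latency target in the table can be sanity-checked with a simple per-stage budget. The sketch below is illustrative only: the stage names and millisecond values are hypothetical placeholders chosen to land inside the ~50-75 ms range quoted above, not measurements or component names taken from the paper.

```python
# Hypothetical per-stage latencies in milliseconds. These values and stage
# names are illustrative assumptions, not figures from the white paper.
STAGE_BUDGET_MS = {
    "audio_capture_buffer": 20.0,
    "feature_extraction": 10.0,
    "model_inference": 25.0,
    "fusion_and_decision": 5.0,
    "alert_dispatch": 5.0,
}

TARGET_MS = 100.0  # end-to-end target from the Key Findings table


def check_budget(stages, target_ms):
    """Return (total, headroom, within_target) for a per-stage budget."""
    total = sum(stages.values())
    return total, target_ms - total, total < target_ms


total_ms, headroom_ms, ok = check_budget(STAGE_BUDGET_MS, TARGET_MS)
print(f"total={total_ms:.0f} ms, headroom={headroom_ms:.0f} ms, within target={ok}")
```

Replacing the placeholder values with on-device measurements shows at a glance which stage consumes the headroom when the target is missed.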
Novel Contribution
To our knowledge, no prior work has combined real-time audio event detection with smartwatch physiological signals specifically for threat assessment. This paper proposes using involuntary physiological responses (startle reflex, fight-or-flight) as confirmation signals for audio-detected threats.
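One possible reading of "late fusion + attention gating" is sketched below, under stated assumptions: each modality produces its own threat probability, and a gate converts per-modality quality scores into softmax weights before combining them. The function names, the quality scores, and the 0.6 alert threshold are hypothetical; in a trained system the gate would be learned rather than hand-set, and the paper's actual design may differ.

```python
import math

ALERT_THRESHOLD = 0.6  # hypothetical decision threshold


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_prob, physio_prob, audio_quality, physio_quality):
    """Late fusion: combine per-modality probabilities with gated weights.

    The quality arguments are hypothetical gate inputs (for example, a
    signal-quality score for each sensor); higher means more trusted.
    """
    w_audio, w_physio = softmax([audio_quality, physio_quality])
    return w_audio * audio_prob + w_physio * physio_prob


# Loud bang detected, but the wearer shows no physiological response.
unconfirmed = fuse(0.9, 0.1, audio_quality=0.0, physio_quality=0.0)
# Same audio score, now accompanied by a strong startle response.
confirmed = fuse(0.9, 0.8, audio_quality=0.0, physio_quality=0.0)

print(unconfirmed >= ALERT_THRESHOLD, confirmed >= ALERT_THRESHOLD)
```

With equal gate inputs the two modalities are averaged, so an audio-only detection falls below the threshold while a physiologically confirmed one passes. Lowering `physio_quality` (for example, when the watch is off-wrist) shifts the weight back toward the audio score.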
Key References
| Paper | Contribution | Venue / arXiv ID |
|---|---|---|
| EfficientAT (Schmid et al., 2023) | SOTA efficient audio tagging via distillation | ICASSP 2023 |
| BEATs (Chen et al., 2023) | 0.507 mAP AudioSet with acoustic tokenizers | ICML 2023 |
| AST (Gong et al., 2021) | Pure-attention audio classifier | 2104.01778 |
| PANNs (Kong et al., 2020) | Large-scale pretrained audio CNNs | 1912.10211 |
| Gunshot Detection (2025) | CNN-based firearm classification | 2506.20609 |
| WESAD Cross-Modality (2025) | 99.95% stress detection from wearables | 2502.18733 |
| CognitiveEMS (2024) | Multi-modal emergency assistant on edge | 2403.06734 |
| Cross-Modal Violence (2024) | Audio-visual anomaly detection fusion | 2412.20455 |
Open-Source Prototypes Referenced
- vivsvaan/Gunshot-Detection: CNN gunshot classifier
- fschmid56/EfficientAT: Efficient audio tagging models
- TensorFlow Audio Classification Android: TFLite demo app
- MediaPipe Audio Classifier: Production API
- microsoft/unilm/beats: BEATs SOTA model
- WJMatthew/WESAD: Wearable stress detection
How to Compile the Paper
pdflatex whitepaper.tex
pdflatex whitepaper.tex # Run twice for TOC and references
Requires: texlive-latex-extra, texlive-pictures (for TikZ diagrams)
License
This research white paper is provided for educational and research purposes.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "AdityaRaikar/threat-detection-audio-whitepaper"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.