---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: audio-classification
tags:
- keyword-spotting
- wake-word
- edge-ai
- tinyml
- onnx
- microcontroller
- speech
- mlperf-tiny
- dscnn
datasets:
- speech_commands
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: constant-wake-0.5
  results:
  - task:
      type: audio-classification
      name: Keyword Spotting
    dataset:
      type: speech_commands
      name: Google Speech Commands v0.02
      split: test
    metrics:
    - type: accuracy
      value: 99.83
    - type: f1
      value: 0.950
    - type: precision
      value: 0.978
    - type: recall
      value: 0.923
---

# Constant Wake 0.5: 180 KB Spoken Wake Word Detection

A **180 KB** keyword spotting model that detects the wake word "marvin" with **99.83% accuracy** and **zero false positives** in streaming evaluation. Built for microcontrollers.

| Metric | Value |
|--------|-------|
| **Test Accuracy** | 99.83% |
| **Precision** | 97.83% |
| **Recall** | 92.31% |
| **F1** | 0.950 |
| **Model Size** | 180 KB (ONNX) |
| **Parameters** | 45,570 |
| **Streaming FP** | 0 (target: ≤8) |
| **Streaming FN** | 1 (target: ≤8) |
| **MLPerf Tiny Target** | ≥95% accuracy, exceeded by 4.83 points |

## Architecture

**1D Depthwise Separable CNN (DS-CNN)** with an energy-gated cascade:

```
Audio Input
  ↓ Energy Gate (silence rejection)
  ↓ FFT Feature Extraction
  ↓ 1D DS-CNN (64 channels)
  ↓ Classification (wake / not-wake)
```

- **Stage 1**: Energy-based silence gating (short-time energy, STE), which rejects silence frames before any CNN computation
- **Stage 2**: FFT feature extraction, producing MFCC-like spectral features
- **Stage 3**: 1D depthwise separable CNN with 64 channels, highly parameter-efficient
- **Total**: 45,570 parameters in 180 KB

The cascade architecture means the CNN only activates on non-silent frames, dramatically reducing power consumption on always-listening devices.

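Stage 1 can be sketched as a simple short-time-energy threshold. The model's actual gate is not published, so the frame length and threshold below are illustrative assumptions, not the shipped values:

```python
import numpy as np

def short_time_energy(frame: np.ndarray) -> float:
    """Mean squared amplitude of a single audio frame."""
    return float(np.mean(frame.astype(np.float64) ** 2))

def energy_gate(frames, threshold=1e-4):
    """Yield only the frames loud enough to justify running the CNN."""
    for i, frame in enumerate(frames):
        if short_time_energy(frame) >= threshold:
            yield i, frame

# Silence never reaches the CNN; only the speech-like frame passes.
silence = np.zeros(400, dtype=np.float32)
speech = 0.1 * np.random.default_rng(0).standard_normal(400).astype(np.float32)
passed = [i for i, _ in energy_gate([silence, speech, silence])]
print(passed)  # → [1]
```

Because the gate is a single mean-of-squares per frame, it costs a few hundred multiply-adds, orders of magnitude less than one CNN forward pass.
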
## Benchmark Results

### Classification (Static Test Set)

| | Count |
|---|---|
| True Positives | 180 |
| False Positives | 4 |
| True Negatives | 10,806 |
| False Negatives | 15 |

### Streaming Evaluation (200 s continuous audio)

| Metric | Result | Target | Status |
|--------|--------|--------|--------|
| False Positives | 0 | ≤8 | **PASS** |
| False Negatives | 1 | ≤8 | **PASS** |
| CNN Activations | 3 | n/a | Ultra-low power |

Only **3 CNN activations** in 200 seconds of streaming: the energy gate rejects 98.5% of frames before they reach the CNN.

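The streaming cascade measured above amounts to a loop that consults the gate before every CNN call. A minimal sketch, where `classify` is a hypothetical stand-in for the DS-CNN and the threshold is illustrative:

```python
import numpy as np

def stream_keyword_spotter(frames, classify, threshold=1e-4):
    """Run the two-stage cascade over a frame stream.

    `classify(frame) -> bool` stands in for the DS-CNN; the gate keeps
    silent frames from ever invoking it, which is where the power
    savings of the cascade come from.
    """
    cnn_activations, detections = 0, []
    for i, frame in enumerate(frames):
        if np.mean(frame ** 2) < threshold:
            continue  # rejected by the energy gate: zero CNN cost
        cnn_activations += 1
        if classify(frame):
            detections.append(i)
    return cnn_activations, detections

# Ten mostly silent frames with one loud frame at index 4.
frames = [np.zeros(400, dtype=np.float32) for _ in range(10)]
frames[4] = np.full(400, 0.05, dtype=np.float32)
activations, hits = stream_keyword_spotter(frames, classify=lambda f: True)
print(activations, hits)  # → 1 [4]
```

On a real device, counting `cnn_activations` the same way is a cheap proxy for duty cycle and hence average power.
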
## Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load the model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC-like features; the exact shape depends on your audio
# preprocessing. Typical: [batch, time_steps, n_mfcc].
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on dummy features. Dynamic dims appear as strings or
# None in ONNX metadata, so substitute 1 for anything non-integer.
features = np.random.randn(
    *[d if isinstance(d, int) else 1 for d in input_shape]
).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```

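The card does not ship the exact front end, so if you need placeholder features for experimentation, a rough FFT-based stand-in might look like the following. The frame length, hop, and band count are assumptions, not the model's true preprocessing:

```python
import numpy as np

def fft_features(audio, frame_len=400, hop=160, n_bins=40):
    """Log power-spectrum frames: a crude stand-in for an MFCC front end."""
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Pool FFT bins into n_bins coarse bands, then log-compress.
        bands = np.array_split(spectrum, n_bins)
        frames.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.stack(frames).astype(np.float32)

audio = np.random.randn(16000).astype(np.float32)  # 1 s at 16 kHz
feats = fft_features(audio)
print(feats.shape)  # → (98, 40)
```

For results matching the reported accuracy, use the same feature pipeline the model was trained with rather than this sketch.
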
## Hardware Targets

| Platform | Expected Latency | Power |
|----------|------------------|-------|
| ARM Cortex-M4 (STM32L4) | <15 ms | <1 mW (with energy gate) |
| ARM Cortex-M7 (STM32H7) | <5 ms | <2 mW |
| ESP32-S3 | <10 ms | <5 mW |
| Raspberry Pi Pico | <20 ms | <0.5 mW |

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.

## MLPerf Tiny Compliance

This model targets the **Keyword Spotting (KWS)** benchmark from [MLPerf Tiny](https://mlcommons.org/benchmarks/inference-tiny/):

- **Dataset**: Google Speech Commands v0.02
- **Task**: Streaming keyword detection
- **Target**: ≥95% accuracy with ≤8 FP and ≤8 FN in streaming
- **Result**: 99.83% accuracy, 0 FP, 1 FN (**all targets exceeded**)

## Training Details

- **Dataset**: Google Speech Commands v0.02 (65,000+ 1-second audio clips)
- **Wake word**: "marvin"
- **Architecture**: Energy-Gated 1D DS-CNN, 64 channels
- **Epochs**: 30
- **Hardware**: NVIDIA RTX 4090

## Use Cases

- **Smart home devices**: always-on wake word detection at <1 mW
- **Wearables**: hearing aids, fitness bands, smartwatches
- **Industrial IoT**: voice-activated controls in noisy environments
- **Automotive**: in-cabin voice trigger without cloud connectivity
- **Medical devices**: hands-free activation for clinical tools

## Citation

```bibtex
@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}
```

## License

Apache 2.0. Free to use in commercial and non-commercial projects.

## Links

- [ConstantOne AI](https://constantone.ai)
- [Constant Edge 0.5 (Sentiment)](https://huggingface.co/ConstantQJ/constant-edge-0.5): 1.46 MB sentiment analysis
- [API Documentation](https://constantone.ai/docs.html)