---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: audio-classification
tags:
- keyword-spotting
- wake-word
- edge-ai
- tinyml
- onnx
- microcontroller
- speech
- mlperf-tiny
- dscnn
datasets:
- speech_commands
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: constant-wake-0.5
  results:
  - task:
      type: audio-classification
      name: Keyword Spotting
    dataset:
      type: speech_commands
      name: Google Speech Commands v0.02
      split: test
    metrics:
    - type: accuracy
      value: 99.83
    - type: f1
      value: 0.950
    - type: precision
      value: 0.978
    - type: recall
      value: 0.923
---

# Constant Wake 0.5: 180 KB Spoken Wake Word Detection

A **180 KB** keyword spotting model that detects the wake word "marvin" with **99.83% accuracy** and **zero false positives** in streaming evaluation. Built for microcontrollers.

| Metric | Value |
|--------|-------|
| **Test Accuracy** | 99.83% |
| **Precision** | 97.83% |
| **Recall** | 92.31% |
| **F1** | 0.950 |
| **Model Size** | 180 KB (ONNX) |
| **Parameters** | 45,570 |
| **Streaming FP** | 0 (target: ≤8) |
| **Streaming FN** | 1 (target: ≤8) |
| **MLPerf Tiny Target** | ≥95% accuracy, exceeded by 4.83 points |

## Architecture

**1D Depthwise Separable CNN (DS-CNN)** with an energy-gated cascade:

```
Audio Input
  ↓ Energy Gate (silence rejection)
  ↓ FFT Feature Extraction
  ↓ 1D DS-CNN (64 channels)
  ↓ Classification (wake / not-wake)
```

- **Stage 1**: Energy-based silence gating (short-time energy, STE), which rejects silence frames before any CNN computation
- **Stage 2**: FFT feature extraction, producing MFCC-like spectral features
- **Stage 3**: 1D depthwise separable CNN with 64 channels, highly parameter-efficient
- **Total**: 45,570 parameters in 180 KB

The cascade architecture means the CNN only activates on non-silent frames, dramatically reducing power consumption on always-listening devices.

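Stage 1 can be sketched as a simple short-time-energy threshold. The model's actual gate is not published, so the frame length and threshold below are illustrative assumptions, not the shipped values:

```python
import numpy as np

def short_time_energy(frame: np.ndarray) -> float:
    """Mean squared amplitude of a single audio frame."""
    return float(np.mean(frame.astype(np.float64) ** 2))

def energy_gate(frames, threshold=1e-4):
    """Yield only the frames loud enough to justify running the CNN."""
    for i, frame in enumerate(frames):
        if short_time_energy(frame) >= threshold:
            yield i, frame

# Silence never reaches the CNN; only the speech-like frame passes.
silence = np.zeros(400, dtype=np.float32)
speech = 0.1 * np.random.default_rng(0).standard_normal(400).astype(np.float32)
passed = [i for i, _ in energy_gate([silence, speech, silence])]
print(passed)  # → [1]
```

Because the gate is a single mean-of-squares per frame, it costs a few hundred multiply-adds, orders of magnitude less than one CNN forward pass.
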
## Benchmark Results

### Classification (Static Test Set)

| | Count |
|---|---|
| True Positives | 180 |
| False Positives | 4 |
| True Negatives | 10,806 |
| False Negatives | 15 |

### Streaming Evaluation (200 s continuous audio)

| Metric | Result | Target | Status |
|--------|--------|--------|--------|
| False Positives | 0 | ≤8 | **PASS** |
| False Negatives | 1 | ≤8 | **PASS** |
| CNN Activations | 3 | n/a | Ultra-low power |

Only **3 CNN activations** in 200 seconds of streaming: the energy gate rejects 98.5% of frames before they reach the CNN.

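The streaming cascade measured above amounts to a loop that consults the gate before every CNN call. A minimal sketch, where `classify` is a hypothetical stand-in for the DS-CNN and the threshold is illustrative:

```python
import numpy as np

def stream_keyword_spotter(frames, classify, threshold=1e-4):
    """Run the two-stage cascade over a frame stream.

    `classify(frame) -> bool` stands in for the DS-CNN; the gate keeps
    silent frames from ever invoking it, which is where the power
    savings of the cascade come from.
    """
    cnn_activations, detections = 0, []
    for i, frame in enumerate(frames):
        if np.mean(frame ** 2) < threshold:
            continue  # rejected by the energy gate: zero CNN cost
        cnn_activations += 1
        if classify(frame):
            detections.append(i)
    return cnn_activations, detections

# Ten mostly silent frames with one loud frame at index 4.
frames = [np.zeros(400, dtype=np.float32) for _ in range(10)]
frames[4] = np.full(400, 0.05, dtype=np.float32)
activations, hits = stream_keyword_spotter(frames, classify=lambda f: True)
print(activations, hits)  # → 1 [4]
```

On a real device, counting `cnn_activations` the same way is a cheap proxy for duty cycle and hence average power.
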
## Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load the model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC-like features; the exact shape depends on your audio
# preprocessing. Typical: [batch, time_steps, n_mfcc].
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on dummy features. Dynamic dims appear as strings or
# None in ONNX metadata, so substitute 1 for anything non-integer.
features = np.random.randn(
    *[d if isinstance(d, int) else 1 for d in input_shape]
).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```

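The card does not ship the exact front end, so if you need placeholder features for experimentation, a rough FFT-based stand-in might look like the following. The frame length, hop, and band count are assumptions, not the model's true preprocessing:

```python
import numpy as np

def fft_features(audio, frame_len=400, hop=160, n_bins=40):
    """Log power-spectrum frames: a crude stand-in for an MFCC front end."""
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Pool FFT bins into n_bins coarse bands, then log-compress.
        bands = np.array_split(spectrum, n_bins)
        frames.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.stack(frames).astype(np.float32)

audio = np.random.randn(16000).astype(np.float32)  # 1 s at 16 kHz
feats = fft_features(audio)
print(feats.shape)  # → (98, 40)
```

For results matching the reported accuracy, use the same feature pipeline the model was trained with rather than this sketch.
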
## Hardware Targets

| Platform | Expected Latency | Power |
|----------|------------------|-------|
| ARM Cortex-M4 (STM32L4) | <15 ms | <1 mW (with energy gate) |
| ARM Cortex-M7 (STM32H7) | <5 ms | <2 mW |
| ESP32-S3 | <10 ms | <5 mW |
| Raspberry Pi Pico | <20 ms | <0.5 mW |

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.

## MLPerf Tiny Compliance

This model targets the **Keyword Spotting (KWS)** benchmark from [MLPerf Tiny](https://mlcommons.org/benchmarks/inference-tiny/):

- **Dataset**: Google Speech Commands v0.02
- **Task**: Streaming keyword detection
- **Target**: ≥95% accuracy with ≤8 FP and ≤8 FN in streaming
- **Result**: 99.83% accuracy, 0 FP, 1 FN (**all targets exceeded**)

## Training Details

- **Dataset**: Google Speech Commands v0.02 (65,000+ 1-second audio clips)
- **Wake word**: "marvin"
- **Architecture**: Energy-Gated 1D DS-CNN, 64 channels
- **Epochs**: 30
- **Hardware**: NVIDIA RTX 4090

## Use Cases

- **Smart home devices**: always-on wake word detection at <1 mW
- **Wearables**: hearing aids, fitness bands, smartwatches
- **Industrial IoT**: voice-activated controls in noisy environments
- **Automotive**: in-cabin voice trigger without cloud connectivity
- **Medical devices**: hands-free activation for clinical tools

## Citation

```bibtex
@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}
```

## License

Apache 2.0. Free to use in commercial and non-commercial projects.

## Links

- [ConstantOne AI](https://constantone.ai)
- [Constant Edge 0.5 (Sentiment)](https://huggingface.co/ConstantQJ/constant-edge-0.5): 1.46 MB sentiment analysis
- [API Documentation](https://constantone.ai/docs.html)