---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: audio-classification
tags:
- keyword-spotting
- wake-word
- edge-ai
- tinyml
- onnx
- microcontroller
- speech
- mlperf-tiny
- dscnn
datasets:
- speech_commands
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: constant-wake-0.5
results:
- task:
type: audio-classification
name: Keyword Spotting
dataset:
type: speech_commands
name: Google Speech Commands v0.02
split: test
metrics:
- type: accuracy
value: 99.83
- type: f1
value: 0.950
- type: precision
value: 0.978
- type: recall
value: 0.923
---
# Constant Wake 0.5: 180 KB Spoken Wake Word Detection
A **180 KB** keyword spotting model that detects the wake word "marvin" with **99.83% accuracy** and **zero false positives** in streaming evaluation. Built for microcontrollers.
| Metric | Value |
|--------|-------|
| **Test Accuracy** | 99.83% |
| **Precision** | 97.83% |
| **Recall** | 92.31% |
| **F1** | 0.950 |
| **Model Size** | 180 KB (ONNX) |
| **Parameters** | 45,570 |
| **Streaming FP** | 0 (target: ≀8) |
| **Streaming FN** | 1 (target: ≀8) |
| **MLPerf Tiny Target** | ≥95% accuracy - **exceeded by 4.83 points** |
## Architecture
**1D Depthwise Separable CNN (DS-CNN)** with energy-gated cascade:
```
Audio Input
  → Energy Gate (silence rejection)
  → FFT Feature Extraction
  → 1D DS-CNN (64 channels)
  → Classification (wake / not-wake)
```
- **Stage 1**: Energy-based silence gating via short-time energy (STE) - rejects silent frames before any CNN computation
- **Stage 2**: FFT feature extraction - MFCC-like spectral features
- **Stage 3**: 1D Depthwise Separable CNN - 64 channels, highly parameter-efficient
- **Total**: 45,570 parameters in 180 KB
The cascade architecture means the CNN only activates on non-silent frames, dramatically reducing power consumption on always-listening devices.
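The Stage 1 gate can be sketched as a simple short-time-energy threshold. The frame length and threshold below are illustrative, not taken from the released model:

```python
import numpy as np

def energy_gate(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Return True when the frame's short-time energy (mean squared
    amplitude) exceeds the threshold, i.e. the frame should proceed
    to feature extraction and the CNN."""
    ste = float(np.mean(frame.astype(np.float32) ** 2))
    return ste > threshold

# Silence is rejected before any CNN work; a speech-level frame passes.
silence = np.zeros(400, dtype=np.float32)           # 25 ms at 16 kHz
speech = 0.1 * np.random.randn(400).astype(np.float32)
print(energy_gate(silence), energy_gate(speech))    # False True
```

Because this check is a handful of multiply-accumulates per frame, it costs orders of magnitude less than a CNN forward pass.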
## Benchmark Results
### Classification (Static Test Set)
| Outcome | Count |
|---|---|
| True Positives | 180 |
| False Positives | 4 |
| True Negatives | 10,806 |
| False Negatives | 15 |
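These counts are consistent with the headline numbers; re-deriving the metrics from them:

```python
# Confusion-matrix counts from the static test set above
tp, fp, tn, fn = 180, 4, 10_806, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.2%} {precision:.4f} {recall:.4f} {f1:.3f}")
# → 99.83% 0.9783 0.9231 0.950
```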
### Streaming Evaluation (200s continuous audio)
| Metric | Result | Target | Status |
|--------|--------|--------|--------|
| False Positives | 0 | ≀8 | **PASS** |
| False Negatives | 1 | ≀8 | **PASS** |
| CNN Activations | 3 | n/a | Ultra-low power |
Only **3 CNN activations** in 200 seconds of streaming: the energy gate rejects 98.5% of frames before they reach the CNN.
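The streaming setup can be sketched with a sliding window over continuous audio. The 16 kHz sample rate matches Speech Commands; the window, hop, and threshold are illustrative assumptions:

```python
import numpy as np

SAMPLE_RATE = 16_000    # Speech Commands audio is 16 kHz
WINDOW = SAMPLE_RATE    # 1-second analysis window (illustrative)
HOP = SAMPLE_RATE // 2  # 0.5-second hop (illustrative)

def stream_activations(audio: np.ndarray, threshold: float = 1e-3) -> int:
    """Count windows whose short-time energy passes the gate; only
    these windows would ever reach the CNN in the cascade."""
    activations = 0
    for start in range(0, len(audio) - WINDOW + 1, HOP):
        frame = audio[start:start + WINDOW]
        if float(np.mean(frame ** 2)) > threshold:
            activations += 1
    return activations

# 10 s of silence with a single 1 s burst of speech-level noise:
audio = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)
audio[3 * SAMPLE_RATE:4 * SAMPLE_RATE] = 0.1 * np.random.randn(SAMPLE_RATE)
print(stream_activations(audio))  # 3 (only windows overlapping the burst)
```

On mostly silent audio, almost every window exits at the gate, which is why so few CNN activations are observed over 200 seconds.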
## Quick Start
```python
import onnxruntime as ort
import numpy as np

# Load the model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC features; the exact shape depends on your audio preprocessing.
# Typical layout: [batch, time_steps, n_mfcc]
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on dummy features; ONNX dynamic dimensions may appear as
# strings or None, so replace any non-integer dimension with 1
features = np.random.randn(*[d if isinstance(d, int) else 1 for d in input_shape]).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```
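The card does not document the output format. Assuming the graph emits raw two-class logits ordered (not-wake, wake) - check your export before relying on this - a softmax converts them into a wake probability:

```python
import numpy as np

def wake_probability(logits: np.ndarray, wake_index: int = 1) -> float:
    """Softmax over the class logits, returning the wake-class probability.
    wake_index = 1 is an assumption; verify your model's label order."""
    z = logits - logits.max()               # subtract max for stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs[wake_index])

# Equal logits give a 0.5 wake probability in the two-class case.
print(wake_probability(np.array([0.0, 0.0])))  # 0.5
```

In a deployment you would compare this probability against a tuned threshold, trading false accepts against false rejects.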
## Hardware Targets
| Platform | Expected Latency | Power |
|----------|-----------------|-------|
| ARM Cortex-M4 (STM32L4) | <15ms | <1mW (with energy gate) |
| ARM Cortex-M7 (STM32H7) | <5ms | <2mW |
| ESP32-S3 | <10ms | <5mW |
| Raspberry Pi Pico | <20ms | <0.5mW |
The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.
## MLPerf Tiny Compliance
This model targets the **Keyword Spotting (KWS)** benchmark from [MLPerf Tiny](https://mlcommons.org/benchmarks/inference-tiny/):
- **Dataset**: Google Speech Commands v0.02
- **Task**: Streaming keyword detection
- **Target**: ≥95% accuracy with ≤8 FP and ≤8 FN in streaming
- **Result**: 99.83% accuracy, 0 FP, 1 FN - **all targets exceeded**
## Training Details
- **Dataset**: Google Speech Commands v0.02 (65,000+ 1-second audio clips)
- **Wake word**: "marvin"
- **Architecture**: Energy-Gated 1D DS-CNN, 64 channels
- **Epochs**: 30
- **Hardware**: NVIDIA RTX 4090
## Use Cases
- **Smart home devices** - always-on wake word detection at <1mW
- **Wearables** - hearing aids, fitness bands, smartwatches
- **Industrial IoT** - voice-activated controls in noisy environments
- **Automotive** - in-cabin voice trigger without cloud connectivity
- **Medical devices** - hands-free activation for clinical tools
## Citation
```bibtex
@misc{constantone2026wake,
title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
author={ConstantOne AI},
year={2026},
url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}
```
## License
Apache 2.0 - use freely in commercial and non-commercial projects.
## Links
- [ConstantOne AI](https://constantone.ai)
- [Constant Edge 0.5 (Sentiment)](https://huggingface.co/ConstantQJ/constant-edge-0.5) - 1.46 MB sentiment analysis
- [API Documentation](https://constantone.ai/docs.html)