---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: audio-classification
tags:
- keyword-spotting
- wake-word
- edge-ai
- tinyml
- onnx
- microcontroller
- speech
- mlperf-tiny
- dscnn
datasets:
- speech_commands
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: constant-wake-0.5
  results:
  - task:
      type: audio-classification
      name: Keyword Spotting
    dataset:
      type: speech_commands
      name: Google Speech Commands v0.02
      split: test
    metrics:
    - type: accuracy
      value: 99.83
    - type: f1
      value: 0.950
    - type: precision
      value: 0.978
    - type: recall
      value: 0.923
---

# Constant Wake 0.5 — 180 KB Spoken Wake Word Detection

A **180 KB** keyword spotting model that detects the wake word "marvin" with **99.83% accuracy** and **zero false positives** in streaming evaluation. Built for microcontrollers.

| Metric | Value |
|--------|-------|
| **Test Accuracy** | 99.83% |
| **Precision** | 97.83% |
| **Recall** | 92.31% |
| **F1** | 0.950 |
| **Model Size** | 180 KB (ONNX) |
| **Parameters** | 45,570 |
| **Streaming FP** | 0 (target: ≤8) |
| **Streaming FN** | 1 (target: ≤8) |
| **MLPerf Tiny Target** | ≥95% accuracy — **exceeded by 4.83 points** |

## Architecture

**1D Depthwise Separable CNN (DS-CNN)** with an energy-gated cascade:

```
Audio Input → Energy Gate (silence rejection) → FFT Feature Extraction → 1D DS-CNN (64 channels) → Classification (wake / not-wake)
```

- **Stage 1**: Energy-based silence gating using short-time energy (STE) — rejects silent frames before any CNN computation
- **Stage 2**: FFT feature extraction — MFCC-like spectral features
- **Stage 3**: 1D depthwise separable CNN — 64 channels, highly parameter-efficient
- **Total**: 45,570 parameters in 180 KB

Because of the cascade, the CNN activates only on non-silent frames, dramatically reducing power consumption on always-listening devices.
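As a minimal sketch of Stage 1, a short-time energy gate is just a mean-square threshold on each audio frame. The function name, frame length (25 ms at 16 kHz), and threshold below are illustrative assumptions, not the model's actual parameters:

```python
import numpy as np

def energy_gate(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Stage 1 sketch: pass a frame to the CNN only if its short-time
    energy exceeds a threshold. `frame` is float PCM in [-1, 1].
    The threshold value here is an assumption for illustration."""
    ste = np.mean(frame.astype(np.float32) ** 2)  # short-time energy
    return bool(ste > threshold)

# Near-zero samples are rejected; speech-level energy passes.
silence = np.zeros(400, dtype=np.float32)            # 25 ms @ 16 kHz
speech = 0.1 * np.random.randn(400).astype(np.float32)
print(energy_gate(silence))  # False
print(energy_gate(speech))   # True
```

Because this check is a handful of multiply-accumulates per frame, it costs essentially nothing compared to even a tiny CNN, which is what makes the sub-milliwatt always-on budget plausible.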
## Benchmark Results

### Classification (Static Test Set)

| | Count |
|---|---|
| True Positives | 180 |
| False Positives | 4 |
| True Negatives | 10,806 |
| False Negatives | 15 |

### Streaming Evaluation (200 s continuous audio)

| Metric | Result | Target | Status |
|--------|--------|--------|--------|
| False Positives | 0 | ≤8 | **PASS** |
| False Negatives | 1 | ≤8 | **PASS** |
| CNN Activations | 3 | — | Ultra-low power |

Only **3 CNN activations** in 200 seconds of streaming — the energy gate rejects 98.5% of frames before they reach the CNN.

## Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load the model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC features; the exact shape depends on your audio preprocessing.
# Typical: [batch, time_steps, n_mfcc]
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on random features (replace with real MFCCs).
# Dynamic dimensions (reported as strings) are set to 1.
features = np.random.randn(
    *[1 if isinstance(d, str) else d for d in input_shape]
).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```

## Hardware Targets

| Platform | Expected Latency | Power |
|----------|-----------------|-------|
| ARM Cortex-M4 (STM32L4) | <15 ms | <1 mW (with energy gate) |
| ARM Cortex-M7 (STM32H7) | <5 ms | <2 mW |
| ESP32-S3 | <10 ms | <5 mW |
| Raspberry Pi Pico | <20 ms | <0.5 mW |

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.
## MLPerf Tiny Compliance

This model targets the **Keyword Spotting (KWS)** benchmark from [MLPerf Tiny](https://mlcommons.org/benchmarks/inference-tiny/):

- **Dataset**: Google Speech Commands v0.02
- **Task**: Streaming keyword detection
- **Targets**: ≥95% accuracy, with ≤8 FP and ≤8 FN in streaming
- **Result**: 99.83% accuracy, 0 FP, 1 FN — **all targets exceeded**

## Training Details

- **Dataset**: Google Speech Commands v0.02 (65,000+ one-second audio clips)
- **Wake word**: "marvin"
- **Architecture**: energy-gated 1D DS-CNN, 64 channels
- **Epochs**: 30
- **Hardware**: NVIDIA RTX 4090

## Use Cases

- **Smart home devices** — always-on wake word detection at <1 mW
- **Wearables** — hearing aids, fitness bands, smartwatches
- **Industrial IoT** — voice-activated controls in noisy environments
- **Automotive** — in-cabin voice triggers without cloud connectivity
- **Medical devices** — hands-free activation for clinical tools

## Citation

```bibtex
@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}
```

## License

Apache 2.0 — free to use in commercial and non-commercial projects.

## Links

- [ConstantOne AI](https://constantone.ai)
- [Constant Edge 0.5 (Sentiment)](https://huggingface.co/ConstantQJ/constant-edge-0.5) — 1.46 MB sentiment analysis
- [API Documentation](https://constantone.ai/docs.html)