Constant Wake 0.5: 180 KB Spoken Wake Word Detection

A 180 KB keyword spotting model that detects the wake word "marvin" with 99.83% accuracy on the static test set and zero false positives in streaming evaluation. Built for microcontrollers.

Metric              Value
Test Accuracy       99.83%
Precision           97.83%
Recall              92.31%
F1                  0.950
Model Size          180 KB (ONNX)
Parameters          45,570
Streaming FP        0 (target: ≤8)
Streaming FN        1 (target: ≤8)
MLPerf Tiny Target  ≥95% accuracy (exceeded by 4.83 points)

Architecture

1D Depthwise Separable CNN (DS-CNN) with energy-gated cascade:

Audio Input
  → Energy Gate (silence rejection)
    → FFT Feature Extraction
      → 1D DS-CNN (64 channels)
        → Classification (wake / not-wake)
  • Stage 1: Energy-based silence gating (STE) rejects silence frames before any CNN computation
  • Stage 2: FFT feature extraction produces MFCC-like spectral features
  • Stage 3: 1D depthwise separable CNN, 64 channels, highly parameter-efficient
  • Total: 45,570 parameters in 180 KB
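The parameter efficiency of Stage 3 can be illustrated with a quick count. This is a sketch: the 64-channel width comes from the card, but the kernel size of 3 is an assumption (the card does not state it).

```python
def standard_conv1d_params(c_in, c_out, k):
    # Standard conv: every output channel mixes all input channels at every tap.
    return c_in * c_out * k

def ds_conv1d_params(c_in, c_out, k):
    # Depthwise: one k-tap filter per input channel.
    # Pointwise: a 1x1 conv that mixes channels.
    return c_in * k + c_in * c_out

c, k = 64, 3  # 64 channels from the card; kernel size 3 is assumed
print(standard_conv1d_params(c, c, k))  # 12288
print(ds_conv1d_params(c, c, k))        # 4288
```

For a 64-channel layer, the depthwise separable factorization needs roughly a third of the weights of a standard convolution, which is how the whole network fits in 45,570 parameters.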

The cascade architecture means the CNN only activates on non-silent frames, dramatically reducing power consumption on always-listening devices.
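A minimal sketch of the Stage 1 gate, using short-time energy over one audio frame. The threshold value is hypothetical; in practice it would be tuned to the device's noise floor.

```python
import numpy as np

def energy_gate(frame, threshold=1e-3):
    """Short-time energy (STE) gate: True means 'pass the frame to the CNN'."""
    ste = float(np.mean(np.asarray(frame, dtype=np.float32) ** 2))
    return ste > threshold

rng = np.random.default_rng(0)
silence = np.zeros(400, dtype=np.float32)                   # 25 ms frame at 16 kHz
speech = 0.1 * rng.standard_normal(400).astype(np.float32)  # speech-like noise
print(energy_gate(silence))  # False
print(energy_gate(speech))   # True
```

Because the gate is a single mean-of-squares over the frame, it costs a few hundred multiply-adds versus tens of thousands for a CNN pass.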

Benchmark Results

Classification (Static Test Set)

Outcome          Count
True Positives   180
False Positives  4
True Negatives   10,806
False Negatives  15
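The headline metrics can be reproduced directly from these counts; a quick sanity check:

```python
tp, fp, tn, fn = 180, 4, 10806, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.2%}")   # 99.83%
print(f"precision {precision:.2%}")  # 97.83%
print(f"recall    {recall:.2%}")     # 92.31%
print(f"f1        {f1:.3f}")         # 0.950
```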

Streaming Evaluation (200s continuous audio)

Metric           Result  Target  Status
False Positives  0       ≤8      PASS
False Negatives  1       ≤8      PASS
CNN Activations  3       n/a     Ultra-low power

Only 3 CNN activations in 200 seconds of streaming: the energy gate rejects 98.5% of frames before they reach the CNN.
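A toy simulation of that streaming behavior on a synthetic stream. The framing (one 25 ms frame per second of audio) and the gate threshold are assumptions; the point is only that activation count equals the number of frames the gate passes.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len, n_frames = 400, 200  # assumed framing: one 25 ms frame per second, 200 s total

# Synthetic 200-frame stream: silence everywhere except 3 speech-like bursts.
frames = [np.zeros(frame_len, dtype=np.float32) for _ in range(n_frames)]
for i in (40, 90, 150):
    frames[i] = 0.1 * rng.standard_normal(frame_len).astype(np.float32)

def energy_gate(frame, threshold=1e-3):
    return float(np.mean(frame ** 2)) > threshold

# Only gated frames would ever reach the CNN.
cnn_activations = sum(energy_gate(f) for f in frames)
rejected = 1 - cnn_activations / n_frames
print(f"{cnn_activations} CNN activations; {rejected:.1%} of frames rejected")
# 3 CNN activations; 98.5% of frames rejected
```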

Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC features; the exact shape depends on your audio preprocessing.
# Typical: [batch, time_steps, n_mfcc]
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on dummy features (replace with real MFCCs);
# symbolic (string) dimensions in the shape are filled with 1.
features = np.random.randn(*[1 if isinstance(d, str) else d for d in input_shape]).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```
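The snippet above prints the raw model output. Turning that into a wake/not-wake decision might look like the sketch below, assuming the model emits a 2-class logit vector with the wake class at index 1; both the index and the threshold are assumptions, so check them against your export.

```python
import numpy as np

def wake_decision(logits, wake_index=1, threshold=0.9):
    """Softmax the 2-class output and threshold the wake probability.

    wake_index and threshold are assumptions; verify which output
    index corresponds to the wake class for your export.
    """
    z = np.asarray(logits, dtype=np.float64).ravel()
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return probs[wake_index] >= threshold, probs[wake_index]

detected, p = wake_decision([0.2, 4.1])
print(detected, round(p, 3))  # True 0.98
```

A high threshold trades a little recall for fewer spurious wakes, which matters on always-listening devices.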

Hardware Targets

Platform                 Expected Latency  Power
ARM Cortex-M4 (STM32L4)  <15 ms            <1 mW (with energy gate)
ARM Cortex-M7 (STM32H7)  <5 ms             <2 mW
ESP32-S3                 <10 ms            <5 mW
Raspberry Pi Pico        <20 ms            <0.5 mW

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.
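As a back-of-envelope illustration of why duty-cycling gets below a milliwatt: the per-stage power draws below are hypothetical (not measurements from this card); only the ~1.5% duty cycle comes from the streaming run.

```python
# Hypothetical per-stage power draws, chosen only for illustration:
gate_power_mw = 0.05    # always-on short-time-energy gate
cnn_power_mw = 5.0      # CNN inference burst

duty_cycle = 1 - 0.985  # gate passes ~1.5% of frames (from the streaming run)

avg_power_mw = gate_power_mw + duty_cycle * cnn_power_mw
print(f"average power ~{avg_power_mw:.3f} mW")  # ~0.125 mW
```

Even with a CNN burst two orders of magnitude hotter than the gate, the average stays dominated by the always-on gate.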

MLPerf Tiny Compliance

This model targets the Keyword Spotting (KWS) benchmark from MLPerf Tiny:

  • Dataset: Google Speech Commands v0.02
  • Task: Streaming keyword detection
  • Target: ≥95% accuracy with ≤8 FP and ≤8 FN in streaming
  • Result: 99.83% accuracy, 0 FP, 1 FN (all targets exceeded)

Training Details

  • Dataset: Google Speech Commands v0.02 (65,000+ 1-second audio clips)
  • Wake word: "marvin"
  • Architecture: Energy-Gated 1D DS-CNN, 64 channels
  • Epochs: 30
  • Hardware: NVIDIA RTX 4090

Use Cases

  • Smart home devices: always-on wake word detection at <1 mW
  • Wearables: hearing aids, fitness bands, smartwatches
  • Industrial IoT: voice-activated controls in noisy environments
  • Automotive: in-cabin voice trigger without cloud connectivity
  • Medical devices: hands-free activation for clinical tools

Citation

@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}

License

Apache 2.0: use freely in commercial and non-commercial projects.
