Constant Wake 0.5: 180 KB Spoken Wake Word Detection

A 180 KB keyword spotting model that detects the wake word "marvin" with 99.83% accuracy on the static test set and zero false positives in streaming evaluation. Built for microcontrollers.

Metric              Value
Test Accuracy       99.83%
Precision           97.83%
Recall              92.31%
F1                  0.950
Model Size          180 KB (ONNX)
Parameters          45,570
Streaming FP        0 (target: ≤8)
Streaming FN        1 (target: ≤8)
MLPerf Tiny Target  ≥95% accuracy (exceeded by 4.83 points)

Architecture

1D Depthwise Separable CNN (DS-CNN) with energy-gated cascade:

Audio Input
  → Energy Gate (silence rejection)
    → FFT Feature Extraction
      → 1D DS-CNN (64 channels)
        → Classification (wake / not-wake)
  • Stage 1: Energy-based silence gating (STE) rejects silence frames before any CNN computation
  • Stage 2: FFT feature extraction produces MFCC-like spectral features
  • Stage 3: 1D depthwise separable CNN, 64 channels, highly parameter-efficient
  • Total: 45,570 parameters in 180 KB
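The parameter efficiency of Stage 3 can be illustrated with a quick count. This is a sketch: the 64-channel width comes from the card, but the kernel size of 3 is an assumption (the card does not state it).

```python
def standard_conv1d_params(c_in, c_out, k):
    # Standard conv: every output channel mixes all input channels at every tap.
    return c_in * c_out * k

def ds_conv1d_params(c_in, c_out, k):
    # Depthwise: one k-tap filter per input channel.
    # Pointwise: a 1x1 conv that mixes channels.
    return c_in * k + c_in * c_out

c, k = 64, 3  # 64 channels from the card; kernel size 3 is assumed
print(standard_conv1d_params(c, c, k))  # 12288
print(ds_conv1d_params(c, c, k))        # 4288
```

For a 64-channel layer, the depthwise separable factorization needs roughly a third of the weights of a standard convolution, which is how the whole network fits in 45,570 parameters.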

The cascade architecture means the CNN only activates on non-silent frames, dramatically reducing power consumption on always-listening devices.
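A minimal sketch of the Stage 1 gate, using short-time energy over one audio frame. The threshold value is hypothetical; in practice it would be tuned to the device's noise floor.

```python
import numpy as np

def energy_gate(frame, threshold=1e-3):
    """Short-time energy (STE) gate: True means 'pass the frame to the CNN'."""
    ste = float(np.mean(np.asarray(frame, dtype=np.float32) ** 2))
    return ste > threshold

rng = np.random.default_rng(0)
silence = np.zeros(400, dtype=np.float32)                   # 25 ms frame at 16 kHz
speech = 0.1 * rng.standard_normal(400).astype(np.float32)  # speech-like noise
print(energy_gate(silence))  # False
print(energy_gate(speech))   # True
```

Because the gate is a single mean-of-squares over the frame, it costs a few hundred multiply-adds versus tens of thousands for a CNN pass.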

Benchmark Results

Classification (Static Test Set)

Outcome          Count
True Positives   180
False Positives  4
True Negatives   10,806
False Negatives  15
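The headline metrics can be reproduced directly from these counts; a quick sanity check:

```python
tp, fp, tn, fn = 180, 4, 10806, 15

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.2%}")   # 99.83%
print(f"precision {precision:.2%}")  # 97.83%
print(f"recall    {recall:.2%}")     # 92.31%
print(f"f1        {f1:.3f}")         # 0.950
```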

Streaming Evaluation (200s continuous audio)

Metric           Result  Target  Status
False Positives  0       ≤8      PASS
False Negatives  1       ≤8      PASS
CNN Activations  3       n/a     Ultra-low power

Only 3 CNN activations in 200 seconds of streaming: the energy gate rejects 98.5% of frames before they reach the CNN.
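A toy simulation of that streaming behavior on a synthetic stream. The framing (one 25 ms frame per second of audio) and the gate threshold are assumptions; the point is only that activation count equals the number of frames the gate passes.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len, n_frames = 400, 200  # assumed framing: one 25 ms frame per second, 200 s total

# Synthetic 200-frame stream: silence everywhere except 3 speech-like bursts.
frames = [np.zeros(frame_len, dtype=np.float32) for _ in range(n_frames)]
for i in (40, 90, 150):
    frames[i] = 0.1 * rng.standard_normal(frame_len).astype(np.float32)

def energy_gate(frame, threshold=1e-3):
    return float(np.mean(frame ** 2)) > threshold

# Only gated frames would ever reach the CNN.
cnn_activations = sum(energy_gate(f) for f in frames)
rejected = 1 - cnn_activations / n_frames
print(f"{cnn_activations} CNN activations; {rejected:.1%} of frames rejected")
# 3 CNN activations; 98.5% of frames rejected
```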

Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC features; the exact shape depends on your audio preprocessing.
# Typical: [batch, time_steps, n_mfcc]
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on dummy features (replace with real MFCCs);
# symbolic (string) dimensions in the shape are filled with 1.
features = np.random.randn(*[1 if isinstance(d, str) else d for d in input_shape]).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```
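The snippet above prints the raw model output. Turning that into a wake/not-wake decision might look like the sketch below, assuming the model emits a 2-class logit vector with the wake class at index 1; both the index and the threshold are assumptions, so check them against your export.

```python
import numpy as np

def wake_decision(logits, wake_index=1, threshold=0.9):
    """Softmax the 2-class output and threshold the wake probability.

    wake_index and threshold are assumptions; verify which output
    index corresponds to the wake class for your export.
    """
    z = np.asarray(logits, dtype=np.float64).ravel()
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return probs[wake_index] >= threshold, probs[wake_index]

detected, p = wake_decision([0.2, 4.1])
print(detected, round(p, 3))  # True 0.98
```

A high threshold trades a little recall for fewer spurious wakes, which matters on always-listening devices.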

Hardware Targets

Platform                 Expected Latency  Power
ARM Cortex-M4 (STM32L4)  <15 ms            <1 mW (with energy gate)
ARM Cortex-M7 (STM32H7)  <5 ms             <2 mW
ESP32-S3                 <10 ms            <5 mW
Raspberry Pi Pico        <20 ms            <0.5 mW

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.
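As a back-of-envelope illustration of why duty-cycling gets below a milliwatt: the per-stage power draws below are hypothetical (not measurements from this card); only the ~1.5% duty cycle comes from the streaming run.

```python
# Hypothetical per-stage power draws, chosen only for illustration:
gate_power_mw = 0.05    # always-on short-time-energy gate
cnn_power_mw = 5.0      # CNN inference burst

duty_cycle = 1 - 0.985  # gate passes ~1.5% of frames (from the streaming run)

avg_power_mw = gate_power_mw + duty_cycle * cnn_power_mw
print(f"average power ~{avg_power_mw:.3f} mW")  # ~0.125 mW
```

Even with a CNN burst two orders of magnitude hotter than the gate, the average stays dominated by the always-on gate.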

MLPerf Tiny Compliance

This model targets the Keyword Spotting (KWS) benchmark from MLPerf Tiny:

  • Dataset: Google Speech Commands v0.02
  • Task: Streaming keyword detection
  • Target: ≥95% accuracy with ≤8 FP and ≤8 FN in streaming
  • Result: 99.83% accuracy, 0 FP, 1 FN (all targets exceeded)

Training Details

  • Dataset: Google Speech Commands v0.02 (65,000+ 1-second audio clips)
  • Wake word: "marvin"
  • Architecture: Energy-Gated 1D DS-CNN, 64 channels
  • Epochs: 30
  • Hardware: NVIDIA RTX 4090

Use Cases

  • Smart home devices: always-on wake word detection at <1 mW
  • Wearables: hearing aids, fitness bands, smartwatches
  • Industrial IoT: voice-activated controls in noisy environments
  • Automotive: in-cabin voice trigger without cloud connectivity
  • Medical devices: hands-free activation for clinical tools

Citation

@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}

License

Apache 2.0: use freely in commercial and non-commercial projects.
