---
license: apache-2.0
language:
- en
library_name: onnxruntime
pipeline_tag: audio-classification
tags:
- keyword-spotting
- wake-word
- edge-ai
- tinyml
- onnx
- microcontroller
- speech
- mlperf-tiny
- dscnn
datasets:
- speech_commands
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: constant-wake-0.5
  results:
  - task:
      type: audio-classification
      name: Keyword Spotting
    dataset:
      type: speech_commands
      name: Google Speech Commands v0.02
      split: test
    metrics:
    - type: accuracy
      value: 99.83
    - type: f1
      value: 0.950
    - type: precision
      value: 0.978
    - type: recall
      value: 0.923
---

# Constant Wake 0.5 — 180 KB Spoken Wake Word Detection

A **180 KB** keyword spotting model that detects the wake word "marvin" with **99.83% accuracy** and **zero false positives** in streaming evaluation. Built for microcontrollers.

| Metric | Value |
|--------|-------|
| **Test Accuracy** | 99.83% |
| **Precision** | 97.83% |
| **Recall** | 92.31% |
| **F1** | 0.950 |
| **Model Size** | 180 KB (ONNX) |
| **Parameters** | 45,570 |
| **Streaming FP** | 0 (target: ≤8) |
| **Streaming FN** | 1 (target: ≤8) |
| **MLPerf Tiny Target** | ≥95% accuracy — **exceeded by 4.83 points** |

## Architecture

**1D Depthwise Separable CNN (DS-CNN)** with an energy-gated cascade:

```
Audio Input → Energy Gate (silence rejection) → FFT Feature Extraction → 1D DS-CNN (64 channels) → Classification (wake / not-wake)
```

- **Stage 1**: Energy-based silence gating using short-time energy (STE) — rejects silent frames before any CNN computation
- **Stage 2**: FFT feature extraction — MFCC-like spectral features
- **Stage 3**: 1D depthwise separable CNN — 64 channels, highly parameter-efficient
- **Total**: 45,570 parameters in 180 KB

Because of the cascade, the CNN activates only on non-silent frames, dramatically reducing power consumption on always-listening devices.
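As a minimal sketch of Stage 1, a short-time energy gate is just a mean-square threshold on each audio frame. The function name, frame length (25 ms at 16 kHz), and threshold below are illustrative assumptions, not the model's actual parameters:

```python
import numpy as np

def energy_gate(frame: np.ndarray, threshold: float = 1e-3) -> bool:
    """Stage 1 sketch: pass a frame to the CNN only if its short-time
    energy exceeds a threshold. `frame` is float PCM in [-1, 1].
    The threshold value here is an assumption for illustration."""
    ste = np.mean(frame.astype(np.float32) ** 2)  # short-time energy
    return bool(ste > threshold)

# Near-zero samples are rejected; speech-level energy passes.
silence = np.zeros(400, dtype=np.float32)            # 25 ms @ 16 kHz
speech = 0.1 * np.random.randn(400).astype(np.float32)
print(energy_gate(silence))  # False
print(energy_gate(speech))   # True
```

Because this check is a handful of multiply-accumulates per frame, it costs essentially nothing compared to even a tiny CNN, which is what makes the sub-milliwatt always-on budget plausible.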
## Benchmark Results

### Classification (Static Test Set)

| | Count |
|---|---|
| True Positives | 180 |
| False Positives | 4 |
| True Negatives | 10,806 |
| False Negatives | 15 |

### Streaming Evaluation (200 s continuous audio)

| Metric | Result | Target | Status |
|--------|--------|--------|--------|
| False Positives | 0 | ≤8 | **PASS** |
| False Negatives | 1 | ≤8 | **PASS** |
| CNN Activations | 3 | — | Ultra-low power |

Only **3 CNN activations** in 200 seconds of streaming — the energy gate rejects 98.5% of frames before they reach the CNN.

## Quick Start

```python
import onnxruntime as ort
import numpy as np

# Load the model
session = ort.InferenceSession("sww_dscnn.onnx")

# Input: MFCC features; the exact shape depends on your audio preprocessing.
# Typical: [batch, time_steps, n_mfcc]
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Expected input: {input_name}, shape: {input_shape}")

# Run inference on random features (replace with real MFCCs).
# Dynamic dimensions (reported as strings) are set to 1.
features = np.random.randn(
    *[1 if isinstance(d, str) else d for d in input_shape]
).astype(np.float32)
output = session.run(None, {input_name: features})[0]
print(f"Output shape: {output.shape}")
```

## Hardware Targets

| Platform | Expected Latency | Power |
|----------|-----------------|-------|
| ARM Cortex-M4 (STM32L4) | <15 ms | <1 mW (with energy gate) |
| ARM Cortex-M7 (STM32H7) | <5 ms | <2 mW |
| ESP32-S3 | <10 ms | <5 mW |
| Raspberry Pi Pico | <20 ms | <0.5 mW |

The energy-gated cascade ensures the CNN runs only when speech energy is detected, enabling always-on listening at sub-milliwatt power budgets.
## MLPerf Tiny Compliance

This model targets the **Keyword Spotting (KWS)** benchmark from [MLPerf Tiny](https://mlcommons.org/benchmarks/inference-tiny/):

- **Dataset**: Google Speech Commands v0.02
- **Task**: Streaming keyword detection
- **Targets**: ≥95% accuracy, with ≤8 FP and ≤8 FN in streaming
- **Result**: 99.83% accuracy, 0 FP, 1 FN — **all targets exceeded**

## Training Details

- **Dataset**: Google Speech Commands v0.02 (65,000+ one-second audio clips)
- **Wake word**: "marvin"
- **Architecture**: energy-gated 1D DS-CNN, 64 channels
- **Epochs**: 30
- **Hardware**: NVIDIA RTX 4090

## Use Cases

- **Smart home devices** — always-on wake word detection at <1 mW
- **Wearables** — hearing aids, fitness bands, smartwatches
- **Industrial IoT** — voice-activated controls in noisy environments
- **Automotive** — in-cabin voice triggers without cloud connectivity
- **Medical devices** — hands-free activation for clinical tools

## Citation

```bibtex
@misc{constantone2026wake,
  title={Constant Wake: Energy-Gated Keyword Spotting for Microcontrollers},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-wake-0.5}
}
```

## License

Apache 2.0 — free to use in commercial and non-commercial projects.

## Links

- [ConstantOne AI](https://constantone.ai)
- [Constant Edge 0.5 (Sentiment)](https://huggingface.co/ConstantQJ/constant-edge-0.5) — 1.46 MB sentiment analysis
- [API Documentation](https://constantone.ai/docs.html)