Okay Hermes Wake-Word Model

Compact ONNX wake-word model for detecting “Okay Hermes” in short local audio windows.

The current root-level v2 export is the WavLM-teacher → merged RepCNN-student model trained with the robust-v1 augmentation policy from wakeword-forge. Compared with the previous public artifact, v2 explicitly covers low-gain microphones, far-field listening, onset jitter, noise/reverb assets, and spectrogram masking.

Repository contents

Current root artifact filenames:

wakeword_v2.onnx
wakeword_v2.json
config.json (points to the v2 files)

Archived v1 artifact filenames live under archive/v1/ and include _v1 in their names.

archive/ — previous public model artifacts retained for compatibility/history with v1 filenames
wakeword_v2.onnx — current v2 deployable merged RepCNN ONNX model
wakeword_v2.json — v2 export metadata, robust-v1 augmentation details, checksum, and thresholds
config.json — canonical Hub config; points to wakeword_v2.onnx and wakeword_v2.json
README.md — model card and usage notes
LICENSE — Apache-2.0 license

What this model does

Takes a 3-second mono audio window.
Returns a direct probability that the wake phrase “Okay Hermes” is present.
Runs locally with ONNX Runtime.
Does not transcribe speech.
Does not identify speakers.
Does not send audio anywhere by itself.

Input

Name: waveform
Shape: (batch, time); use exactly 48,000 samples per row for a 3-second window at 16 kHz
Type: float32
Audio: mono PCM, 16 kHz, 3 seconds

Audio should be resampled to 16 kHz mono and padded or cropped to exactly 48,000 samples per window.
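
The padding and cropping step can be sketched as below. This is a minimal illustration, not part of the published tooling: the helper name fit_window is hypothetical, and the choices to zero-pad at the end and to keep the most recent samples when cropping are assumptions, since the model card does not say which side to pad or crop.

```python
import numpy as np

WINDOW_SAMPLES = 48_000  # 3 seconds at 16 kHz, as documented above


def fit_window(samples, length: int = WINDOW_SAMPLES) -> np.ndarray:
    """Return a float32 array of exactly `length` mono samples.

    Assumptions (not specified by the model card): short clips are
    zero-padded at the end, long clips keep their most recent samples.
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim != 1:
        raise ValueError("expected mono audio as a 1-D array")
    if samples.shape[0] >= length:
        # Crop: keep the last `length` samples.
        return samples[-length:]
    # Pad: copy the clip to the start of a zero-filled window.
    padded = np.zeros(length, dtype=np.float32)
    padded[: samples.shape[0]] = samples
    return padded
```

The result can be passed to the model as fit_window(audio)[None, :] to add the batch axis.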

Output

Name: score
Shape: (batch,)
Type: float32
Meaning: wake-word probability for each input window

The output is already a probability. Do not apply another sigmoid unless you have changed the exported graph.

Thresholds

Recommended deployment threshold:

0.6973556280136108

The trained EER threshold is lower:

0.5128782391548157

The deployment threshold is intentionally higher because the v2 model scores all-zero (silent) input at about 0.624 (0.624035120010376). That is above the EER threshold, so a trigger rule set at the EER threshold would fire on silence. Start with the recommended threshold, then raise it if you see false wakes or lower it if activations are missed in your own room and microphone setup.

Basic trigger rule:

trigger = score >= 0.6973556280136108

For always-on use, apply smoothing or debouncing across overlapping audio windows. A practical starting point is 3-second windows every 0.5 seconds with 1–2 consecutive windows above threshold.
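
The windowing and debouncing described above could look roughly like the following sketch. The names window_starts and Debouncer are hypothetical, and the cooldown period after a trigger is an added assumption (the text above only mentions consecutive windows); treat the defaults as starting points rather than tuned values.

```python
from typing import List

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 3 * SAMPLE_RATE   # 48,000 samples per window
HOP_SAMPLES = SAMPLE_RATE // 2     # 0.5-second hop = 8,000 samples
THRESHOLD = 0.6973556280136108     # recommended deployment threshold


def window_starts(total_samples: int) -> List[int]:
    """Start offsets of every full 3-second window at a 0.5-second hop."""
    return list(range(0, total_samples - WINDOW_SAMPLES + 1, HOP_SAMPLES))


class Debouncer:
    """Fire once after `required` consecutive scores reach the threshold.

    `cooldown` (an assumption, not from the model card) is the number of
    following windows to ignore so one utterance does not fire repeatedly.
    """

    def __init__(self, threshold: float = THRESHOLD,
                 required: int = 2, cooldown: int = 6) -> None:
        self.threshold = threshold
        self.required = required
        self.cooldown = cooldown
        self._streak = 0   # consecutive windows at or above threshold
        self._quiet = 0    # windows left in the cooldown period

    def update(self, score: float) -> bool:
        if self._quiet > 0:
            self._quiet -= 1
            self._streak = 0
            return False
        self._streak = self._streak + 1 if score >= self.threshold else 0
        if self._streak >= self.required:
            self._streak = 0
            self._quiet = self.cooldown
            return True
        return False
```

In use, each window's score is fed to update() in order, and the application acts only when update() returns True.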

Python example

import json
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

repo_id = "Neopabo/okay-hermes-repcnn-onnx"

# Fetch config.json first: it names the ONNX file, the input/output tensors, and the threshold,
# and it is the metadata file the Hub counts for download statistics.
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

model_path = hf_hub_download(repo_id=repo_id, filename=config["onnx_file"])  # wakeword_v2.onnx

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(model_path, sess_options=opts, providers=["CPUExecutionProvider"])

# Replace this with real mono 16 kHz audio, exactly 3 seconds long.
waveform = np.zeros((48000,), dtype=np.float32)

score = session.run(
    [config["output_name"]],
    {config["input_name"]: waveform[None, :]},
)[0][0]

trigger = float(score) >= float(config["recommended_threshold"])
print(float(score), trigger)
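
To replace the zero-filled placeholder with real audio, one option is to read a 16-bit PCM WAV file with the standard library, as sketched below. The helper name load_wav_16k_mono is hypothetical, and scaling int16 samples to the range [-1, 1] is an assumption: the model card specifies float32 mono PCM at 16 kHz but not the amplitude scale, so check it against your own recordings.

```python
import wave

import numpy as np


def load_wav_16k_mono(path: str) -> np.ndarray:
    """Read a 16-bit, 16 kHz, mono WAV file as float32 samples.

    Assumption: samples are scaled to [-1, 1]; the model card does not
    state the expected amplitude range.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getframerate() != 16_000:
            raise ValueError("expected 16 kHz audio; resample first")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())
    pcm = np.frombuffer(frames, dtype="<i2")  # little-endian int16
    return pcm.astype(np.float32) / 32768.0
```

The returned array still has to be padded or cropped to exactly 48,000 samples before it is passed to the session.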

Technical summary

Wake phrase: Okay Hermes
Version: v2
Backend: wavlm-repcnn
Inference model: merged / reparameterized RepCNN
Teacher model: microsoft/wavlm-base
Audio frontend: included in the ONNX graph; callers provide waveform audio directly
Output: direct probability named score
ONNX input shape: dynamic time axis named time; use 48,000 samples for the documented 3-second window
Recommended deployment threshold: 0.6973556280136108
Trained EER threshold: 0.5128782391548157
Evaluation EER: 0.0524218154080854
Zero-audio score: 0.624035120010376
Augmentation policy: robust-v1
ONNX size: 225757 bytes
ONNX SHA256: 647fee171d7d98fa1672ec26f65f68a135471aea34bbdac4ede34693b89a68e1

Robust-v1 augmentation coverage

The v2 training metadata records:

low-gain microphone attenuation and noise-floor simulation
far-field attenuation, room mix, and background SNR simulation
onset jitter
Gaussian/SNR noise, band-pass/stop/high-pass/low-pass filters
gain transitions, time masks, speed/pitch variation, polarity flips, clipping
spectrogram frequency/time masking and noise
public metadata counts for background-noise, room impulse response, short-noise, and truck-noise augmentation assets

No training audio or source datasets are included in this repository.

Checksum

wakeword_v2.onnx SHA256: 647fee171d7d98fa1672ec26f65f68a135471aea34bbdac4ede34693b89a68e1
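
A downloaded file can be checked against this digest with the standard library. The sketch below is one way to do it; the function names sha256_of and verify_model are illustrative, not part of the repository.

```python
import hashlib

# SHA256 of wakeword_v2.onnx as published above.
EXPECTED_SHA256 = "647fee171d7d98fa1672ec26f65f68a135471aea34bbdac4ede34693b89a68e1"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file at `path` matches the expected digest."""
    return sha256_of(path) == expected.lower()
```

For example, verify_model(model_path) can be called on the path returned by hf_hub_download before creating the inference session.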

Limitations

Wake-word detection only; not general speech recognition.
Not speaker verification or identity recognition.
False accepts and false rejects are possible.
Thresholds may need adjustment for each deployment environment.
No training audio or source datasets are included in this repository.

Privacy

The model runs locally and only processes audio supplied by the surrounding application. Privacy depends on how that application captures, buffers, logs, stores, or transmits audio.

License

Apache-2.0. See LICENSE.
