Okay Hermes Wake-Word Model

Hugging Face downloads

Compact ONNX wake-word model for detecting “Okay Hermes” in short local audio windows.

This is the current WavLM-teacher → merged RepCNN-student export. It is intended for local assistants, desktop agents, and voice-enabled tools that need a lightweight wake trigger before starting a larger speech or reasoning pipeline.

Links

Repository contents

  • wakeword.onnx — deployable merged RepCNN ONNX model
  • wakeword.json — export metadata and recommended threshold
  • config.json — small Hub config for loaders and download statistics
  • README.md — model card and usage notes
  • LICENSE — Apache-2.0 license

What this model does

  • Takes a 3-second mono audio window.
  • Returns a direct probability that the wake phrase “Okay Hermes” is present.
  • Runs locally with ONNX Runtime.
  • Does not transcribe speech.
  • Does not identify speakers.
  • Does not send audio anywhere by itself.

Input

  • Name: waveform
  • Shape: (batch, time); use exactly 48,000 samples per row for a 3-second window at 16 kHz
  • Type: float32
  • Audio: mono PCM, 16 kHz, 3 seconds

Audio should be resampled to 16 kHz mono and padded or cropped to exactly 48,000 samples per window.

Output

  • Name: score
  • Shape: (batch,)
  • Type: float32
  • Meaning: wake-word probability for each input window

The output is already a probability. Do not apply another sigmoid unless you have changed the exported graph.

Recommended starting threshold:

0.6973556280136108

Basic trigger rule:

trigger = score >= 0.6973556280136108

For always-on use, apply smoothing or debouncing across overlapping audio windows. A practical starting point is 3-second windows every 0.5 seconds with 2 consecutive windows above threshold.

Python example

import json
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

repo_id = "Neopabo/okay-hermes-repcnn-onnx"

# Fetch config first so normal Hub usage touches a counted metadata file.
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

model_path = hf_hub_download(repo_id=repo_id, filename=config["onnx_file"])

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1
opts.inter_op_num_threads = 1
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(model_path, sess_options=opts, providers=["CPUExecutionProvider"])

# Replace this with real mono 16 kHz audio, exactly 3 seconds long.
waveform = np.zeros((48000,), dtype=np.float32)

score = session.run(
    [config["output_name"]],
    {config["input_name"]: waveform[None, :]},
)[0][0]

trigger = float(score) >= float(config["recommended_threshold"])
print(float(score), trigger)

Technical summary

  • Wake phrase: Okay Hermes
  • Backend: wavlm-repcnn
  • Inference model: merged / reparameterized RepCNN
  • Teacher model: microsoft/wavlm-base
  • Audio frontend: included in the ONNX graph; callers provide waveform audio directly
  • Output: direct probability named score
  • Recommended threshold: 0.6973556280136108
  • Export metadata EER: 0.0
  • ONNX size: 224807 bytes
  • ONNX SHA256: f022856f17916f5c7b2d8041f44308f889f292d92be12b71e1fe3ee49bd0a0fc

Training used hard-negative and partial-negative examples to reduce accidental activations. Field performance still depends on microphone quality, room acoustics, accents, background noise, and trigger/debounce logic.

Checksum

wakeword.onnx SHA256: f022856f17916f5c7b2d8041f44308f889f292d92be12b71e1fe3ee49bd0a0fc

Limitations

  • Wake-word detection only; not general speech recognition.
  • Not speaker verification or identity recognition.
  • False accepts and false rejects are possible.
  • Thresholds may need adjustment for each deployment environment.
  • No training audio or source datasets are included in this repository.

Privacy

The model runs locally and only processes audio supplied by the surrounding application. Privacy depends on how that application captures, buffers, logs, stores, or transmits audio.

License

Apache-2.0. See LICENSE.

Downloads last month
41
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support